Posted by Taras_96 on 01/26/07 12:36
> | >
> | > | This to me implies that the function need not know what the bytes
> | > | represent, it operates on the data as a raw byte stream.
> | >
> | > that's correct.
> | >
> |
> | Thus, by this definition, wouldn't strpos NOT be binary safe, since it
> | needs to know something about what is represented by the raw byte
> | stream? In particular, that the byte 0x00 represents the end of a
> | string.
>
> not really. that function does not rely on character encoding for the bytes
> being interpreted. look at strcoll...it does, and for that reason (data
> needs interpreting) it is not considered 'safe'.
>
But (IMO) the function does rely on bytes being interpreted, in
particular the byte 0. I can see how strcmp and strcoll differ in their
implementations (since, in strcoll, a lower byte value doesn't
necessarily imply a higher alphabetic precedence, as it does for
ASCII), but I'm still a bit lost as to how one is binary-safe and the
other isn't.
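
To pin down the difference I mean, here is a minimal C sketch comparing the two functions. The helper names (sign, byte_order, locale_order) are my own for illustration; strcmp, strcoll and setlocale are standard C. Which locales are installed is an assumption about the system.

```c
#include <locale.h>
#include <string.h>

/* Reduce a comparison result to -1, 0 or 1. */
static int sign(int x) { return (x > 0) - (x < 0); }

/* Byte-wise order: depends only on the byte values in the two strings. */
int byte_order(const char *a, const char *b)
{
    return sign(strcmp(a, b));
}

/* Locale order: depends on the LC_COLLATE rules of `locale_name`.
   Returns -2 if that locale is not available on this system. */
int locale_order(const char *locale_name, const char *a, const char *b)
{
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -2;
    return sign(strcoll(a, b));
}
```

In the "C" locale the two agree, so "B" (0x42) sorts before "a" (0x61) either way. Under a natural-language locale, if one is installed, strcoll may well put "a" before "B", but that depends on the system's locale data.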
> | Going back to my example, say we pass in strpos('a','cat') with the
> | strings encoded in UCS-2.
> | So, in terms of bytes, strpos would be passed in 0x00 0x16 as the first
> | parameter. Because the function imposes some meaning on specific bytes,
> | in particular 0x00, the function would conclude that the first
> | parameter was an empty string. Strpos can't blindly operate on the
> | bytes it receives, it must interpret them to find the end of strings.
>
> no...'00 16' (the letter 'a' in ucs-2) would be seen as ascii character 48
> followed by another 48, followed by the asc char for a space followed by the
> asc char for 1, etc. that's the literal string contents for 'a' in ucs-2. if
> you searched that literal string for 'a', you would find nothing. if you
> converted the string value of '00 16' from ucs-2 then you'd have the letter
> 'a'...and completely different search results. as for blindly searching for
> \0, that's just not what is happening.
>
Hmm, I would still have to disagree with you.
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 states that
"An ASCII or Latin-1 file can be transformed into a UCS-2 file by
simply inserting a 0x00 byte in front of every ASCII byte."
That's one example. All other explanations of UCS-2 have agreed with
this. Note that the '0x' just says that the following number is written
in hexadecimal, thus 0x10 actually is 16 in decimal.
'a' in ASCII is 0x61 (or decimal 97).
'a' in UCS-2 is 0x0061 (as said above, we just inserted a 0x00 byte in
front of the ASCII byte)
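
As a minimal sketch of that transformation in C, assuming big-endian UCS-2 with no byte-order mark (the function name ascii_to_ucs2be is made up for illustration, it is not a library call):

```c
#include <stddef.h>

/* Hypothetical helper: widen each ASCII byte to a big-endian UCS-2 code
   unit by putting a 0x00 byte in front of it.
   `out` must have room for 2 * n bytes. Returns the number of bytes written. */
size_t ascii_to_ucs2be(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[2 * i]     = 0x00;   /* high byte is always zero for ASCII */
        out[2 * i + 1] = in[i];  /* low byte is the original ASCII value */
    }
    return 2 * n;
}
```

For the single letter 'a' (0x61) this writes the two bytes 0x00 0x61, which is the layout I have been describing.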
What you're proposing reminds me of quoted-printable encoding. If it
was the case that 'a', when encoded in UCS-2, was stored in memory as
0x30 30 36 61, which is the ASCII encoding for the string '0061', then
I see how the function strpos would not assume that an empty string was
passed into the first parameter. But I don't think this is the case.
The same page (http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8) also
supports my view. It talks about how using UCS-2 would lead the
existing C functions in Unix to not work because 0x00 has a special
meaning in these functions (in particular, to indicate the end of a
string/array). As you have noted, C and PHP both view strings as arrays
of byte values that are null terminated. Thus PHP would have the same
problems described with using UCS-2 strings.
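
Here is a small C sketch of the problem as I understand it, assuming the UCS-2 bytes for 'a' are handed to a function that looks for a terminating zero. The helper names are mine; strlen and memchr are standard C. Whether PHP's strpos actually works the first way is exactly what I am unsure about.

```c
#include <stddef.h>
#include <string.h>

/* Terminator-based view: the length is found by scanning for the first
   0x00 byte, so data that starts with 0x00 looks like an empty string. */
size_t length_by_terminator(const unsigned char *buf)
{
    return strlen((const char *)buf);
}

/* Length-based view: the caller says how many bytes there are, so a
   0x00 byte is just data. Returns the offset of `byte`, or -1 if absent. */
long find_byte(const unsigned char *buf, size_t len, unsigned char byte)
{
    const unsigned char *p = memchr(buf, byte, len);
    return p ? (long)(p - buf) : -1;
}
```

With the bytes 0x00 0x61 0x00, length_by_terminator reports 0, while find_byte with an explicit length of 2 finds 0x61 at offset 1. If I understand it correctly, the first style is the one that breaks on UCS-2 data, and the second treats 0x00 as ordinary data.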
> what is it that you're trying to do. perhaps i can give an example that will
> work and clear up your questions at the same time.
All I'm trying to do is understand what is meant by a function being
'binary-safe'. I initially got onto the topic because I need to write a
Chinese website in PHP, but I think my curiosity has diverted me from
my original course. A clear and concise definition of what a
binary-safe function is, along with a few examples of functions that
are and aren't binary-safe, would be great.
>
> your first example and this one, as far as strings go, are completely
> different. they both, however, are interpreted one character at a time. in
> this case, a 0 followed by an x, two more zeros, a 1, a 6, then two more
> zeros. the string has no particular meaning. php does not know that it is a
> particular encoded representation of data (such as ucs-2). you could likewise
> represent 0x001600 in octal format and php would be equally unaware of the
> string's particular meaning.
>
The 0x00 16 00 was supposed to be 0x00 61 00. With the '0x' I'm
indicating that the numbers I am writing are in hex. Looking at it
another way:
first_parameter[0] == 0
first_parameter[1] == 97 (in decimal)
first_parameter[2] == 0 (the null termination)
> this is why you must somehow tell php that a string is to be interpreted a
> certain way...such that the value would then become (or be seen as) the
> letter 'a'. make sense?
This makes sense. What doesn't is the definition(s) of binary-safe :)
Taras