Posted by Chung Leong on 11/18/05 22:36
Ewoud Dronkert wrote:
> Chung Leong wrote:
>
> > PCRE validates the string before it runs the expression.
> > If it isn't valid all the way through, then there's no match.
>
> OK, but aren't charsets like latin1 (8859-1) subsets of utf8, and us-ascii
> of them? So those would also be considered utf8.
Latin-1 is a subset of Unicode, true enough, with matching
codepoints. But when encoded as UTF-8, characters in the range U+0080 -
U+00FF become 2-byte sequences. So text in 8859-1 with accented
letters, or Windows-1252 text with curly quotes and such, won't be
identified as UTF-8. Text with characters only in the basic Latin range
(i.e. ASCII) is byte-for-byte identical to its UTF-8 encoding.
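
To make the point above concrete, here is a small sketch in Python (not the PCRE check itself; the helper name looks_like_utf8 is made up for illustration). It assumes that a strict UTF-8 decode is a fair stand-in for "valid all the way through":

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict mode: any invalid byte raises
        return True
    except UnicodeDecodeError:
        return False


ascii_text = "plain text".encode("ascii")
latin1_text = "caf\u00e9".encode("iso-8859-1")  # e-acute is the single byte 0xE9
utf8_text = "caf\u00e9".encode("utf-8")         # e-acute becomes 0xC3 0xA9

print(looks_like_utf8(ascii_text))   # True: ASCII is already valid UTF-8
print(looks_like_utf8(latin1_text))  # False: 0xE9 is a lead byte with nothing after it
print(looks_like_utf8(utf8_text))    # True
print("\u0080".encode("utf-8"))      # b'\xc2\x80': two bytes from U+0080 upward
```

The same idea should carry over to any strict decoder; only the ASCII range survives unchanged.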
It's of course possible to construct a text encoded in 8859-1, KOI8-R,
or whatever, that would appear as valid UTF-8. It'd be total gibberish
though. In a UTF-8 byte sequence, a byte with bits 7 and 6 on (0xC0 and
up) has to be followed by a byte with bit 7 on and bit 6 off (0x80 -
0xBF). In an 8-bit charset, that means every letter from the upper block
would have to be followed by a character from the lower block, which
holds mostly control codes and symbols--not something that happens in
ordinary text.
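
A rough Python sketch of that byte rule, for illustration only. The function names are invented here, and the check is a necessary condition rather than a full UTF-8 validator (it ignores sequence lengths, overlong forms and the bytes that can never appear in UTF-8):

```python
def is_lead(b: int) -> bool:
    return b & 0xC0 == 0xC0  # bit pattern 11xxxxxx


def is_continuation(b: int) -> bool:
    return b & 0xC0 == 0x80  # bit pattern 10xxxxxx


def obeys_lead_rule(data: bytes) -> bool:
    """Every 11xxxxxx byte must be followed by a 10xxxxxx byte."""
    for i, b in enumerate(data):
        if is_lead(b):
            if i + 1 >= len(data) or not is_continuation(data[i + 1]):
                return False
    return True


# Ordinary Latin-1 text: the accented letter is followed by a plain ASCII letter.
ordinary = "na\u00efve".encode("iso-8859-1")
print(obeys_lead_rule(ordinary))  # False

# Latin-1 text built to pass: A-tilde, copyright sign, "t", A-tilde, copyright sign.
accidental = "\u00c3\u00a9t\u00c3\u00a9".encode("iso-8859-1")
print(obeys_lead_rule(accidental))               # True
print(accidental.decode("utf-8") == "\u00e9t\u00e9")  # True: it reads as a real word only as UTF-8
```

Read as Latin-1, the second string is a letter-symbol-letter-letter-symbol jumble, which is the gibberish described above.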