 Posted by Chung Leong on 11/18/05 22:36 
Ewoud Dronkert wrote: 
> Chung Leong wrote: 
> 
> > PCRE validates the string before it runs the expression. 
> > If it isn't valid all the way through, then there's no match. 
> 
> OK, but aren't charsets like latin1 (8859-1) subsets of utf8, and us-ascii 
> of them? So those would also be considered utf8. 
 
Latin 1 is a subset of Unicode, true enough, with matching
codepoints. But when encoded as UTF-8, characters in the range U+0080 -
U+00FF become 2-byte sequences. So text in 8859-1 with accented letters
(or in Windows-1252 with curly quotes and such) won't be identified as
UTF-8. Text with characters only in the basic Latin range (i.e. ASCII)
is byte-for-byte identical to UTF-8.
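
To make that concrete, here is a rough Python sketch. Python's own UTF-8 decoder stands in for the validation PCRE does, so it only illustrates the principle and isn't PCRE's actual code. The helper name is_valid_utf8 is made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ASCII-only text: the 8859-1 bytes and the UTF-8 bytes are the same.
ascii_latin1 = "plain text".encode("latin-1")
ascii_utf8 = "plain text".encode("utf-8")

# One accented letter: a single byte in 8859-1, two bytes in UTF-8.
accented_latin1 = "café".encode("latin-1")  # b'caf\xe9'
accented_utf8 = "café".encode("utf-8")      # b'caf\xc3\xa9'

print(ascii_latin1 == ascii_utf8)        # same bytes either way
print(is_valid_utf8(accented_latin1))    # lone 0xE9 is not valid UTF-8
print(is_valid_utf8(accented_utf8))
```

The lone 0xE9 byte fails because in UTF-8 it announces a multi-byte sequence that never arrives.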
 
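It's of course possible to construct a text encoded in 8859-1, KOI8-R,
or whatever, that would appear as valid UTF-8. It'd be total gibberish
though. In a UTF-8 byte sequence, a byte with the top two bits on
(0xC0 and up) has to be followed by a byte with the top bit on and
bit 6 off (0x80 - 0xBF). In those 8-bit charsets the letters sit mostly
in 0xC0 - 0xFF, while 0x80 - 0xBF holds mostly control codes, symbols
and box-drawing characters. So every high-byte letter would have to be
followed by a non-letter, which doesn't happen in ordinary words.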
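
A small Python sketch of such a constructed text, again using Python's decoder as a stand-in for PCRE's check (the helper name is_valid_utf8 is invented for the example). The 8859-1 string is built so that each byte at 0xC0 or above is followed by a byte in 0x80 - 0xBF:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Gibberish when read as 8859-1, but every high byte pair is
# lead byte (0xC3) + continuation byte (0xA9).
constructed = "Ã©tÃ©".encode("latin-1")  # b'\xc3\xa9t\xc3\xa9'

# A real word in 8859-1: each 0xE9 is followed by 't' or nothing.
real_word = "été".encode("latin-1")      # b'\xe9t\xe9'

print(is_valid_utf8(constructed))        # passes as UTF-8
print(constructed.decode("utf-8"))       # reads as a different text
print(is_valid_utf8(real_word))          # fails validation
```

The text that passes validation is the one nobody would actually type in 8859-1.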
 