Reply to Re: How to determine if a file is UTF8 encoded? — PHP Programming Language

Posted by Chung Leong on 11/18/05 22:36

Ewoud Dronkert wrote:
> Chung Leong wrote:
>
> > PCRE validates the string before it runs the expression.
> > If it isn't valid all the way through, then there's no match.
>
> OK, but aren't charsets like latin1 (8859-1) subsets of utf8, and us-ascii
> of them? So those would also be considered utf8.

The Latin 1 is a subset of Unicode, true enough, with matching
codepoints. But when encoded as UTF-8, characters in the U+00F0 -
U+00FF will become 2 byte sequences. So text in 8859-1 with curly
quotes and such won't be identified as UTF-8. Text with characters only
in the basic Latin range (i.e. ASCII) would be identical to UTF-8.

It's of course possible to construct a text encoded in 8859-1, KOI8-R,
or whatever, that would appear as valid UTF-8. It'd be total gibberish
though. In a UTF-8 byte sequence, a byte with bit-6 on has to be
followed by a byte with bit-6 off. In a 8-bit charset, that means a
separation of at least 32 code points--too far apart to stay within the
alphabet.

[Back to original message]

Удаленная работа для программистов • Как заработать на Google AdSense • England, UK • статьи на английском • PHP MySQL CMS Apache Oscommerce • Online Business Knowledge Base • DVD MP3 AVI MP4 players codecs conversion help

Home • Search • Site Map • Set as Homepage • Add to Favourites

Сайт изготовлен в Студии Валентина Петручека —
изготовление и поддержка веб-сайтов, разработка программного обеспечения, поисковая оптимизация