Re: How to determine if a file is UTF8 encoded?

Posted by Chung Leong on 11/18/05 22:36

Ewoud Dronkert wrote:
> Chung Leong wrote:
>
> > PCRE validates the string before it runs the expression.
> > If it isn't valid all the way through, then there's no match.
>
> OK, but aren't charsets like latin1 (8859-1) subsets of utf8, and us-ascii
> of them? So those would also be considered utf8.

Latin 1 is a subset of Unicode, true enough, with matching
codepoints. But when encoded as UTF-8, characters in the range U+0080 -
U+00FF become 2-byte sequences. So text in 8859-1 with accented
letters and such (or curly quotes, in Windows-1252) won't be identified
as UTF-8. Text with characters only in the basic Latin range (i.e.
ASCII) would be identical to UTF-8.
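
To make that concrete, here is a rough sketch in Python rather than PHP/PCRE. The helper name is_valid_utf8 is made up for illustration, and it relies on Python's strict UTF-8 decoder instead of the PCRE check discussed above, so treat it as an analogy, not the same mechanism.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Pure ASCII text is byte-for-byte the same in Latin 1 and UTF-8.
ascii_bytes = "plain text".encode("latin-1")

# The same word encoded two ways: one byte for e-acute in Latin 1,
# two bytes in UTF-8.
latin1_bytes = "café".encode("latin-1")  # b'caf\xe9'
utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'

print(is_valid_utf8(ascii_bytes))   # ASCII passes
print(is_valid_utf8(latin1_bytes))  # lone 0xE9 is a lead byte with nothing after it
print(is_valid_utf8(utf8_bytes))    # proper 2-byte sequence passes
```

The Latin 1 version fails because 0xE9 announces a multi-byte sequence that never arrives.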

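It's of course possible to construct a text encoded in 8859-1, KOI8-R,
or whatever, that would appear as valid UTF-8. It'd be total gibberish
though. In a UTF-8 byte sequence, a lead byte (bits 7 and 6 both on,
0xC0 and up) has to be followed by a continuation byte (bit 7 on, bit 6
off, 0x80 - 0xBF). In an 8-bit charset, that means every high letter
would have to be followed by a byte from the 0x80 - 0xBF range, which in
8859-1 and KOI8-R holds almost nothing but control codes, symbols, and
box-drawing characters, not the alphabet.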
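
A small Python sketch of the gibberish case, assuming nothing beyond the standard codecs. The function byte_role is a name invented here, and it classifies bytes by bit pattern only (a few values such as 0xC0 or 0xFF never occur in valid UTF-8 at all).

```python
def byte_role(b: int) -> str:
    """Hypothetical helper: classify a byte by its top bits only."""
    if b < 0x80:
        return "ascii"         # 0xxxxxxx
    if b < 0xC0:
        return "continuation"  # 10xxxxxx (bit 6 off)
    return "lead"              # 11xxxxxx (bit 6 on)


# Real Latin 1 text: the accented letters are lead bytes with no
# continuation byte after them, so this is not valid UTF-8.
real_text = "été".encode("latin-1")
print([byte_role(b) for b in real_text])

# Latin 1 text built to look like UTF-8: each capital A-tilde (0xC3)
# is followed by the copyright sign (0xA9). Nobody writes this on purpose.
gibberish = "Ã©tÃ©".encode("latin-1")
print([byte_role(b) for b in gibberish])
print(gibberish.decode("utf-8"))
```

The second string only validates because every letter from the upper half is paired with a symbol from the 0x80 - 0xBF range, which is exactly what normal prose doesn't do.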
