Reply to Re: Tidy using unicode does not validate — HTML

Posted by Jukka K. Korpela on 03/19/07 14:33

Scripsit Andy Dingley:

> There are two UTF-8 encodings: with and without a BOM at the start of
> the file.

No, adding a BOM at the start of a UTF-8 data stream does not turn it into
another encoding, any more than adding some other character does. As the
Unicode FAQ says, it's just a matter of UTF-8 encoded data starting with a
BOM:
http://unicode.org/faq/utf_bom.html#29
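
To make this concrete, here is a minimal sketch in Python (the language and the sample string are chosen only for illustration, they are not from the thread). It shows that the BOM is simply the character U+FEFF, and that data with and without it decode with the same UTF-8 codec.

```python
import codecs

# The BOM is the character U+FEFF; encoded in UTF-8 it is three octets.
bom_octets = "\ufeff".encode("utf-8")

# Illustrative sample text containing one non-ASCII character.
text = "<p>na\u00efve</p>"
plain = text.encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain

# Both byte strings decode with the very same "utf-8" codec; the second
# result merely has one extra leading character.
decoded_plain = plain.decode("utf-8")
decoded_with_bom = with_bom.decode("utf-8")

# Python's "utf-8-sig" codec is a convenience that drops a leading BOM.
decoded_sig = with_bom.decode("utf-8-sig")
```

Here `decoded_with_bom` equals `"\ufeff" + decoded_plain`, which is the whole difference: one more character in the data, not a different encoding.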

> With (sometimes described as "UTF-8Y" in some Windows tools) is
> _obviously_ UTF-8 and so is easier for capable tools to recognise and
> deal with unambiguously.

The practical point is that the UTF-8 encoded BOM is a sequence of octets
(EF BB BF) that is extremely unlikely to arise from anything other than
representing a BOM in UTF-8. So, yes, it is an almost certain and very simple
way of recognizing a file as UTF-8 encoded. If you open a UTF-8 file without
a BOM in a text editor, the poor editor has a hard time guessing the encoding
and may have to ask the user, who generally has no idea of character
encodings.
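
The kind of guessing an editor has to do could be sketched roughly as follows. This is only an assumption about how such a tool might work, not a description of any particular editor; the function name `guess_encoding` and the ISO-8859-1 fallback are hypothetical.

```python
import codecs

def guess_encoding(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Hypothetical helper: pick a codec name for raw file bytes."""
    # With a BOM the decision is trivial and practically certain.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Without one, the best we can do is check whether the octets happen
    # to be valid UTF-8. Pure ASCII data passes this test as well.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to a guess (or ask the user).
        return fallback
    return "utf-8"
```

Note that the second branch proves less than the first: data that passes the validity check is only probably UTF-8, and data that fails it could be in any 8-bit encoding, so the fallback is a pure guess.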

> However you should remember that files in ASCII, ISO-8859-* or UTF-8
> are all equal until you start using non-ASCII characters. If you add a
> BOM to a UTF-8 file, then it is no longer ASCII or ISO-8859-* at all,

That's part of the other side of the "BOM in UTF-8" coin, yes. Besides, if
your document ever gets processed by some simplistic software that expects
everything to be 8-bit characters, that software could get rather confused
by the leading BOM octets and not recognize the data as HTML at all. A UTF-8
encoded HTML document without a BOM can be processed smoothly by such
software, except of course that it cannot correctly interpret the _content_
(but it gets all the markup right, except perhaps CDATA attribute values).
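
A small sketch of such simplistic software, under the assumption that it treats every octet as one ISO-8859-1 character and expects the data to begin with markup. The function `looks_like_html` and the sample document are hypothetical.

```python
import codecs

def looks_like_html(data: bytes) -> bool:
    """Hypothetical naive check: skip ASCII whitespace, expect markup."""
    return data.lstrip(b" \t\r\n").startswith(b"<")

# Illustrative document with one non-ASCII character in the content.
doc = "<!DOCTYPE html><title>x</title><p>caf\u00e9</p>".encode("utf-8")
doc_with_bom = codecs.BOM_UTF8 + doc

# Without a BOM the markup is found; with one, the three leading octets
# hide it from the naive check.
found_without_bom = looks_like_html(doc)
found_with_bom = looks_like_html(doc_with_bom)

# Reading the octets as 8-bit characters keeps every tag intact but
# misinterprets the non-ASCII content.
as_latin1 = doc.decode("iso-8859-1")

# An ASCII-only document is octet-for-octet identical in ASCII and in
# UTF-8, but only as long as no BOM is added.
ascii_only = "<p>cafe</p>"
same_octets = ascii_only.encode("utf-8") == ascii_only.encode("ascii")
```

In this sketch `found_without_bom` is true and `found_with_bom` is false, while `as_latin1` still contains the `<p>` and `</p>` tags around a mangled "café".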

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
