Posted by Jukka K. Korpela on 03/19/07 14:33
Scripsit Andy Dingley:
> There are two UTF-8 encodings: with and without a BOM at the start of
> the file.
No, adding a BOM at the start of a UTF-8 data stream does not turn it into
another encoding, any more than adding some other character does. As the
Unicode FAQ says, it's just a matter of UTF-8 encoded data starting with a
BOM:
http://unicode.org/faq/utf_bom.html#29
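To illustrate the point, here is a minimal Python sketch (the variable names are made up for the example). It shows that the "BOM" is nothing more than the character U+FEFF encoded in UTF-8 like any other character. Python's "utf-8-sig" codec is a convenience that strips a leading U+FEFF on decoding; it is not a separate encoding.

```python
import codecs

# U+FEFF encoded in UTF-8 is the three-octet sequence EF BB BF.
bom_octets = "\ufeff".encode("utf-8")

# The same text, with and without a leading U+FEFF.
plain = "abc".encode("utf-8")
with_bom = "\ufeffabc".encode("utf-8")

# Decoding as ordinary UTF-8 keeps U+FEFF as the first character.
decoded_raw = with_bom.decode("utf-8")

# The "utf-8-sig" codec drops a leading U+FEFF if one is present.
decoded_sig = with_bom.decode("utf-8-sig")
```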
> With (sometimes described as "UTF-8Y" in some Windows tools) is
> _obviously_ UTF-8 and so is easier for capable tools to recognise and
> deal with unambiguously.
The practical point is that the UTF-8 encoded BOM is a sequence of octets that
is extremely unlikely to arise from anything other than representing a BOM in
UTF-8. So, yes, it is an almost certain and very simple way of recognizing
a file as UTF-8 encoded. If you use a text editor for a UTF-8 file without
a BOM, the poor editor has a hard time guessing the encoding and it may have
to ask the user, who generally has no idea of character encodings.
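A rough sketch of the kind of check an editor could make, in Python. The function names and the ISO-8859-1 fallback are assumptions for the example, not a description of what any particular editor does.

```python
import codecs

def starts_with_utf8_bom(data: bytes) -> bool:
    # True if the data begins with the octets EF BB BF.
    return data.startswith(codecs.BOM_UTF8)

def decode_guess(data: bytes) -> str:
    # With a BOM the decision is easy: treat the data as UTF-8, drop the BOM.
    if starts_with_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # Without a BOM we can only guess. Assumed heuristic: try UTF-8 first,
    # and fall back to ISO-8859-1 if the octets are not valid UTF-8.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("iso-8859-1")
```

Note that the fallback branch is exactly the guessing that a BOM makes unnecessary: octets that happen to be valid UTF-8 might still have been meant as something else.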
> However you should remember that files in ASCII, ISO-8859-* or UTF-8
> are all equal until you start using non-ASCII characters. If you add a
> BOM to a UTF-8 file, then it is no longer ASCII or ISO-8859-* at all,
That's part of the other side of the "BOM in UTF-8" coin, yes. Besides, if
a document with a BOM ever gets processed by some simplistic software that
expects everything to be 8-bit characters, that software could get rather
confused and not recognize the data as HTML at all. A UTF-8 encoded HTML
document without a BOM can be processed smoothly by such software, except of
course that it cannot correctly interpret the _content_ (but it gets all the
markup right, except perhaps CDATA attribute values).
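A small Python sketch of the effect, under the assumption that the "simplistic software" simply treats every octet as one ISO-8859-1 character (the function naive_view and the sample markup are invented for the example):

```python
import codecs

html = "<p>caf\u00e9</p>"          # sample markup with one non-ASCII character
without_bom = html.encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

def naive_view(data: bytes) -> str:
    # Hypothetical 8-bit software: one octet, one ISO-8859-1 character.
    return data.decode("iso-8859-1")

# Without a BOM the tags survive intact; only the content is misread.
naive_without = naive_view(without_bom)

# With a BOM the data no longer starts with "<" but with three stray characters.
naive_with = naive_view(with_bom)
```

Here naive_without still begins with "<p>" and ends with "</p>", whereas naive_with begins with the three characters that the octets EF BB BF denote in ISO-8859-1.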
--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/