Posted by Jukka K. Korpela on 06/02/06 05:59
Alan J. Flavell <flavell@physics.gla.ac.uk> scripsit:
> Yes, there seem to be three bytes there: d4 aa f8. I can't help
> worrying that they started life as a utf-8 BOM (ef bb bf), and have
> been mapped through whatever misguided encoding conversion has
> scrambled the rest of the content.
Well spotted.
> Oh yes, A.Prilop is going to love this!! That's exactly what happens
> when one passes ef bb bf through Mr. Pirard's old Mac -> iso-8859-1
> conversion table from 1992.
Sounds quite plausible under the circumstances.
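
The BOM hypothesis can be checked in a few lines. This is only a sketch: Pirard's 1992 table is not reproduced here, so Python's standard "mac_roman" codec stands in for it (an assumption). For these three octets the stand-in yields the same bytes as quoted above, but it need not agree with Pirard's table elsewhere.

```python
import codecs

# The three octets of the UTF-8 byte order mark.
bom = bytes.fromhex("efbbbf")
is_utf8_bom = (bom == codecs.BOM_UTF8)

# Assumption: Python's "mac_roman" codec stands in for the old
# Mac -> iso-8859-1 table. Misread the BOM as Mac Roman text,
# then write that text out as iso-8859-1.
mangled = bom.decode("mac_roman").encode("iso-8859-1")

print(is_utf8_bom)       # True
print(mangled.hex(" "))  # d4 aa f8
```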
> Hmmm yes, if I take the first 6 bytes of the document title: ad fc 8b
> c4 ad bd, and run them back through Pirard's table, I get d0 9f d1 80
> d0 b8, which is the utf-8 representation of the three Cyrillic
> letters for "Pri" (I'm not going to try to put Cyrillic letters into
> this posting!). Going on a bit further, I make it out to be
> "Privetst...", does that make some kind of sense?
Surely, it's the start of a Russian word that means 'greeting'. (Of course,
using such words in a document title is a waste of precious real estate, but I
digress.)
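
For anyone who wants to confirm the recovered octets, a small sketch: it decodes d0 9f d1 80 d0 b8 as UTF-8 and prints code points and character names rather than the letters themselves, so the output stays in plain ASCII. The variable names are just illustrative.

```python
import unicodedata

# The six octets recovered by running the title bytes back
# through the conversion table, as quoted above.
recovered = bytes.fromhex("d09fd180d0b8")

# Two octets per Cyrillic letter in UTF-8, so six octets give three letters.
text = recovered.decode("utf-8")

for ch in text:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+041F CYRILLIC CAPITAL LETTER PE
# U+0440 CYRILLIC SMALL LETTER ER
# U+0438 CYRILLIC SMALL LETTER I
```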
> However, I think I'd prefer to start again from fresh materials!!
Me too. And using UTF-8 for Russian isn't particularly efficient. Using e.g.
windows-1251, you have one octet (byte) for each character. Using UTF-8, you
have one octet for each character in the Ascii range (including characters
used in HTML markup) but two octets for each Cyrillic letter. UTF-8 would be
fine if the document contained, say, a mixture of Russian and French.
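
To put numbers on the size difference, here is a rough sketch. The sample string is a made-up fragment (a six-letter Russian word inside a <p> element), not taken from the document under discussion; "cp1251" is Python's codec name for windows-1251.

```python
# Hypothetical sample: "<p>" + six Cyrillic letters + "</p>",
# written with escapes so the source stays plain ASCII.
sample = "<p>\u041f\u0440\u0438\u0432\u0435\u0442</p>"

as_cp1251 = sample.encode("cp1251")  # one octet per character
as_utf8 = sample.encode("utf-8")     # one octet per ASCII character, two per Cyrillic letter

print(len(sample))     # 13 characters
print(len(as_cp1251))  # 13 octets
print(len(as_utf8))    # 19 octets: 7 for the markup, 12 for the letters
```

The markup costs the same in both encodings; only the Cyrillic letters double in size.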
--
Yucca, http://www.cs.tut.fi/~jkorpela/