Posted by Jukka K. Korpela on 10/15/07 14:34
Scripsit Andy Dingley:
> On 15 Oct, 05:50, "Jukka K. Korpela" <jkorp...@cs.tut.fi> wrote:
>> Sounds like character encoding confusion. Anything that _looks_ like
>> "? " is probably something UTF-8 encoded (or distorted UTF-8)
>> interpreted by some 8-bit encoding.
>
> No, characters in a UTF-8 encoding interpreted by a tool using
> non-UTF-8 encoding will generally generate garbage characters that
> are still displayable
That's what I wrote about, using the (iso-8859-1 encoded) character Â
(letter A with circumflex accent) as in the original question. I wonder what
piece of software munged it, but it wasn't anything I was using.
> (the tool thinks that it received two good
> characters, they just don't mean anything).
Two, three or four.
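
To illustrate the count: a minimal Python 3 sketch, assuming the standard codecs, of what a tool sees when it reads UTF-8 octets as ISO-8859-1. The helper name `misread` and the three sample characters are made up for the illustration and do not come from the thread.

```python
def misread(text: str, wrong_encoding: str = "iso-8859-1") -> str:
    """Encode text as UTF-8, then decode the octets with an 8-bit encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


# Sample characters (an assumption): e with acute, euro sign, musical G clef.
# Their UTF-8 forms are two, three and four octets long respectively.
for ch in ["\u00e9", "\u20ac", "\U0001d11e"]:
    garbled = misread(ch)
    print(f"U+{ord(ch):04X} -> {len(garbled)} characters, "
          f"first is U+{ord(garbled[0]):04X}")
```

Each UTF-8 octet becomes one ISO-8859-1 character, so the length of the garbage equals the length of the UTF-8 sequence. With a different 8-bit encoding the decode may fail or give other characters.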
> Typically it's a pair of
> characters, the first of these is some variant of an accented
> "A"
Yes, at least when the 8-bit encoding is ISO-8859-1.
The combination "Â " also indicates some other error, since the octet
combination C2 20 must not appear in UTF-8 encoded data. We have little way
of knowing what happened, but I'd guess that 20 (which looks like space when
interpreted according to ISO-8859-1) was some octet in the range 80..9F,
maybe something that isn't allocated in windows-1252.
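
Both claims can be checked roughly with Python 3's standard codecs. This is only a sketch: `is_valid_utf8` is a hypothetical helper, the octet 85 is an arbitrary pick from the range 80..9F, and "not allocated" here means whatever Python's `cp1252` codec refuses to decode, which is an assumption about that codec rather than a statement about any particular mail or news software.

```python
def is_valid_utf8(octets: bytes) -> bool:
    """Return True if the octets decode as well-formed UTF-8."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# C2 is a lead octet and must be followed by a continuation octet (80..BF).
print(is_valid_utf8(bytes([0xC2, 0x20])))  # 20 is not a continuation octet
print(is_valid_utf8(bytes([0xC2, 0x85])))  # 85 is a continuation octet

# Octets in 80..9F that Python's cp1252 codec leaves undefined.
unallocated = []
for value in range(0x80, 0xA0):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        unallocated.append(f"{value:02X}")
print(unallocated)
```

The sketch does not show how such an octet would have turned into 20; that part remains a guess.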
> To get the unrecognizable character "?" displayed,
Which unrecognizable "?"? The question mark is recognizable, and so is the
character "Â", which is what was actually included in the original question.
--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/