Posted by Jukka K. Korpela on 01/08/07 10:08
Taras_96 wrote:
> the official Chinese character encoding is GB,
You probably mean GB2312, which is a national standard defined in the
People's Republic of China. If you live under Chinese jurisdiction, you may
need to check whether the standard or some other rule really imposes that
encoding on your pages.
I doubt that, though. Standards are usually not enforced by laws.
> and I'm pretty sure that Windows uses GB encoding as well
"Windows" is a trade mark for a wide range of operating systems, which may
each use different encodings. Why would that matter?
> (since
> it is listed as a MAC in the Windows regional options, plus I tried to
> type some characters into notepad2 with the character encoding set on
> UTF-8 and all that came out was boxes).
So? Whatever that means in detail, why would it matter in HTML authoring?
> Because I want everything internal to the website to be in UTF-8, I
> intend on specifying the accept-charset property in my forms as UTF-8.
That would be unsafe, since the accept-charset attribute is poorly
documented, and there does not seem to be much reliable information on its
support in browsers.
The safest way is to make the page (containing the form) UTF-8 encoded,
expect the data to arrive in UTF-8 encoding, and check this using some
heuristics like a hidden field containing unusual characters.
See also "FORM submission and i18n",
http://ppewww.physics.gla.ac.uk/~flavell/charset/form-i18n.html
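As a rough illustration of the hidden-field check, here is a minimal Python sketch. The field name `_charset_check`, the sentinel string, and the function names are all hypothetical; nothing in HTML or in browsers defines them. The sketch assumes the page is served as UTF-8, carries the sentinel in a hidden field, and is submitted as ordinary application/x-www-form-urlencoded data.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical hidden field and sentinel value (assumptions for this sketch).
# The sentinel's bytes in UTF-8 differ from its bytes in GB2312.
CHECK_FIELD = "_charset_check"
SENTINEL = "\u4e2d\u6587\u20ac"  # two Chinese characters and the euro sign


def field_bytes(body, name):
    """Return the raw, percent-decoded bytes of one form field, or None."""
    for pair in body.split("&"):
        key, _, value = pair.partition("=")
        # In form data, "+" stands for a space.
        if unquote_to_bytes(key.replace("+", " ")) == name.encode("ascii"):
            return unquote_to_bytes(value.replace("+", " "))
    return None


def arrived_as_utf8(body):
    """True if the hidden field came back as the sentinel's UTF-8 bytes."""
    return field_bytes(body, CHECK_FIELD) == SENTINEL.encode("utf-8")


if __name__ == "__main__":
    # Simulate what a browser would send for a UTF-8 encoded page.
    body = CHECK_FIELD + "=" + quote(SENTINEL.encode("utf-8")) + "&name=x"
    print(arrived_as_utf8(body))
```

If the check fails, the handler knows only that the data is not in the expected encoding; guessing the actual encoding would need further heuristics.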
> What happens when someone either a) types in Chinese (which I assume
> is stored in memory/RAM as GB)
Modern Windows systems use UTF-16 internally, no matter what encodings are
used in particular programs. But this doesn't matter; what matters is what
the input method produces and how the browser deals with it. In general,
there is not much you can do about it as an author.
> or b) copies and pastes some Chinese
> characters from a document that does not use UTF-8 encoding and posts
> the form?
The browser is supposed to do the conversion or, rather, the copy & paste
functionality should handle this.
One reason for using UTF-8 is that ultimately users can produce _any_
Unicode character if they just know how to do that. This means that they
can, for example, insert characters that have no representation in GB2312.
What happens then if the page's encoding (and hence the form's encoding) is
GB2312? The specifications are silent. In practice, browsers tend to do odd
things like insert &#number; references. You can handle them in your form
handler, but it's easier to use UTF-8 so that the problem does not arise.
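To make the GB2312 case concrete, here is a small Python sketch of what a form handler might have to do. It assumes the browser submitted GB2312 bytes and replaced unrepresentable characters with decimal &#number; references; that is observed browser behaviour, not something a specification guarantees. The function names are hypothetical.

```python
import re

# Matches decimal references such as &#54620; (the form browsers tend to insert).
_NUMERIC_REF = re.compile(r"&#([0-9]+);")


def fits_gb2312(text):
    """True if every character of text has a representation in GB2312."""
    try:
        text.encode("gb2312")
        return True
    except UnicodeEncodeError:
        return False


def expand_numeric_refs(text):
    """Replace each decimal &#number; reference with the character it names."""
    def repl(match):
        code = int(match.group(1))
        # Leave references outside the Unicode range untouched.
        return chr(code) if code <= 0x10FFFF else match.group(0)
    return _NUMERIC_REF.sub(repl, text)


def decode_gb2312_field(raw):
    """Decode GB2312 form bytes, then expand any inserted references."""
    return expand_numeric_refs(raw.decode("gb2312"))
```

Note that the handler cannot tell a reference inserted by the browser from the same string typed literally by the user, which is one more reason the UTF-8 route is simpler.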
--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/