Posted by Taras_96 on 01/10/07 04:07
Hi,
Firstly, does the document imply that POST should be used over GET
because POST can specify the incoming character encoding (although it
says that some agents might get confused by the specification)? This of
course takes into account the fact that POST may, as a result, be used
incorrectly (it may be used for transactions that are idempotent).
> You probably mean GB2312, which is a national standard defined in the
> People's Republic of China. If you live under Chinese jurisdiction, you may
> need to check whether the standard or some other rule really imposes that
> encoding on your pages.
It's not conformance to standards I'm worried about. The ubiquitous
encoding in China is GB2312 - that's what I'm worried about. I've read
the page you linked before (a while ago), and I remember that this
paragraph caught my attention:
"In addition to these considerations, some users may be typing-in or
pasting-in text from an application that uses their local character
coding (practical examples being macRoman on a Mac; or MS-DOS CP850
being copied out of a DOS window on an MS Windows PC), into a text
field of a document that used the author's - different - character
encoding (let's say for the simplest example, iso-8859-1): the user
might then submit the form, disregarding that what they are seeing in
the text field is not what they intended to send. From anecdotal
evidence it appears that some folks analyzing survey responses expected
%xx-representations of 8-bit-coded characters, but sometimes got
clusters of %xx-representations which turned out to be utf-8 instead:
whether this would have been evident or not to the person doing the
submitting was unclear.
Another commonly observed behaviour on Windows platforms is using a
form which is in an iso-8859 coding, but the user pasting in characters
(such as clever-quotes, trademark, euro sign etc.) which only exist in
the corresponding Windows coding, e.g for Latin-1 the codings would be
respectively iso-8859-1 and Windows-1252; in the iso-8859 encodings,
these character positions do not represent displayable characters (they
are in a range reserved for control functions). Some browsers disregard
the mismatch and simply submit the character as the corresponding %xx
code in the range %80-%9F, as if the browser thought it was handling
the Windows coding instead: some replace these inappropriate characters
by some kind of useful (e.g clever-quote replaced by plain quote) or
useless (e.g all unrepresentable characters replaced by question-mark)
substitute; for MSIE5's surprising behaviour see later in this page. "
This implies to me that, at present, copying and pasting text (which
I'm assuming users will do) from, say, a Word document into a browser
text field can create problems. To avoid or mitigate these problems, I
was thinking of matching the form's encoding to the encoding that is
used in, say, Word documents, to minimise the risk of some kind of
conversion mistake. This is why I was interested to see
what encoding Windows uses, and seeing that it wasn't UTF-8, and GB2312
was mentioned as the standard in China (and as I mentioned, the
majority of websites in China are delivered using GB2312), I guessed
that the encoding used in Windows for Chinese characters might be
GB2312. Thus, if a Chinese user copied and pasted from Word (which, in
this hypothetical situation, is using GB2312) into a browser whose form
is encoded in GB2312, then the possibility of some kind of error
occurring is minimised. As you have noted, GB2312 has a couple of
problems. Firstly, it doesn't cover all of Unicode. For this I was
thinking of using GB18030, as this is a UTF (it can represent all of
Unicode), and is compatible with GBK, which is an extension of GB2312.
I am not sure about GB18030, as I
haven't found a clear reference to whether it is a code table or an
encoding (many sources refer to GB2312 as an encoding, including
Mozilla FF, even though it seems to be a code table), and the encoding
would have to be the same as GB2312 for characters that are present in
both repertoires, in the same way that UTF-8 and ASCII encode the
characters present in both sets identically. However, another problem
with using a GB character set (and associated encoding) is that PHP
does not support these encodings internally. To fix this I was going to
use PHP's HTTP input/output conversion functions, storing everything
internally as UTF-8, and only converting upon output.
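To make the compatibility question concrete, here is a rough sketch in Python (not PHP, purely for illustration) of the kind of check I have in mind. The helper names `same_bytes`, `can_encode` and `to_utf8` are made up for this example, and the sample strings are arbitrary; it relies only on the codecs that ship with Python and says nothing about how any particular browser or PHP build behaves.

```python
def same_bytes(text: str, *encodings: str) -> bool:
    """Return True if text encodes to identical bytes in every listed encoding."""
    return len({text.encode(enc) for enc in encodings}) == 1


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Convert submitted bytes to UTF-8 for internal storage (hypothetical helper)."""
    return raw.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    # Characters in the GB2312 repertoire: do the three GB encodings agree?
    print(same_bytes("中文", "gb2312", "gbk", "gb18030"))
    # The ASCII / UTF-8 analogy mentioned above.
    print(same_bytes("abc", "ascii", "utf-8"))
    # A character well outside GB2312 (an emoji, chosen arbitrarily).
    print(can_encode("😀", "gb2312"), can_encode("😀", "gb18030"))
    # Convert GB18030 input to UTF-8 for storage.
    print(to_utf8("中文".encode("gb18030"), "gb18030"))
```

If the first check holds for the characters I care about, a form served as GB18030 should yield the same bytes as a GB2312 form for shared characters, while still being able to carry anything else in Unicode.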
A problem with UTF-8 is that it isn't supported everywhere at the
moment (by some mobile phones, for instance). The risk of this is
higher in China, where the official standard is GB18030, and most
people seem to be using GB2312.
>
> "Windows" is a trade mark for a wide range of operating systems, which may
> each use different encodings. Why would that matter.
>
....
>
> So? Whatever that means in detail, why would it matter in HTML authoring?
>
See above
> That would be unsafe, since the accept-charset attribute is poorly
> documented, and there does not seem to be much reliable information on its
> support in browsers.
>
OK
>
> Modern Windows systems use internally UTF-16, no matter what encodings are
> used in particular programs. But this doesn't matter; what matters is what
> the input method produces and how the browser deals with it. In general,
> there is not much you can do about it as an author.
>
>
> The browser is supposed to do the conversion or, rather, the copy & paste
> functionality should handle this.
>
So if I copy and paste from a Windows document, or from a text document
encoded in UTF-16 for example, into a form whose encoding is UTF-8,
will:
a) the copy and paste function do the conversion
b) the browser do the conversion when the data is sent
c) the conversion not occur
?
> One reason for using UTF-8 is that ultimately users can produce _any_
> Unicode character if they just know how to do that. This means that they
> can, for example, insert characters that have no representation in GB2312.
> What happens then if the page's encoding (and hence the form's encoding) is
> GB2312? The specifications are silent. In practice, browsers tend to do odd
> things like insert &#number; references. You can handle them in your form
> handler, but it's easier to use UTF-8 so that the problem does not arise.
>
That's why I was going to use GB18030 (if its encoding is the same as
GB2312's for the characters they share)
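As a rough illustration of the &#number; behaviour you describe, the Python sketch below imitates it with the standard `xmlcharrefreplace` error handler. This is only an approximation of what a browser might do, not a description of any real browser; `browser_style_encode` and `restore_references` are hypothetical names, and the sample text is arbitrary.

```python
import re


def browser_style_encode(text: str, encoding: str = "gb2312") -> bytes:
    """Encode text, replacing unrepresentable characters with &#number; references."""
    return text.encode(encoding, errors="xmlcharrefreplace")


def restore_references(raw: bytes, encoding: str = "gb2312") -> str:
    """Decode submitted bytes and turn decimal &#number; references back into characters."""
    decoded = raw.decode(encoding)
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), decoded)


if __name__ == "__main__":
    submitted = browser_style_encode("中文😀")
    print(submitted)                      # the emoji becomes a numeric reference
    print(restore_references(submitted))  # the form handler undoes it
    # With GB18030 no substitution is needed at all.
    print("中文😀".encode("gb18030").decode("gb18030"))
```

One weakness of the form-handler approach is that it cannot tell a substituted reference apart from a user who literally typed "&#128512;" into the field, which is part of why an encoding that covers all of Unicode looks attractive.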
I may be on the wrong track with my ideas, but this is what I've pieced
together from the resources out there.
Taras