Posted by malatestapunk on 11/14/46 11:57
Jerry Stuckle wrote:
> Petr Vileta wrote:
> > "Jerry Stuckle" <jstucklex@attglobal.net> wrote in
> > news:obmdnXcAL5-qLGPZnZ2dnUVZ_qidnZ2d@comcast.com...
> >
> >> Yes, the browsers convert the characters. But what does the character
> >> "â" mean in a Word document or a pdf? Is it a left or right quote? A
> >> bullet? Something else?
> >>
> >> That's what he needs to know, not the utf-8 codes.
> >>
> > If you see "â" in Word or Acrobat (reader) then this is a character. If
> > you see quote or bullet then this is a quote or bullet. Value of
> > character code is not important at this moment. IMHO user must cut text
> > in Word or Acrobat and paste in browser and at this moment all clipboard
> > content is converted to UTF-8 or UTF-16 as you have defined in html
> > (php) page.
> > I have many pages where users paste text into <textarea> from Word,
> > Acrobat, Corel and more applications and I have no problem with
> > converting characters because my pages are defined as unicode (UTF-16)
> > and my database too. For my pages is irrelevant if user is Czech,
> > English, German, Japanese, Russian or Martian :-)
> > My recommendation is: if you have problem with charsets, use unicode
> > (UTF-16) at all.
>
> The problem is it is not an "â" in Word or a pdf. It could be the
> internal code for a left or right double quote, a bullet, or whatever.
> The browser cannot convert these characters - it has no idea what an "â"
> really is. All it knows is the UTF-8 or whatever code is.
Actually, in my experience *and* in this context, this is not quite
true. When you select and copy a piece of text from
Acrobat/Word/Whatever, the text itself gets copied to the
clipboard, not its internal representation from the original
application (if that were the case, you could never paste it as
plain text in the first place). As Petr said:
> > If you see "â" in Word or Acrobat (reader) then this is a character. If
> > you see quote or bullet then this is a quote or bullet. Value of
> > character code is not important at this moment.
You can try it yourself by setting the encoding of your plain text
editor to UTF-8, then copying some bullets and other special characters
from Word and pasting them into your editor. Provided your font has
glyphs for those characters, you should see all of them displayed
properly (including bullets and such).
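If you'd rather check from PHP than from an editor, here is a minimal sketch of what those pasted characters look like once they are UTF-8. The helper name utf8_hex() is made up for this example, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// Sketch: show the UTF-8 bytes behind a few "special" characters.
// Assumes PHP 7+ for the "\u{...}" string escapes.

// Hypothetical helper: the bytes of $text as space-separated hex pairs.
function utf8_hex(string $text): string
{
    return implode(' ', str_split(bin2hex($text), 2));
}

$samples = [
    'bullet'             => "\u{2022}",
    'left double quote'  => "\u{201C}",
    'right double quote' => "\u{201D}",
];

foreach ($samples as $name => $char) {
    // Prints the name, the character itself and its bytes in hex.
    echo $name, ': ', $char, ' = ', utf8_hex($char), "\n";
}
```

All three of these start with the byte e2, and that byte read on its own as ISO-8859-1 or cp1252 is "â". That might be where a stray "â" comes from, but I'm only guessing here.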
Which brings up the issue you have with vi displaying garbage. Perhaps
your console just doesn't use a font that has the needed glyphs - I had
similar problems myself with vi (elvis, actually) and UTF-8/cp1250/cp1252
texts.
Anyway, this still doesn't solve your main problem - uploading these
characters to the database. Firstly, you should not rely on the META
tag alone to do the work - you should send an appropriate header. Put
something like this in your script before any of your output -
header("Content-Type: text/html; charset=utf-8").
You should do this because if you don't, your http server sends
this header for you, and that header may contain different charset
information. If charset information is present in the header,
browsers will disregard the charset set in the meta tag of the document
(as indicated in the html specifications). Also, if you use UTF-16, you
might want to send a BOM (byte order mark) before everything else (this
one I haven't tried personally, but it's recommended in the html
specification).
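To make the header part concrete, here is a rough sketch that sends the charset in the HTTP header and repeats the same value in the meta tag. The helper content_type() and the page content are invented for illustration, and "\u{2022}" again assumes PHP 7 or later.

```php
<?php
// Sketch: declare the charset in the HTTP header and repeat it in the meta tag.

// Hypothetical helper: builds the Content-Type value used in both places.
function content_type(string $charset = 'utf-8'): string
{
    return 'text/html; charset=' . $charset;
}

// header() only has an effect before any output has been sent.
if (!headers_sent()) {
    header('Content-Type: ' . content_type());
}

$type = content_type();
$bullet = "\u{2022}"; // sample non-ASCII character

echo <<<HTML
<html>
<head>
<meta http-equiv="Content-Type" content="{$type}">
<title>Charset test</title>
</head>
<body>
<p>Bullet: {$bullet}</p>
</body>
</html>

HTML;
```

Building the value in one place is just a convenience, so the header and the meta tag can't drift apart.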
Also, you should use the 'accept-charset' attribute on the form tag.
This additionally specifies what charset your script expects from the
form, and most browsers will do their best to honor it.
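Something along these lines is what I mean by the form attribute. Since browsers only "do their best", the sketch also checks on the server that what arrived really is UTF-8. The field name "body" and the helper is_valid_utf8() are made up for this example, and the check relies on preg_match() failing on malformed input when the /u modifier is used.

```php
<?php
// Sketch: a form that asks for UTF-8, plus a server-side sanity check.

// Hypothetical helper: true if $text is well-formed UTF-8.
// With the /u modifier, preg_match() returns false on invalid UTF-8 subjects.
function is_valid_utf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

// "body" is an assumed field name, matching the textarea below.
$body = $_POST['body'] ?? null;
if (is_string($body)) {
    echo is_valid_utf8($body)
        ? "<p>Input looks like valid UTF-8.</p>\n"
        : "<p>Input is not valid UTF-8, not storing it.</p>\n";
}

echo <<<HTML
<form method="post" action="" accept-charset="utf-8">
<textarea name="body" rows="5" cols="40"></textarea>
<input type="submit" value="Send">
</form>

HTML;
```

The check only tells you the bytes are well-formed UTF-8, not that the browser picked the characters the user intended, so treat it as a safety net rather than a fix.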
In my experience, you shouldn't rely on only one of these - it's best
that you use all three ways to specify the encoding (header, meta tag
and accept-charset attribute).
I hope this helps,
Vladislav