Posted by Carl Furst on 05/11/05 15:32
Yeah, the solution I use was posted in the user comments on the strtr
manual page. strtr is also documented as a better solution than str_replace,
with the one caveat that it will only try to change a character once.
Some of the hex codes on that page don't really work, though, because an
MS Word dash (hex: 0x96 in windows-1252) does not arrive as that same byte
on Linux. So if you try to scrub for 0x96 on the Linux side, it won't find it.
Thanks!
Carl
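For anyone reading this later, here is a minimal sketch of the strtr approach described above. It assumes the form data arrives as UTF-8, so it matches the three-byte sequences rather than single windows-1252 bytes such as 0x96. The function name scrub_word_chars() and the choice of ASCII replacements are made up for illustration, and the map covers only a handful of common Word characters.

```php
<?php
// Hypothetical helper: replace a few UTF-8 encoded "smart" characters
// from MS Word with plain ASCII equivalents. Assumes $text is UTF-8.
function scrub_word_chars(string $text): string
{
    $map = [
        "\xE2\x80\x98" => "'",    // U+2018 left single quote
        "\xE2\x80\x99" => "'",    // U+2019 right single quote
        "\xE2\x80\x9C" => '"',    // U+201C left double quote
        "\xE2\x80\x9D" => '"',    // U+201D right double quote
        "\xE2\x80\x93" => '-',    // U+2013 en dash (0x96 in windows-1252)
        "\xE2\x80\x94" => '--',   // U+2014 em dash (0x97 in windows-1252)
        "\xE2\x80\xA6" => '...',  // U+2026 ellipsis (0x85 in windows-1252)
    ];

    // With an array argument, strtr matches the longest keys first and
    // never re-scans text it has already replaced.
    return strtr($text, $map);
}

echo scrub_word_chars("\xE2\x80\x9CHello\xE2\x80\x9D \xE2\x80\x93 world"), "\n";
```

The single bytes (0x91 to 0x97) are deliberately left out of the map: in UTF-8 input those values also occur as continuation bytes inside other characters, so replacing them on their own could corrupt valid text.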
-----Original Message-----
From: Richard Lynch [mailto:ceo@l-i-e.com]
Sent: Wednesday, May 11, 2005 2:14 AM
To: Carl Furst
Cc: php-general@lists.php.net
Subject: Re: [PHP] Strange characters
On Tue, May 10, 2005 9:43 pm, Carl Furst said:
> I have a question about an odd phenomenon. It doesn't have much to do with
> PHP except that I used strtr to solve it, and it may be that the problem is
> being caused by a setting in PHP, but I would like to get some more
> background info as to why this is happening.
>
> On a typical Windows system, most applications use the windows-1252
> character set. Linux typically uses UTF-8, an encoding of Unicode. The
> former is a single-byte (8 bit) set; the latter uses one to four bytes
> per character.
>
> Well I have a form on a website that has to be able to take in text from
> MSWord and Notepad and the like. If someone has been using "Autoformatting"
> in MS Word, the "special characters" get translated into a UTF-8
> equivalent.
> What's odd is that these 8 bit windows characters become 24 bit
> combinations, I think. When I look at the characters in hex they are
> represented by 3 bytes, the first one always being 0xE2.
Those are non-ASCII "extended" characters, well beyond the 7-bit ASCII set.
In particular, Word just *LOVES* to use funky-ass "quote" marks that are
"curly" quotes with some Microsoft-centric format.
If you check the User Contributed notes for str_replace and the like,
you'll find innumerable listings/solutions for replacing all known (by
empirical/evidential analysis) extended MS Word combinations.
> Why is there an 0xE2 beginning the character combination and why does PHP
> translate these characters this way? Is there something you can do to
> minimize them besides writing some kind of character scrubber?
PHP doesn't "translate" them, really.
The HTTP/browser/web-server sent that character, and PHP is just using
what it got.
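As for why the first byte is always 0xE2: that falls out of how UTF-8 encodes code points. The sketch below is only an illustration, assuming the browser submitted the form as UTF-8. The helper utf8_three_bytes() is a made-up name, and it handles only code points in the range U+0800 to U+FFFF. The Word characters live around U+2013 to U+2026, and every code point from U+2000 to U+2FFF gets the lead byte 0xE2.

```php
<?php
// Hypothetical helper: encode a code point in U+0800..U+FFFF as the
// three bytes UTF-8 uses for that range.
function utf8_three_bytes(int $codePoint): string
{
    return chr(0xE0 | ($codePoint >> 12))           // lead byte: 1110xxxx
         . chr(0x80 | (($codePoint >> 6) & 0x3F))   // continuation: 10xxxxxx
         . chr(0x80 | ($codePoint & 0x3F));         // continuation: 10xxxxxx
}

// windows-1252 byte => Unicode code point for a few Word characters.
$cp1252 = [
    0x91 => 0x2018, // left single quote
    0x92 => 0x2019, // right single quote
    0x93 => 0x201C, // left double quote
    0x94 => 0x201D, // right double quote
    0x96 => 0x2013, // en dash
    0x97 => 0x2014, // em dash
];

foreach ($cp1252 as $byte => $codePoint) {
    printf(
        "0x%02X -> U+%04X -> %s\n",
        $byte,
        $codePoint,
        strtoupper(bin2hex(utf8_three_bytes($codePoint)))
    );
}
```

For the dash, 0x96 maps to U+2013, which comes out as the bytes E2 80 93. Running bin2hex() on the raw form field is a quick way to confirm which of the two representations the script actually received.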
The fact that that character only means what the user THINKS it means in
Microsoft Word is the fault of MS Word for not educating its users about
ASCII (normal) characters versus "extended" characters. It is unlikely
that you'll get MS to admit this is a problem, since for them, it's a
lock-in feature to keep people from easily converting their data to better
software.
At any rate, you can just snag the code from the PHP website of User
Contributed notes and call it done.
--
Like Music?
http://l-i-e.com/artists.htm