Posted by Andy Hassall on 05/24/05 01:10
On 23 May 2005 14:06:21 -0700, lkrubner@geocities.com wrote:
>Last year I asked a bunch of questions about character encoding on this
>newsgroup. All the answers came down to using ord() in creative ways to
>try to make guesses about multi-byte characters. I was a little amazed
>at this and wondered if I'd somehow misunderstood the situation.
Well - your questions, if I recall, were less about whether PHP supports
multibyte strings and more about strings arriving from external sources with no
well-defined encoding, or worse, arriving in an encoding different from the one
declared by the originating page (the main current browsers handle this badly),
so you were forced to fall back on heuristics to identify the unknown encoding
of a series of bytes.
Once you know what encoding a string is in, then PHP has wide support for
character set encodings.
>I'm pleased to find that Joel Spolsky shared my amazement and offered
>some criticism of PHP on these grounds: "When I discovered that the
>popular web development tool PHP has almost complete ignorance of
>character encoding issues, blithely using 8 bits for characters, making
>it darn near impossible to develop good international web applications,
>I thought, enough is enough."
>
>But his essay is a year older than even the questions I had last year.
>So I'm left wondering, is any work being done to fix the situation? I
>just looked at http://us2.php.net/manual/en/ref.strings.php and saw no
>new functions for handling multi-byte characters. Is anything being
>done on this front?
That's because they're all in the Multibyte String section.
http://uk.php.net/mbstring
>And why aren't a lot of people asking these questions? Once again I'm
>wondering if perhaps I've misunderstood something, somewhere. Isn't
>this an issue that affects pretty much all of us using PHP on the web?
>How are any of the people reading this post dealing with their own
>character encoding issues?
>
>Joel Spolsky's essay is here:
>
>http://www.joelonsoftware.com/articles/Unicode.html
The one key sentence in there is:
"It does not make sense to have a string without knowing what encoding it
uses."
Absolutely.
PHP's "string" datatype is a bit of a misnomer; it's more like a "series of
bytes" datatype. The "plain" string functions, as in C, assume a single-byte
encoding, and are pretty dumb about the mapping between bytes and characters.
Where the encoding matters, some functions take a character set encoding
parameter, or default to ISO-8859-1. You have to keep track of what encoding
you're storing in strings.
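To illustrate the "series of bytes" point, here is a minimal sketch using only core functions. The variable names are made up for the example, and the accented letter is written as explicit byte escapes so the result doesn't depend on how the file itself is saved. Note that the default encoding of htmlentities() has changed between PHP versions, so the sketch passes it explicitly rather than relying on the default.

```php
<?php
// The same word, "café", stored in two different encodings.
$utf8   = "caf\xC3\xA9";  // UTF-8: the e-acute takes two bytes
$latin1 = "caf\xE9";      // ISO-8859-1: the e-acute takes one byte

// strlen() counts bytes, not characters.
$utf8Bytes   = strlen($utf8);    // 5
$latin1Bytes = strlen($latin1);  // 4

// htmlentities() takes the encoding as its third argument. Told the right
// encoding, both byte sequences produce the same entity.
$fromUtf8   = htmlentities($utf8, ENT_QUOTES, 'UTF-8');         // caf&eacute;
$fromLatin1 = htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1');  // caf&eacute;
```

Nothing in $utf8 or $latin1 records which encoding it holds; only the code that passes the third argument knows.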
mbstring puts a bit more intelligence into it, since it knows about more
character set encodings, e.g. it can give you counts of characters for
multibyte encoded strings, or convert between encodings. But you still need to
know what encoding each string is in.
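As a rough sketch of what that looks like, assuming the mbstring extension is loaded (the variable names are just for illustration):

```php
<?php
// "naïve" in UTF-8; the i-diaeresis is the two bytes C3 AF.
$utf8 = "na\xC3\xAFve";

$bytes = strlen($utf8);              // 6: bytes
$chars = mb_strlen($utf8, 'UTF-8');  // 5: characters, given the right encoding

// Convert from UTF-8 to ISO-8859-1 (arguments: string, to, from).
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
$latin1Bytes = strlen($latin1);      // 5: one byte per character now

// mbstring only knows what you tell it. Name the wrong encoding and it
// counts each byte of the UTF-8 string as a separate character.
$wrong = mb_strlen($utf8, 'ISO-8859-1');  // 6
```

The last line is the point: the functions are encoding-aware, but the strings themselves are not.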
Multibyte strings are still second-class citizens in PHP, but saying it has no
support for them is just wrong; mbstring has been around for ages. There's even
an option (mbstring.func_overload) that replaces the builtin single-byte
functions with multibyte-aware equivalents.
http://uk.php.net/manual/en/ref.mbstring.php#mbstring.overload
You can still work with UTF-8 strings without mbstring, anyway. It just
depends what operations you perform on them. Concatenation is unaffected, as is
printing. Counting characters requires a multibyte-aware function, but if you
never use strlen() or other byte-counting functions on the strings, it doesn't
matter what encoding they're in.
If you want regular expressions, then the PCRE regexes have the "u" modifier
that treats the input as UTF-8.
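A small sketch of getting by with core functions and PCRE alone, no mbstring. The sample strings and variable names are invented for the example; the byte escapes are "Grüße" and " aus Köln" in UTF-8.

```php
<?php
$a = "Gr\xC3\xBC\xC3\x9Fe";  // "Grüße": 5 characters, 7 bytes
$b = " aus K\xC3\xB6ln";     // " aus Köln": 9 characters, 10 bytes

// Concatenation just appends bytes, so it is safe for UTF-8.
$joined = $a . $b;

// With the "u" modifier, "." matches one whole UTF-8 character
// ("s" lets it match newlines too), so the match count is a character count.
$chars = preg_match_all('/./su', $joined, $m);  // 14

// Without "u", the same pattern matches one byte at a time.
$bytes = preg_match_all('/./s', $joined, $m);   // 17

// With "u", PCRE refuses subjects that are not valid UTF-8 and preg_match()
// returns false, which gives a cheap validity check.
$validOk  = preg_match('//u', $joined) === 1;     // true
$validBad = preg_match('//u', "caf\xE9") === 1;   // false: lone ISO-8859-1 byte
```

The validity check only tells you whether the bytes could be UTF-8, not what they actually are if they aren't, so it's still a heuristic when the encoding is genuinely unknown.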
So it all looks pretty well covered.
Perl only recently (in 5.8) finished the transition to natively supporting
utf8 strings (a process that began a long time ago). Strings in Perl are now
either a series of bytes of undefined encoding (i.e. C or PHP-style strings),
or have a utf8 flag set indicating they're UTF-8 encoded, which the builtin
string functions are aware of and so return the correct results in terms of
characters.
That's one step up from PHP, since strings carry around some metadata with
them on their encoding, at least if they're UTF-8.
--
Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool