Posted by Kimmo Laine on 08/30/06 12:25
"Peter Münster" <look@signature.invalid> wrote in message
news:Pine.LNX.4.64.0608301400400.871@gaston.deltadore.bzh...
> On Wed, 30 Aug 2006, Kimmo Laine wrote:
>
>> That might be a multibyte-string related problem. If the string is
>> encoded using a multibyte charset, such as utf-8, it could be the
>> reason str_word_count is confused.
>
> Yes, you're right: I've just tried with fr_FR.iso885915 and it works.
That's great. :)
>> Once you've installed multibyte library, you could try writing a regular
>> expression for counting the words and use it with the mb_ereg* functions.
>
> Thanks for the hint. As a workaround I already use a regular expression to
> get the words, but str_word_count() is still better than my solution:
> str_word_count() detects constructs like "it's" and "week-end" etc.
There are some examples of regexp replacements for str_word_count in the
user contributions on its php.net manual page. You might want to check them.
For example, rcATinterfacesDOTfr suggests that
    $word_count = count(preg_split('/\W+/', $text, -1, PREG_SPLIT_NO_EMPTY));
should work. The advantage of this solution is that there is mb_split as
well, so you could use the same approach with the mb functions if you
wanted to use utf-8.
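To make that concrete, here is a rough sketch of a utf-8 aware word count
that also keeps "it's" and "week-end" together as single words. Note the
assumptions: the function name count_words_utf8 is made up for this post,
the rule "letters optionally joined by an apostrophe or hyphen" is my own
guess at what str_word_count does, and it uses preg_match_all with the PCRE
u modifier instead of the mb functions. It also assumes the input really is
valid utf-8 and a reasonably recent PHP (type declarations).

```php
<?php
// Hypothetical helper, not part of PHP or the manual's user notes.
function count_words_utf8(string $text): int
{
    // \p{L}+            one or more Unicode letters
    // (?:['-]\p{L}+)*   optionally followed by ' or - and more letters,
    //                   so "it's" and "week-end" are one word each
    // u modifier        treat pattern and subject as utf-8
    $pattern = '/\p{L}+(?:[\'-]\p{L}+)*/u';

    // preg_match_all returns the number of matches, or false on error
    // (for example when $text is not valid utf-8).
    $count = preg_match_all($pattern, $text, $matches);

    return $count === false ? 0 : $count;
}

echo count_words_utf8("it's a week-end à Münster"), "\n";
```

A stray hyphen between spaces is not counted as a word, and digits are
ignored; adjust the pattern if you need something different.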
I try to enforce utf-8 whenever possible, simply because of its advantages
in international, multilingual communication, even though it has its
disadvantages as well.
--
"Ohjelmoija on organismi joka muuttaa kofeiinia koodiksi" - lpk
http://outolempi.net/ahdistus/ - Satunnaisesti päivittyvä nettisarjis
spam@outolempi.net || Gedoon-S @ IRCnet || rot13(xvzzb@bhgbyrzcv.arg)