Reply to Regular expression: non-latin word/non-word characters and UTF-8 — All PHP

Posted by Markus Ernst on 02/22/41 11:27

Hi

I wrote a function that "normalizes" strings for use in URLs in a UTF-8
encoded content administration application. After removing the accents
from Latin characters, I try to remove all non-word characters from the
string:

// PCRE syntax:
$string = preg_replace("/([\W]+)/", "-", $string);

// POSIX alternative (mbstring is enabled):
$string = ereg_replace("[^[:alnum:]]+", "-", $string);

// post-process and return
return urlencode(trim($string, "-"));

Both ways work, but they also remove all non-Latin characters. What I want is
to remove only the non-word characters, in whatever language, and keep all word
characters regardless of whether they are Japanese, Hebrew, Arabic, Latin or
anything else.

Is there a way for a regex to recognize non-Latin word/non-word characters?
Or do I have to manually specify all the characters to be removed?
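
To make the goal concrete, here is a minimal sketch of the behaviour I am after. It assumes the input is valid UTF-8 and that PCRE was built with Unicode property support, so that the "u" modifier and the \p{L} (letter), \p{M} (combining mark) and \p{N} (number) classes are available. The function name normalize_for_url is only a placeholder for illustration, and I am not sure this is the best approach:

```php
<?php
// Hypothetical helper (the name is a placeholder, not from my real code).
// Assumes $string is valid UTF-8 and PCRE has Unicode property support.
function normalize_for_url(string $string): string
{
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8.
    // \p{L} = letters, \p{M} = combining marks, \p{N} = numbers, in any script.
    // Every run of characters outside those classes becomes a single "-".
    $result = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $string);

    if ($result === null) {
        // preg_replace() returns null on error, e.g. malformed UTF-8 input.
        return '';
    }

    // Post-process as before: strip leading/trailing dashes, then URL-encode.
    return urlencode(trim($result, '-'));
}
```

With this, a call like normalize_for_url("Hello, world!") should give "Hello-world", while Japanese or Hebrew letters would be kept and only the spaces and punctuation between them turned into dashes.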

Thanks for any hints
Markus
