Posted by Markus Ernst on 09/29/41 11:27
Hi
I wrote a function that "normalizes" strings for use in URLs in a UTF-8
encoded content administration application. After removing the accents
from Latin characters, I try to remove all non-word characters from the
string:
// PCRE syntax:
$string = preg_replace("/([\W]+)/", "-", $string);
// POSIX alternative (mbstring is on):
$string = ereg_replace("[^[:alnum:]]+", "-", $string);
// post-process and return
return urlencode(trim($string, "-"));
Both ways work, but they also remove all non-Latin characters. What I want is to
remove only the non-word characters, whatever the language, and keep all word
characters regardless of whether they are Japanese, Hebrew, Arabic, Latin or
anything else.
Is there a way for a regex to recognize non-Latin word and non-word characters?
Or do I have to specify all the characters to be removed manually?
Thanks for any hints
Markus
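
One possible approach, sketched here under the assumption that PHP's PCRE is built with UTF-8 and Unicode property support: add the /u modifier so the pattern and subject are treated as UTF-8, and describe "word characters" with the Unicode property classes \p{L} (letters), \p{M} (combining marks) and \p{N} (numbers) instead of \W. The function name normalize_for_url is hypothetical and not part of the original post.

```php
<?php
// Hypothetical helper; a sketch, not the original poster's function.
function normalize_for_url(string $string): string
{
    // [^\p{L}\p{M}\p{N}]+ matches runs of characters that are neither
    // letters, combining marks, nor numbers in any script.
    // The /u modifier tells PCRE to read pattern and subject as UTF-8.
    $result = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $string);

    if ($result === null) {
        // preg_replace() returns null on error, for example when the
        // subject is not valid UTF-8. Returning '' is an arbitrary choice.
        return '';
    }

    // Post-process as in the original snippet.
    return urlencode(trim($result, '-'));
}
```

Because urlencode() percent-encodes the non-ASCII bytes, the result is easiest to inspect after urldecode(): for the input "Hello, World!" the function returns "Hello-World". \p{M} is included so that combining marks, which are part of words in scripts such as Hebrew with vowel points, are not replaced by hyphens.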