Posted by Markus Ernst on 09/20/05 22:15
Sorry for the multipost - I forgot to crosspost and alt.php gets less attention than comp.lang.php... And I hope this will work with UTF-8.
 
 In order to make strings suitable for URLs in a UTF-8 encoded website, I use
 2 functions. The first removes accents from some Latin-1, Latin-2, and
 Turkish characters (suggestions for changes or additions welcome!), and the
 second replaces runs of non-word characters with hyphens and then
 urlencode()s the string:
 
 function remove_accents($string, $german=false) {
 // Single characters
 $single_fr = explode(" ", "À Á Â Ã Ä Å Ă Ą Ç Ć Č Ď Đ Ð È É Ê Ë Ę Ě Ğ " .
 "Ì Í Î Ï İ Ĺ Ľ Ł Ñ Ń Ň Ò Ó Ô Õ Ö Ø Ő Ŕ Ř Ś Ş Š Ţ Ť Ù Ú Û Ü Ů Ű Ý Ź Ż Ž " .
 "à á â ã ä å ă ą ç ć č ď đ è é ê ë ę ě ğ " .
 "ì í î ï ı ĺ ľ ł ñ ń ň ð ò ó ô õ ö ø ő ŕ ř ś ş š ţ ť ù ú û ü ů ű ý ÿ ź ż ž");
 $single_to = explode(" ", "A A A A A A A A C C C D D D E E E E E E G " .
 "I I I I I L L L N N N O O O O O O O R R S S S T T U U U U U U Y Z Z Z " .
 "a a a a a a a a c c c d d e e e e e e g " .
 "i i i i i l l l n n n o o o o o o o o r r s s s t t u u u u u u y y z z z");
 $single = array();
 for ($i=0; $i<count($single_fr); $i++) {
 $single[$single_fr[$i]] = $single_to[$i];
 }
 // Ligatures
 $ligatures = array("Æ"=>"Ae", "æ"=>"ae", "Œ"=>"Oe", "œ"=>"oe",
 "ß"=>"ss");
 // German umlauts
 $umlauts = array("Ä"=>"Ae", "ä"=>"ae", "Ö"=>"Oe", "ö"=>"oe", "Ü"=>"Ue",
 "ü"=>"ue");
 // Replace
 $replacements= array_merge($single, $ligatures);
 if ($german) $replacements= array_merge($replacements, $umlauts);
 $string = strtr($string, $replacements);
 return $string;
 }
 
 function make_url_string($string) {
 $string = strtolower(remove_accents($string, true));
 $string = preg_replace("/([\W]+)/", "-", $string);
 return urlencode(trim($string, "-"));
 }
 
 I have 2 questions on this:
 
 1. preg_replace("/([\W]+)/", "-", $string); replaces all non-ASCII
 characters along with the punctuation. Is there a way to replace only
 punctuation and the like, but keep letters from any character set?
 
 2. Is there a better way to encode strings for URLs? Or is it unavoidable
 to collect the real name and the name for the URL separately in order to
 get an ASCII-only entry?
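
 For question 1, a minimal sketch of one possible direction, not a tested
 solution: the /u pattern modifier makes PCRE treat the pattern and the
 subject as UTF-8, and the character class [^\p{L}\p{N}] then matches
 everything that is not a letter or a digit in any script. This assumes the
 input is valid UTF-8, the source file is saved as UTF-8, and PCRE was
 compiled with Unicode property support. The function name
 make_url_string_utf8 is hypothetical, and the sketch leaves out the
 lowercasing and accent-removal steps.

```php
<?php
// Hypothetical helper, not one of the two functions above.
// Assumes $string is valid UTF-8 and that PCRE supports \p{..} properties.
function make_url_string_utf8($string) {
    // /u: treat pattern and subject as UTF-8.
    // \p{L} = any letter, \p{N} = any digit; every other run becomes "-".
    $string = preg_replace('/[^\p{L}\p{N}]+/u', '-', $string);
    // preg_replace() returns null on error, e.g. for invalid UTF-8 input.
    if ($string === null) {
        return '';
    }
    // urlencode() percent-encodes the remaining non-ASCII bytes.
    return urlencode(trim($string, '-'));
}

echo make_url_string_utf8("Über uns: Straße & Co."), "\n";
// prints %C3%9Cber-uns-Stra%C3%9Fe-Co
```

 The letters survive, but they still end up percent-encoded in the URL, so
 this only addresses question 1, not the wish for an ASCII-only entry in
 question 2.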
 
 Thanks for suggestions!
 
 --
 Markus