 Posted by Markus Ernst on 09/20/05 22:08 
Hi 
 
In order to make strings suitable for URLs in a UTF-8 encoded website, I use 
two functions. The first removes accents from some Latin-1, Latin-2, and 
Turkish characters (suggestions for changes or additions welcome!); the 
second replaces runs of non-word characters with hyphens and then 
urlencode()s the string: 
 
function remove_accents($string, $german=false) { 
    // Single characters 
    $single_fr = explode(" ", "      A A  C C D D      E E G    
  I L L L  N N       O R R S S S T T     U U  Z Z Z       
 a a  c c d d     e e g     i l l l  n n        o r r s s  
s t t     u u   z z z"); 
    $single_to = explode(" ", "A A A A A A A A C C C D D D E E E E E E G I I  
I I I L L L N N N O O O O O O O R R S S S T T U U U U U U Y Z Z Z a a a a a  
a a a c c c d d e e e e e e g i i i i i l l l n n n o o o o o o o o r r s s  
s t t u u u u u u y y z z z"); 
    $single = array(); 
    for ($i=0; $i<count($single_fr); $i++) { 
        $single[$single_fr[$i]] = $single_to[$i]; 
    } 
    // Ligatures 
    $ligatures = array("Æ"=>"Ae", "æ"=>"ae", "Œ"=>"Oe", "œ"=>"oe", "ß"=>"ss"); 
    // German umlauts 
    $umlauts = array("Ä"=>"Ae", "ä"=>"ae", "Ö"=>"Oe", "ö"=>"oe", "Ü"=>"Ue", "ü"=>"ue"); 
    // Replace 
    $replacements= array_merge($single, $ligatures); 
    if ($german) $replacements= array_merge($replacements, $umlauts); 
    $string = strtr($string, $replacements); 
    return $string; 
} 
 
function make_url_string($string) { 
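    // Transliterate accented characters (German style), then lowercase 
    $string = strtolower(remove_accents($string, true)); 
    // Collapse each run of non-word characters into a single hyphen 
    $string = preg_replace("/([\W]+)/", "-", $string); 
    // Drop leading/trailing hyphens before URL-encoding 
    return urlencode(trim($string, "-")); 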
} 
 
I have 2 questions on this: 
 
1. preg_replace("/([\W]+)/", "-", $string); also replaces all non-ASCII 
characters. Is there a way to replace only punctuation and the like, but 
keep letters from any character set? 
 
2. Is there a better way to encode strings for URLs? Or is it unavoidable to 
store the real name and a separate ASCII-only name for the URL? 
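
To make question 1 concrete, here is a minimal sketch of the kind of thing I mean. It assumes PHP 7 or later and a PCRE build with UTF-8 and Unicode property support; slug_keep_letters() is just a made-up name for illustration, not an existing function. The idea is to use the /u modifier together with \p{L} (any letter) and \p{N} (any number) instead of \W:

```php
<?php
// Hypothetical helper for illustration only; not part of the functions above.
function slug_keep_letters(string $string): string {
    // \p{L} matches a letter in any script, \p{N} any kind of number.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so multi-byte characters are matched as whole characters.
    $slug = preg_replace('/[^\p{L}\p{N}]+/u', '-', $string);
    if ($slug === null) {
        // preg_replace() returns null on error, e.g. for invalid UTF-8 input.
        return '';
    }
    // Remove hyphens left over at either end.
    return trim($slug, '-');
}

echo slug_keep_letters("Çay & Kahve, İstanbul!"), "\n"; // Çay-Kahve-İstanbul
```

The result still contains non-ASCII letters, so it would need to go through urlencode() or rawurlencode() before being put into a link. Lowercasing is left out here on purpose, because strtolower() is not safe on multi-byte UTF-8 strings.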
 
Thanks for suggestions! 
 
--  
Markus
 
  