Posted by "Richard Lynch" on 12/15/05 02:26
I have a table like this:

 artist_id | artistname  | artistname_alpha
 1         | The Doors   |
 2         | The The     |
 3         | 100 Monkeys |
 4         | 3�16   |
 
 That last artistname is not in ASCII/English...  Dunno what your email
 client is showing you, but it's:
 
 the digit 3
 capital A with umlauts
 US cents sign
 capital A with caret
 question mark
 capital A with caret
 US cents sign
 the digit 1
 the digit 6
 
 THAT ought to get through any email client/mta okay. :-)
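
Describing glyphs only goes so far, since every client renders them differently. A minimal sketch of dumping the raw bytes instead, assuming `$name` holds one artistname value exactly as read from the table. The literal used here is only a stand-in (a UTF-8 middle dot between the digits), and `hex_bytes()` is a hypothetical helper name:

```php
<?php
// Stand-in value: "3", a UTF-8 middle dot (bytes C2 B7), then "16".
// In practice $name would come straight from the artistname column.
$name = "3\xC2\xB716";

// Hypothetical helper: show each byte of a string as two hex digits.
function hex_bytes($s) {
    return implode(' ', str_split(bin2hex($s), 2));
}

echo hex_bytes($name), "\n";  // 33 c2 b7 31 36
```

Posting the hex instead of the glyphs sidesteps whatever the mail client or archive does to the characters.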
 
 Now, my goal is to fill in artistname_alpha with things such as:
 Doors, The
 The, The
 one hundred monkeys
 3�16 (???)
 
 I've written a nifty function for this:
 
 function alpha($string) {
     // utf8_decode() had no visible effect on the odd names, so it stays disabled.
     //$string = utf8_decode($string);

     // Spell out dollar amounts via PEAR Numbers_Words.
     $string = preg_replace_callback('/(\\$[0-9\\.]+)/', function ($s) {
         return Numbers_Words::toCurrency(str_replace('$', '', $s[1]));
     }, $string);
     // Spell out any remaining runs of digits.
     $string = preg_replace_callback('/([0-9]+)/', function ($s) {
         return Numbers_Words::toWords($s[1]);
     }, $string);

     // Move a leading article to the end: "The Doors" becomes "Doors, The".
     foreach (array('The ', 'An ', 'A ') as $article) {
         $len = strlen($article);
         if (strcasecmp(substr($string, 0, $len), $article) == 0) {
             // rtrim() drops the article's trailing space from the output.
             return substr($string, $len) . ', ' . rtrim(substr($string, 0, $len));
         }
     }
     return $string;
 }
 
 Now, the tricky part is that I don't really know what
 '3�16' is.
 
 It looks like it might be UTF-8, but utf8_decode() had no effect on
 it, which is why I've commented that out in the function.
 
 SO my function currently converts it to:
 'three�sixteen'
 
 That ain't right.
 
 So, does anybody who understands this i18n stuff want to clue me in
 the right direction?...
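
One possibility worth ruling out is UTF-8 that was encoded a second time, as if its bytes were Latin-1. A rough sketch of undoing one such layer follows. It assumes the stored bytes are C3 A2 C2 80 C2 A2, which is only inferred from the glyph list above, and `undo_double_utf8()` is a hypothetical helper, not part of `alpha()`:

```php
<?php
// Hypothetical helper: undo one layer of accidental Latin-1 -> UTF-8 re-encoding.
function undo_double_utf8($s) {
    // Fast path: pure ASCII names need no work.
    if (!preg_match('/[\x80-\xFF]/', $s)) {
        return $s;
    }
    // Collapse each two-byte sequence for U+0080..U+00FF back to a single byte.
    $once = preg_replace_callback('/[\xC2\xC3][\x80-\xBF]/', function ($m) {
        return chr(((ord($m[0][0]) & 0x03) << 6) | (ord($m[0][1]) & 0x3F));
    }, $s);
    // Accept the result only if it changed and is itself valid UTF-8.
    if ($once !== $s && preg_match('//u', $once) === 1) {
        return $once;
    }
    return $s;
}

// Assumed bytes for the odd band name (a guess from the glyph list).
$fixed = undo_double_utf8("3\xC3\xA2\xC2\x80\xC2\xA216");
echo bin2hex($fixed), "\n";  // 33e280a23136
```

Names that are already correct UTF-8 come back unchanged, because collapsing them produces invalid UTF-8 and the result is rejected. The ASCII check up front keeps the cost low for the bulk of the rows.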
 
 Things you should know:
 
 I'm not trying to provide support for anything but English here,
 unless it's trivial to do so.
 
 The table has 150,000 rows.
 
 I have no real control over fancy MySQL settings, as it's a $20 shared
 host deal.
 
 Every day, at 6 am, I get a new file of this data, and run through it
 with a script that does an UPDATE or INSERT.  REPLACE is not suitable
 due to primary key field size of source data.  Anyway, I haven't even
 checked if the function as-is will be too slow, but whatever I do to
 fix the i18n issue can't have too much overhead, as it will be called
 150,000 times every morning at 6 am.
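
A quick way to check the speed concern is to time 150,000 calls directly. This is a minimal sketch with a hypothetical stand-in, `alpha_stub()`, that performs only the article-moving step. The real function also calls Numbers_Words and would need to be timed the same way:

```php
<?php
// Hypothetical stand-in for alpha(): only moves a leading "The " to the end.
function alpha_stub($string) {
    if (strncasecmp($string, 'The ', 4) === 0) {
        return substr($string, 4) . ', ' . substr($string, 0, 3);
    }
    return $string;
}

$rows = 150000;                 // row count from the post
$start = microtime(true);
for ($i = 0; $i < $rows; $i++) {
    alpha_stub('The Doors');
}
$elapsed = microtime(true) - $start;
printf("%d calls took %.3f s\n", $rows, $elapsed);
```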
 
 If it helps, here is what my data-source dumps out when he encounters
 this band name:
 http://cdbaby.com/cd/316live
 
 Here is the band's web-site:
 http://316live.com/
 
 And, here, possibly, is HTML source for what somebody copied/pasted
 into the FORM to fill in the band name:
 
 3&middot;16
 
 So, possibly, this is not i18n at all, and just somebody really really
 really silly copying and pasting an HTML entity 'middot' from their
 website into a form input and expecting it to render...
 
 Would '·' output by a browser turn into 'âÂ�¢' ???
 
 If so, what can I do about it?
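
The middot theory can be tested by decoding the entity as a browser would and then encoding the result a second time. This is a sketch only; `latin1_to_utf8()` is a hypothetical helper that mimics an accidental second UTF-8 encoding by treating each high byte as a Latin-1 character:

```php
<?php
// Decode the entity to UTF-8, roughly what a browser submits from a UTF-8 page.
$decoded = html_entity_decode('3&middot;16', ENT_QUOTES, 'UTF-8');
echo bin2hex($decoded), "\n";   // 33c2b73136

// Hypothetical helper: re-encode every high byte as if it were a Latin-1 character.
function latin1_to_utf8($bytes) {
    return preg_replace_callback('/[\x80-\xFF]/', function ($m) {
        $b = ord($m[0]);
        return chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }, $bytes);
}

$twice = latin1_to_utf8($decoded);
echo bin2hex($twice), "\n";     // 33c382c2b73136
```

Under these assumptions a middot encoded twice is only four bytes between the digits, not the six characters listed earlier. Six would fit a three-byte UTF-8 character encoded twice; U+2022 BULLET (E2 80 A2) matches the description, though that is a guess from the glyph list and not a confirmed reading of the data.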
 
 --
 Like Music?
 http://l-i-e.com/artists.htm