Posted by "Richard Lynch" on 12/15/05 02:26
I have a table like this:
artist_id | artistname | artistname_alpha
1 | The Doors |
2 | The The |
3 | 100 Monkeys |
4 | 3�16 |
That last artistname is not in ASCII/English... Dunno what your email
client is showing you, but it's:
the digit 3
capital A with umlauts
US cents sign
capital A with caret
question mark
capital A with caret
US cents sign
the digit 1
the digit 6
THAT ought to get through any email client/mta okay. :-)
Now, my goal is to fill in artistname_alpha with things such as:
Doors, The
The, The
one hundred monkeys
3�16 (???)
I've written a nifty function for this:
require_once 'Numbers/Words.php'; // PEAR Numbers_Words package

function alpha($string) {
    //$string = utf8_decode($string);
    // Spell out dollar amounts such as "$20" before the plain-digit pass below.
    $string = preg_replace_callback('/(\\$[0-9\\.]+)/',
        create_function('$s', 'return Numbers_Words::toCurrency(str_replace("$", "", $s[1]));'),
        $string);
    // Spell out any remaining runs of digits.
    $string = preg_replace_callback('/([0-9]+)/',
        create_function('$s', 'return Numbers_Words::toWords($s[1]);'),
        $string);
    // Move a leading article to the end, without its trailing space:
    // "The Doors" becomes "Doors, The".
    if (stristr(substr($string, 0, 4), 'The '))
        return substr($string, 4) . ', ' . substr($string, 0, 3);
    elseif (stristr(substr($string, 0, 3), 'An '))
        return substr($string, 3) . ', ' . substr($string, 0, 2);
    elseif (stristr(substr($string, 0, 2), 'A '))
        return substr($string, 2) . ', ' . substr($string, 0, 1);
    else
        return $string;
}
Now, the tricky part is that I don't really know what
'3�16' is.
It looks like it might be UTF-8, but utf8_decode() had no effect on
it, which is why I've commented that out in the function.
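A quick way to find out what a string like this really is, instead of trusting whatever a browser or email client displays, is to dump its raw bytes. Below is a minimal sketch; dump_bytes() is a made-up helper name, and the two sample strings are illustrations rather than the actual row from the table:

```php
<?php
// Hypothetical helper: show every byte of a string as two hex digits.
function dump_bytes($string) {
    return implode(' ', str_split(bin2hex($string), 2));
}

// A middle dot (U+00B7) encoded as UTF-8 is the two bytes C2 B7.
echo dump_bytes("3\xC2\xB716"), "\n"; // 33 c2 b7 31 36

// The same character in ISO-8859-1 is the single byte B7.
echo dump_bytes("3\xB716"), "\n";     // 33 b7 31 36
```

Running the real artistname through something like this would show whether the bytes between the 3 and the 16 form a valid UTF-8 sequence.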
So my function currently converts it to:
'three�sixteen'
That ain't right.
So, does anybody who understands this i18n stuff want to clue me in
the right direction?...
Things you should know:
I'm not trying to provide support for anything but English here,
unless it's trivial to do so.
The table has 150,000 rows.
I have no real control over fancy MySQL settings, as it's a $20 shared
host deal.
Every day, at 6 am, I get a new file of this data, and run through it
with a script that does an UPDATE or INSERT. REPLACE is not suitable
due to primary key field size of source data. Anyway, I haven't even
checked if the function as-is will be too slow, but whatever I do to
fix the i18n issue can't have too much overhead, as it will be called
150,000 times every morning at 6 am.
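On the overhead worry: one cheap option, assuming most of the 150,000 names are plain ASCII, is to test for high bytes first and only run any encoding repair on the rows that have them. A rough sketch, where has_high_bytes() is a made-up name:

```php
<?php
// Hypothetical pre-check: true only if the string contains a byte
// outside the 7-bit ASCII range.
function has_high_bytes($name) {
    return preg_match('/[\x80-\xFF]/', $name) === 1;
}

var_dump(has_high_bytes('The Doors'));     // bool(false)
var_dump(has_high_bytes("3\xC2\xB716"));   // bool(true)
```

Pure-ASCII names skip the extra work entirely, so the cost per row is a single regex match.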
If it helps, here is what my data-source dumps out when it encounters
this band name:
http://cdbaby.com/cd/316live
Here is the band's web-site:
http://316live.com/
And, here, possibly, is HTML source for what somebody copied/pasted
into the FORM to fill in the band name:
3&middot;16
So, possibly, this is not i18n at all, and just somebody really really
really silly copying and pasting an HTML entity 'middot' from their
website into a form input and expecting it to render...
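If the stored name really did contain the literal entity text, decoding it would be straightforward. A small sketch, assuming the target encoding is UTF-8 (the input string here is an illustration, not the actual row):

```php
<?php
// Decode a literal HTML entity into the character it names.
$decoded = html_entity_decode('3&middot;16', ENT_QUOTES, 'UTF-8');

// The middle dot (U+00B7) comes out as the UTF-8 bytes C2 B7.
echo bin2hex($decoded), "\n"; // 33c2b73136
```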
Would '·' output by a browser turn into 'âÂ�¢' ???
If so, what can I do about it?
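One reading of the character-by-character description earlier in this post: it is consistent with a bullet (U+2022, UTF-8 bytes E2 80 A2) whose bytes were treated as ISO-8859-1 and then encoded to UTF-8 a second time. That is an inference from the description, not something the data source confirms. Under that assumption, a minimal sketch of undoing one layer with iconv() could look like this; undo_double_utf8() is a made-up helper name, and it leaves a string alone unless the converted result is itself valid UTF-8:

```php
<?php
// Hypothetical helper: undo one layer of "UTF-8 read as ISO-8859-1 and
// encoded to UTF-8 again". Returns the input unchanged when that does
// not appear to be what happened.
function undo_double_utf8($s) {
    // Fails (returns false) if $s holds characters outside ISO-8859-1.
    $once = @iconv('UTF-8', 'ISO-8859-1', $s);
    if ($once !== false && $once !== $s && preg_match('//u', $once) === 1) {
        return $once;
    }
    return $s;
}

// Assumed original: "3", a bullet (E2 80 A2), "16".
$original = "3\xE2\x80\xA216";

// Simulate the suspected damage.
$mangled = iconv('ISO-8859-1', 'UTF-8', $original);
echo bin2hex($mangled), "\n"; // 33c3a2c280c2a23136

var_dump(undo_double_utf8($mangled) === $original); // bool(true)
var_dump(undo_double_utf8('The Doors'));            // string(9) "The Doors"
```

The validity check matters because a correctly encoded name such as "Café" also converts to ISO-8859-1 without error; the result is not valid UTF-8, so the helper keeps the original.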
--
Like Music?
http://l-i-e.com/artists.htm