Re: Screenscraping UTF-8 characters problem — PHP Programming Language

Posted by Kimmo Laine on 02/24/07 22:28

Philipp Lenssen wrote:
> Hi! I'm having some problems correctly screenscraping and outputting
> e.g. Chinese characters from a Google translator search result. The
> output is always a garbled mess, not Chinese characters. German for
> instance works fine. Thanks for any hints...!!
>
>
> Some relevant parts from the PHP5:
> /******************/
>
> header ('Content-type: text/html; charset=utf-8');
> ...
> showResult( getTranslation('bird flu', 'zh-CN'), 'Chinese' );
> ...
>
>
> function getTranslation($q, $lang)
> {
> $out = '';
> // the Google page is supposed to be UTF-8 too:
> $in = getFileText( "http://google.com/translate_t?langpair=en|" .
> urlencode($lang) . "&text=".urlencode($q) );
> preg_match('/<div id=result_box dir=ltr>(.*?)<\/div>/', $in,
> $out);
>
> $translation = $out[1]; // garbled!
> $translation = trim($translation);
> $translation = utf8_encode($translation); // garbled with or without this line...
> return $translation;
> }
>
> /******************/
>

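It seems to me that what you need are the multibyte functions. You should
replace preg_match with a multibyte-aware function such as mb_ereg_match
(it only returns true or false; if you need the captured text, its sibling
mb_ereg fills a $regs array with the groups):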

http://fi2.php.net/manual/en/function.mb-ereg-match.php

Note that the mb functions aren't included in the default installation; you
need to enable them. See the installation instructions:
http://fi2.php.net/manual/en/ref.mbstring.php
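
To illustrate the idea, here is a minimal sketch of pulling the result out with the mbstring regex functions. It assumes the fetched page really is UTF-8 already, which is why utf8_encode() is left out (that function treats its input as ISO-8859-1, so applying it to UTF-8 text would re-encode it). The helper name extractTranslation() and the sample markup are made up for the example; they are not from the original script or from Google's actual output.

```php
<?php
// Hypothetical helper (not from the original post): extracts the translated
// text from an HTML page that is assumed to be UTF-8 encoded already.
function extractTranslation(string $html): ?string
{
    // Tell the mbstring regex engine to treat pattern and subject as UTF-8.
    mb_regex_encoding('UTF-8');

    // mb_ereg patterns take no delimiters; $regs receives the capture groups.
    $pattern = '<div id=result_box dir=ltr>(.*?)</div>';
    if (!mb_ereg($pattern, $html, $regs)) {
        return null; // the expected markup was not found
    }

    // No utf8_encode() here: the input is assumed to be UTF-8 already.
    return trim($regs[1]);
}

// Made-up sample markup standing in for the fetched page.
$sample = '<html><body><div id=result_box dir=ltr> 禽流感 </div></body></html>';
$translation = extractTranslation($sample);

echo $translation, "\n";
```

If the output is still garbled with something like this, it may be worth checking which charset the fetched page actually declares before matching, since the sketch only works when the bytes coming in are valid UTF-8.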

--
"I'm not a bad person, but apples are my livelihood." -Perttu Sirviö
spam@outolempi.net | Gedoon-S @ IRCnet | rot13(xvzzb@bhgbyrzcv.arg)
