Screenscraping UTF-8 characters problem

Posted by Philipp Lenssen on 02/24/07 20:03

Hi! I'm having trouble screen-scraping a Google translator search result
and outputting it correctly when it contains e.g. Chinese characters. The
output is always a garbled mess, not Chinese characters. German, for
instance, works fine. Thanks for any hints!


Some relevant parts of the PHP 5 code:
/******************/

header('Content-type: text/html; charset=utf-8');
// ....
showResult( getTranslation('bird flu', 'zh-CN'), 'Chinese' );
// ....


function getTranslation($q, $lang)
{
    $out = '';
    // the Google page is supposed to be UTF-8 too:
    $in = getFileText( "http://google.com/translate_t?langpair=en|" .
        urlencode($lang) . "&text=" . urlencode($q) );
    preg_match('/<div id=result_box dir=ltr>(.*?)<\/div>/', $in, $out);

    $translation = $out[1]; // garbled!
    $translation = trim($translation);
    // garbled with or without this line...
    $translation = utf8_encode($translation);
    return $translation;
}

/******************/
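
One thing that might be worth checking, as a rough sketch only: utf8_encode() assumes its input is ISO-8859-1, so it cannot repair a page that arrives in some other charset (for example a Chinese one), and it double-encodes text that is already UTF-8. The helpers below are hypothetical (detectCharset and toUtf8 are not part of the code above). They assume the charset the server really used can be read from the Content-Type header or a meta tag, and that the iconv extension is available. Whether Google actually serves this page in a non-UTF-8 charset is an assumption I have not verified.

```php
<?php
// Hypothetical helper: pull the charset name out of a Content-Type value
// or a <meta> tag. Returns null when no charset is declared.
function detectCharset($text)
{
    if (preg_match('/charset\s*=\s*["\']?([A-Za-z0-9_\-]+)/i', $text, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical helper: convert scraped bytes to UTF-8 exactly once.
function toUtf8($bytes, $charset)
{
    if ($charset === null || $charset === 'UTF-8') {
        return $bytes; // already UTF-8, or unknown: leave untouched
    }
    // iconv(from, to, string) converts between the two named charsets
    return iconv($charset, 'UTF-8', $bytes);
}

// Usage sketch with a GB2312-encoded sample given as raw bytes:
$header = 'text/html; charset=GB2312';
$utf8   = toUtf8("\xD6\xD0\xCE\xC4", detectCharset($header));
```

In getTranslation() this would take the place of the utf8_encode() call, with the charset taken from whatever the response really declares, not from what the page is "supposed" to be.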

 
