Reply to Weird loadHTML behaviour — All PHP

Posted by monochromec on 05/09/07 11:16

Hi all,

I'm in the process of setting up a PHP script that reads an HTML file,
does a character conversion and then displays the contents of a single
HTML tag as follows:

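// Convert the ISO-8859-1 file so that non-ASCII characters become HTML entities
$str = mb_convert_encoding (file_get_contents ('aktuel.htm'),
    'HTML-ENTITIES', 'ISO-8859-1');

// Dump the converted markup for inspection
file_put_contents ('dmp.htm', $str);

$dom = DOMDocument::loadHTML ($str);
$elem = $dom->getElementsByTagName ('h5');
if ($elem->length) {
    // Print the raw bytes of the first <h5> as hex
    $n = $elem->item (0)->nodeValue;
    var_dump (bin2hex ($n));
}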

What's interesting is that the source HTML file is properly ISO-8859-1
encoded (which the contents of "dmp.htm" verify). The trouble starts
when I retrieve the contents of the first <h5> tag, which has an umlaut
in it. In this case, the umlaut is screwed up: what used to be a
"Ü" (capital U umlaut, ISO-8859-1 0xdc) has now become "Ãœ" (0xc3 0x9c,
as the var_dump confirms). Two things surprise me: that the character
changes at all, and that the umlaut is not HTML-encoded
as HTML-ENTITIES would suggest. I use PHP version 5.2.1 on a Linux
box.

Any thoughts?

Cheers, Christoph
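
The bytes 0xc3 0x9c reported above are the UTF-8 encoding of "Ü" (U+00DC), which is consistent with the DOM extension decoding entities while parsing and returning node values as UTF-8. A minimal sketch of that reading, assuming the mbstring and DOM extensions are loaded; the sample markup, the heading text "Übersicht" and the variable names are made up for illustration and are not taken from aktuel.htm:

```php
<?php
// Hypothetical stand-in for the converted file: "Ü" already written as the
// entity &Uuml;, i.e. what the HTML-ENTITIES conversion is assumed to produce.
$str = '<html><body><h5>&Uuml;bersicht</h5></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($str);

// The parser resolves the entity; nodeValue holds the character as UTF-8.
$n = $dom->getElementsByTagName('h5')->item(0)->nodeValue;
echo bin2hex($n), "\n";            // c39c6265727369636874

// Convert back to ISO-8859-1 if single-byte output is required ...
$latin1Back = mb_convert_encoding($n, 'ISO-8859-1', 'UTF-8');
echo bin2hex($latin1Back), "\n";   // dc6265727369636874

// ... or re-encode as entities, telling htmlentities the input is UTF-8.
$entities = htmlentities($n, ENT_QUOTES, 'UTF-8');
echo $entities, "\n";              // &Uuml;bersicht
```

Seen as ISO-8859-1 or Windows-1252 text, the two UTF-8 bytes display as "Ãœ", which would match the output described in the post.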

