Posted by monochromec on 05/09/07 11:16
Hi all,
I'm in the process of setting up a PHP script that reads an HTML file,
does a character conversion and then displays the contents of a single
HTML tag as follows:
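// Read the ISO-8859-1 file and turn non-ASCII characters into HTML entities
$str = mb_convert_encoding (file_get_contents ('aktuel.htm'),
    'HTML-ENTITIES', 'ISO-8859-1');
file_put_contents ('dmp.htm', $str);
// loadHTML is an instance method, so create the document first
$dom = new DOMDocument ();
$dom->loadHTML ($str);
$elem = $dom->getElementsByTagName ('h5');
if ($elem->length) {
    // Dump the raw bytes of the first <h5> element's text
    $n = $elem->item (0)->nodeValue;
    var_dump (bin2hex ($n));
}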
What's interesting is that the source HTML file is properly ISO-8859-1
encoded (which the contents of "dmp.htm" confirm). The trouble starts
when I retrieve the contents of the first <h5> tag, which has an umlaut
in it. The umlaut comes out garbled: what used to be a
"Ü" (capital U umlaut, ISO-8859-1 0xdc) has become "Ãœ" (0xc3 0x9c,
as the var_dump confirms). Two things surprise me: that
the character changes at all, and that the umlaut is not HTML-encoded
as HTML-ENTITIES would suggest. I'm using PHP 5.2.1 on a Linux
box.
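One hedged observation: 0xc3 0x9c is exactly the UTF-8 encoding of
U+00DC, so my working assumption is that the DOM extension hands back
nodeValue as UTF-8 with the entities already decoded, whatever the
input encoding was. A minimal sketch to check the byte arithmetic and
convert back to ISO-8859-1 (the variable names are just for
illustration, and it assumes the mbstring extension is loaded):

```php
<?php
// "Ü" as a single ISO-8859-1 byte (0xdc).
$latin1 = "\xdc";

// Re-encode to UTF-8: U+00DC becomes the two bytes 0xc3 0x9c,
// the same bytes the var_dump of nodeValue showed.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump(bin2hex($utf8));

// Assumption: nodeValue is UTF-8, so converting it back to
// ISO-8859-1 should recover the original single byte.
$back = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
var_dump(bin2hex($back));
```

If that assumption holds, running the same UTF-8 to ISO-8859-1
conversion on $n before output should give me the 0xdc I expected.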
Any thoughts?
Cheers, Christoph