Posted by monochromec on 05/08/07 22:43
Hi all,
 I'm in the process of setting up a PHP script that reads an HTML file,
 converts its character encoding and then displays the contents of a
 single HTML tag as follows:
 
 
 $str = mb_convert_encoding (file_get_contents ('aktuel.htm'),
 'HTML-ENTITIES', 'ISO-8859-1');
 
 file_put_contents ('dmp.htm', $str);
 
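 // Instantiate the document first; loadHTML() is an instance method.
 $dom = new DOMDocument ();
 $dom->loadHTML ($str);
 $elem = $dom->getElementsByTagName ('h5');
 if ($elem->length) {
     // Dump the raw bytes of the first <h5> element's text as hex.
     $n = $elem->item (0)->nodeValue;
     var_dump (bin2hex ($n));
 }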
 
 What's interesting is that the source HTML file is properly ISO-8859-1
 encoded (which the contents of "dmp.htm" confirm). The trouble starts
 when I retrieve the contents of the first <h5> tag, which has an umlaut
 in it. In this case the umlaut gets mangled: what used to be a
 "Ü" (capital U umlaut, ISO-8859-1 0xdc) has now become the two bytes
 0xc3 0x9c, as the var_dump confirms. Two things surprise me: that the
 character changes at all, and that the umlaut is not entity-encoded
 as HTML-ENTITIES would suggest. I use PHP version 5.2.1 on a Linux
 box.
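 
 To narrow this down, the two byte values can be compared outside of the
 DOM code. The following is only a minimal sketch, assuming the mbstring
 extension is loaded; the variable names are made up for illustration, and
 it does not by itself explain what DOMDocument does internally. It
 re-encodes the single ISO-8859-1 byte 0xdc as UTF-8 and dumps the result,
 which matches the bytes seen above:

```php
<?php
// Sketch only: relates the ISO-8859-1 byte 0xdc to the observed 0xc3 0x9c.
// Assumes the mbstring extension is available; names are illustrative.

$latin1 = "\xdc"; // "Ü" as a single ISO-8859-1 byte

// Re-encode the ISO-8859-1 byte as UTF-8.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump(bin2hex($utf8)); // string(4) "c39c"

// Converting the UTF-8 bytes back yields the original single byte.
$back = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
var_dump(bin2hex($back)); // string(2) "dc"

// If an entity is wanted instead, htmlentities() can be told
// that its input is UTF-8.
$entity = htmlentities($utf8, ENT_QUOTES, 'UTF-8');
var_dump($entity); // string(6) "&Uuml;"
```

 So the value coming back from nodeValue looks like the same character in
 UTF-8 rather than a corrupted one. If that is the case, converting it
 back as in the second step might be enough for output in ISO-8859-1.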
 
 Any thoughts?
 
 Cheers, Christoph