Posted by monochromec on 05/09/07 11:57
On May 9, 1:16 pm, monochro...@gmail.com wrote:
> Hi all,
>
> I'm in the process of setting up a PHP script that reads an HTML file,
> does a character conversion and then displays the contents of a single
> HTML tag as follows:
>
> $str = mb_convert_encoding (file_get_contents ('aktuel.htm'),
> 'HTML-ENTITIES', 'ISO-8859-1');
>
> file_put_contents ('dmp.htm', $str);
>
> $dom = new DOMDocument ();
> $dom->loadHTML ($str);
> $elem = $dom->getElementsByTagName ('h5');
> if ($elem->length) {
>     $n = $elem->item (0)->nodeValue;
>     var_dump (bin2hex ($n));
> }
>
> What's interesting is that the source HTML file is properly ISO-8859-1
> encoded (which the contents of "dmp.htm" verifies). The trouble starts
> when I retrieve the contents of the first <h5> tag that has an umlaut
> in it. In this case, the umlaut is screwed up - what used to be a
> "Ü" (capital U umlaut, ISO-8859-1 0xdc) has now become "Ãœ" (0xc3 0x9c
> as the var_dump confirms). Two things surprise me: that the character
> changes at all, and that the umlaut is not entity-encoded as
> HTML-ENTITIES would suggest. I use PHP version 5.2.1 on a Linux
> box.
>
> Any thoughts?
>
> Cheers, Christoph
After some :-) research, it turns out that the encoding of the contents
of the first <h5> tag has actually changed to UTF-8, hence the strange
byte sequence: 0xc3 0x9c is the UTF-8 encoding of "Ü". This raises the
question of whether the default encoding for parsed HTML strings in the
DOM package is UTF-8 (even though we start out with
HTML-ENTITIES-conformant encoding). Is this a bug in DOMDocument or a
feature?
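
For what it's worth, here is a minimal sketch of the behaviour described
above, assuming the DOM extension hands back node values as UTF-8 no
matter how the input was encoded. The markup and the word "Übersicht" are
made up for illustration and stand in for the entity-encoded contents of
aktuel.htm. Converting back with mb_convert_encoding is just one possible
workaround, not necessarily the intended one.

```php
<?php
// Hypothetical stand-in for the entity-encoded HTML written to dmp.htm:
// the umlaut is already expressed as an HTML entity, so the input is pure ASCII.
$html = '<html><body><h5>&Uuml;bersicht</h5></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$elem = $dom->getElementsByTagName('h5');
if ($elem->length) {
    // The parser resolves the entity; the node value comes back UTF-8 encoded.
    $utf8 = $elem->item(0)->nodeValue;
    echo bin2hex($utf8), "\n";    // starts with c39c, the UTF-8 bytes of "Ü"

    // Convert explicitly if ISO-8859-1 output is needed.
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    echo bin2hex($latin1), "\n";  // starts with dc, the ISO-8859-1 byte of "Ü"
}
```

If that assumption holds, the byte sequence is expected rather than a
corruption, and any re-encoding has to happen on the way out.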
Cheers, Christoph