Posted by Benjamin Niemann on 10/13/06 17:49
Hello,
Jeffrey wrote:
> I've found an oddity with HTML/Javascript that I'm hoping someone on
> this list could shed some light on for me. This arose when I was using
> the libxml parser to parse some HTML web pages.
libxml is correct (too correct for such a usage); these and other websites
are not.
Since you obviously cannot fix documents that are not your own, and far too
many documents on the web are malformed, invalid or simply a heap of s**t,
it is not a wise decision to use a strict parser like libxml for them.
There are special parsers built to deal with such 'tag-soup' documents,
e.g. 'Beautiful Soup' for Python
<http://www.crummy.com/software/BeautifulSoup/>.
There may be similar packages for the language of your choice (if it does
not happen to be Python).
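
To illustrate the difference between a strict and a tolerant parser, here is a minimal sketch. Note that it does not use Beautiful Soup: to keep it runnable without installing anything, it swaps in Python's standard-library html.parser as the tolerant parser and xml.etree.ElementTree as a stand-in for a strict one. The TAG_SOUP string and the names strict_parse_ok, LinkCollector and collect_links are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical malformed page: unclosed <p>, <b> and <a>, unquoted attribute.
TAG_SOUP = (
    "<html><body><p>First<p>Second <b>bold"
    "<a href=page.html>link</body></html>"
)


def strict_parse_ok(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class LinkCollector(HTMLParser):
    """Tolerant parser that records every href it comes across."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(markup):
    """Feed possibly broken HTML to the tolerant parser, return its links."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links
```

With this, strict_parse_ok(TAG_SOUP) reports that the strict parser rejects the page, while collect_links(TAG_SOUP) still gets the link out of it. A dedicated tag-soup library goes further and builds a usable tree from such input, which html.parser alone does not do.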
HTH
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/