Posted by Benjamin Niemann on 10/14/06 20:04
Jeffrey wrote:
>>
>> > I've found an oddity with HTML/Javascript that I'm hoping someone on
>> > this list could shed some light on for me. This arose when I was using
>> > the libxml parser to parse some HTML web pages.
>>
>> libxml is correct (too correct for such a usage); these and other
>> websites are not.
>>
>> As you can obviously not fix documents that are not your own and far too
>> many documents on the web are malformed, invalid or simply a heap of
>> s**t, it is not a wise decision to use a strict parser like libxml.
>> There are special parsers built to deal with such 'tag-soup' documents,
>> e.g. 'Beautiful Soup' for Python
>> <http://www.crummy.com/software/BeautifulSoup/>.
>> There may be similar packages for the language of your choice (if it does
>> not happen to be Python).
>
> What you describe is exactly what I want. Do you (or does anyone) know
> of such a parser that will work in plain old C? A search doesn't bring
> up more than a few comments like, "hey, there should be a C Tag-Soup
> library" and my application requires C. Is "tag-soup" the name that I
> should look under for this?
HTML Tidy <http://tidy.sourceforge.net/> (better known as a stand-alone
program which reads 'tag-soup' and outputs a cleaned-up version) seems to
be written in C, and the functionality might be available through TidyLib
('seems' and 'might', because this is just the result of a few seconds on
its website).
You'll probably have to pass the documents through TidyLib to transform
them into (at least) well-formed XML, which you can then parse with libxml.
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/