 Posted by Benjamin Niemann on 10/14/06 20:04 
Jeffrey wrote: 
 
>> 
>> > I've found an oddity with HTML/Javascript that I'm hoping someone on 
>> > this list could shed some light on for me.  This arose when I was using 
>> > the libxml parser to parse some HTML web pages. 
>> 
>> libxml is correct (too correct for such a usage), these and other 
>> websites not. 
>> 
>> As you can obviously not fix documents that are not your own and far too 
>> many documents on the web are malformed, invalid or simply a heap of 
>> s**t, it is not a wise decision to use a strict parser like libxml. 
>> There are special parsers built to deal with such 'tag-soup' documents, 
>> e.g. 'Beautiful Soup' for Python 
>> <http://www.crummy.com/software/BeautifulSoup/>. 
>> There may be similar packages for the language of your choice (if it does 
>> not happen to be Python). 
>  
> What you describe is exactly what I want.  Do you (or does anyone) know 
> of such a parser that will work in plain old C.  A search doesn't bring 
> up more than a few comments like, "hey, there should be a C Tag-Soup 
> library" and my application requires C.  Is "tag-soup" the name that I 
> should look under for this? 
 
HTML Tidy <http://tidy.sourceforge.net/> (better known as a stand-alone 
program that reads 'tag-soup' and outputs a cleaned-up version) seems to 
be written in C, and the functionality might be available through TidyLib 
('seems' and 'might', because this is just the result of a few seconds on 
its website). 
You'll probably have to pass the documents through TidyLib to transform 
them to (at least) well-formed XML, which you can then parse with libxml. 
 
--  
Benjamin Niemann 
Email: pink at odahoda dot de 
WWW: http://pink.odahoda.de/
 