Posted by Jim Higson on 03/29/06 11:16
Jo wrote:
> Thanks.
> I'm writing an HTML parser that removes the tags and keeps the sensible
> text. This is in C#. It's like a tool. But can I add another tool to it,
> like HTML Tidy, to clean up? Would that be right?
> In web pages, I only want the main text to be displayed and not the side
> divisions on the left and right of the page that mostly show links
> to other pages.
> I realised that in the web page I'm working on now, the right and left
> divs are each inside a font tag with its own specified class. So I will
> check whether it's a font tag, then check its class, and if both match,
> I'll remove everything until a </font> tag comes. This was working fine
> until one web page showed me that the </font> tag was missing for a
> <font> tag. Now what do I do?
> I have coded this in C#.
Writing an error-tolerant HTML/SGML parser takes a long time. Do you have
to write it yourself (i.e. is it for a school project), or could you use a
preexisting one?
TagSoup is a pretty good parser for bad HTML. See:
http://www.idealliance.org/papers/xml02/dx_xml02/html/abstract/05-06-06.html
TagSoup is in Java, but not every part of a project has to be in the same
language.
--
Jim