Posted by Jim Higson on 03/14/06 13:52
cawoodm@gmail.com wrote:
> Thank-you all for the helpful feedback.
> It is true that RegEx is a bit of a dark art but I am writing a Crawler
> in VB Dot Net and not Perl or Python.
> I am not sure if the .NET framework supports HTML parsing in the way I
> want it so I've been applying RegEx.
> Basically I want to strip all tags and then remove excess whitespace so
> that I have "pure" text.
> My current strategy is to replace inline tags with an empty string and
> then replace all other tags with a space:
> HTML = RegEx.Replace(HTML, "</?(b|i|u|strong|etc)*>", "")
> HTML = RegEx.Replace(HTML, "</?[^>]*>", " ")
> Then I remove excess whitespace:
> HTMLText = RegEx.Replace(HTMLText, "\s+", " ")
> It's the authoritative list (b|u|i|strong|...) that I'm looking for, so
> I'll take a look at the DTD recommended.
> Cheers
> Jack
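
For reference, here is a rough sketch of the three-step strategy described above. It is written in Python only because that is quick to run; as far as I know the same patterns should work with .NET's Regex.Replace. The tag list is a deliberately incomplete, hypothetical one (the real list would come from the DTD), and the inline pattern differs from the quoted one: it drops the trailing "*" and allows attributes. It does not handle comments, script/style contents or entities.

```python
import re

# Hypothetical, deliberately incomplete list of inline tags.
# The authoritative list would have to come from the HTML DTD.
INLINE_TAGS = ("b", "i", "u", "strong", "em")

# Opening or closing inline tag, with optional attributes.
# \b stops "b" from matching <br> and "i" from matching <img>.
INLINE_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(INLINE_TAGS), re.IGNORECASE)
ANY_TAG_RE = re.compile(r"<[^>]*>")
WHITESPACE_RE = re.compile(r"\s+")


def strip_tags(html):
    """Reduce HTML to plain text using the three replacements from the post."""
    text = INLINE_RE.sub("", html)            # inline tags vanish
    text = ANY_TAG_RE.sub(" ", text)          # every other tag becomes a space
    return WHITESPACE_RE.sub(" ", text).strip()  # collapse whitespace runs
```

Removing inline tags without a space keeps "wor<b>ld</b>" as one word, while block-level tags still separate the text around them.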
The program I recommended (http://www.mbayer.de/html2text/) is a simple
command-line app. You should be able to call it from just about any
language with one line of code. I don't know how you run external commands
in .NET, but it shouldn't be difficult.
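
To show what I mean by "one line", here is a minimal sketch in Python of piping text through an external command and reading back its output. The helper name run_filter is made up for illustration. Calling it with ["html2text"] assumes the program is installed, is on your PATH, and reads HTML from standard input when no file is given, so check its man page before relying on that.

```python
import subprocess


def run_filter(argv, text):
    """Feed text to an external command on stdin and return its stdout.

    argv is the command and its arguments as a list, for example
    ["html2text"] (an assumption: installed and reading from stdin).
    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(
        argv,
        input=text,
        capture_output=True,  # requires Python 3.7 or later
        text=True,
        check=True,
    )
    return result.stdout
```

Usage would then be something like plain = run_filter(["html2text"], html). Any language that can start a process and read its output can do the same.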
--
Jim