Posted by cawoodm on 03/14/06 13:29
Thank you all for the helpful feedback.
It is true that RegEx is a bit of a dark art, but I am writing a crawler
in VB.NET and not Perl or Python.
I am not sure whether the .NET Framework supports HTML parsing in the way I
want, so I've been using RegEx.
Basically I want to strip all tags and then remove excess whitespace so
that I have "pure" text.
My current strategy is to replace inline tags with an empty string and
then replace all other tags with a space:
HTML = RegEx.Replace(HTML, "</?(b|i|u|strong|etc)\b[^>]*>", "")
HTML = RegEx.Replace(HTML, "</?[^>]*>", " ")
Then I remove excess whitespace:
HTMLText = RegEx.Replace(HTML, "\s+", " ")
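Put together, the three steps might look roughly like the sketch below. The module and function names (StripTagsSketch, HtmlToText) are made up for illustration, and the inline-tag list is only a small assumed subset, not the authoritative list. The \b[^>]* after the tag names is an addition so that tags with attributes still match and a longer tag name such as <body> is not mistaken for <b>.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StripTagsSketch
    ' Hypothetical helper: the tag list is an illustrative subset only.
    Function HtmlToText(ByVal html As String) As String
        ' Step 1: inline tags vanish, so "<b>bold</b>ly" stays one word.
        html = Regex.Replace(html, "</?(b|i|u|em|strong)\b[^>]*>", "", RegexOptions.IgnoreCase)
        ' Step 2: every other tag becomes a space, so adjacent blocks do not run together.
        html = Regex.Replace(html, "</?[^>]*>", " ")
        ' Step 3: collapse runs of whitespace and trim the ends.
        Return Regex.Replace(html, "\s+", " ").Trim()
    End Function

    Sub Main()
        ' Prints: Some boldly styled text
        Console.WriteLine(HtmlToText("<p>Some <b>bold</b>ly  styled</p><p>text</p>"))
    End Sub
End Module
```

This does not handle comments, script or style blocks, or a ">" inside an attribute value, so it is only a starting point.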
It's the authoritative list (b|u|i|strong|...) that I'm looking for, so
I'll take a look at the recommended DTD.
Cheers
Jack