Re: Convert HTML to Text

Posted by cawoodm on 03/14/06 13:29

Thank you all for the helpful feedback.
It is true that regex is a bit of a dark art, but I am writing a crawler
in VB.NET, not Perl or Python.
I am not sure whether the .NET Framework supports HTML parsing in the way I
want, so I've been applying regular expressions.
Basically I want to strip all tags and then remove excess whitespace so
that I have "pure" text.
My current strategy is to replace inline tags with an empty string and
then replace all other tags with a space:
HTMLText = Regex.Replace(HTML, "</?(b|i|u|strong|etc)\b[^>]*>", "", RegexOptions.IgnoreCase)
HTMLText = Regex.Replace(HTMLText, "</?[^>]*>", " ")
Then I remove excess whitespace:
HTMLText = Regex.Replace(HTMLText, "\s+", " ")
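For anyone who wants to try the three steps outside .NET, here is a minimal sketch in Python, chosen only because it is quick to run; the patterns themselves should carry over to Regex.Replace. The name html_to_text and the short INLINE_TAGS list are placeholders of mine, not an authoritative set. The inline pattern uses \b[^>]* so that tags carrying attributes are matched too, and the final strip() is an extra that trims the leading and trailing space.

```python
import re

# Placeholder list of inline tags (an assumption, not the authoritative set).
INLINE_TAGS = ("b", "i", "u", "strong")

# Opening or closing inline tags, with or without attributes.
INLINE_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(INLINE_TAGS), re.IGNORECASE)
# Any remaining tag.
ANY_TAG_RE = re.compile(r"<[^>]*>")
# Runs of whitespace, including newlines.
WHITESPACE_RE = re.compile(r"\s+")


def html_to_text(html):
    """Strip tags from html and collapse whitespace (hypothetical helper)."""
    text = INLINE_RE.sub("", html)      # inline tags vanish, so "wor</b>ld" stays one word
    text = ANY_TAG_RE.sub(" ", text)    # every other tag becomes a space
    text = WHITESPACE_RE.sub(" ", text) # collapse runs of whitespace
    return text.strip()


if __name__ == "__main__":
    sample = '<p>Hello <b class="x">wor</b>ld</p>\n<div>Second   line</div>'
    print(html_to_text(sample))
```

This is still a regex approximation: comments, script and style content, and a ">" inside an attribute value would need separate handling.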
It's the authoritative list (b|i|u|strong|...) that I'm looking for, so
I'll take a look at the recommended DTD.
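Once the list is settled, the alternation can be generated rather than typed by hand. A rough sketch in Python again: the names below are what I would expect to find under the %fontstyle and %phrase entities of the HTML 4.01 DTD, so treat them as an assumption to check against the DTD itself, and build_inline_pattern is a hypothetical helper.

```python
import re

# Assumed inline element names (to be verified against the DTD).
FONTSTYLE = ["tt", "i", "b", "big", "small"]
PHRASE = ["em", "strong", "dfn", "code", "samp", "kbd", "var", "cite", "abbr", "acronym"]


def build_inline_pattern(names):
    """Compile one regex matching opening/closing tags for the given names."""
    # \b forces a whole-name match, so "b" does not swallow "<br>" or "<big>".
    alternation = "|".join(re.escape(name) for name in names)
    return re.compile(r"</?(?:%s)\b[^>]*>" % alternation, re.IGNORECASE)


inline_re = build_inline_pattern(FONTSTYLE + PHRASE)

if __name__ == "__main__":
    print(inline_re.sub("", "<em>un</em><strong>break</strong>able<br>text"))
```

Keeping the names in a plain list means the pattern only has to be rebuilt when the list changes.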
Cheers
Jack
