Re: Convert HTML to Text

Posted by cawoodm on 03/14/06 13:29

Thank you all for the helpful feedback.
It is true that regular expressions are a bit of a dark art, but I am
writing a crawler in VB.NET, not Perl or Python.
I am not sure whether the .NET Framework supports HTML parsing in the way
I want, so I've been applying regular expressions.
Basically I want to strip all tags and then remove excess whitespace so
that I have "pure" text.
My current strategy is to replace inline tags with an empty string and
then replace all other tags with a space:
HTML = RegEx.Replace(HTML, "</?(b|i|u|strong|etc)\b[^>]*>", "", RegexOptions.IgnoreCase)
HTML = RegEx.Replace(HTML, "</?[^>]*>", " ")
Then I remove excess whitespace:
HTMLText = RegEx.Replace(HTML, "\s+", " ")
It's the authoritative list (b|u|i|strong|...) that I'm looking for so
I'll take a look at the DTD recommended.
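For reference, the three passes described above could be wrapped up roughly as follows. This is only a sketch: the module name, the function name HtmlToText and the short InlineTags list are placeholders I made up for illustration, not the authoritative list from the DTD. It also assumes no attribute value contains a ">" and does nothing about script/style contents, comments or entities.

```vb
Imports System
Imports System.Text.RegularExpressions

Module HtmlToTextSketch
    ' Assumed inline-tag list, for illustration only; not the authoritative DTD list.
    Private Const InlineTags As String = "b|i|u|em|strong|span|a"

    Function HtmlToText(ByVal html As String) As String
        ' 1. Inline tags are removed outright, so "<b>bo</b>ld" becomes "bold".
        '    \b stops "b" from matching "<body>" or "<br>"; [^>]* allows attributes.
        Dim text As String = Regex.Replace(html, "</?(" & InlineTags & ")\b[^>]*>", "", RegexOptions.IgnoreCase)
        ' 2. Every remaining tag becomes a space, so block boundaries still separate words.
        text = Regex.Replace(text, "<[^>]*>", " ")
        ' 3. Collapse runs of whitespace into one space and trim both ends.
        Return Regex.Replace(text, "\s+", " ").Trim()
    End Function

    Sub Main()
        Dim sample As String = "<p>Some <b>bo</b>ld text</p>" & Environment.NewLine & "<div>next  block</div>"
        ' Prints: Some bold text next block
        Console.WriteLine(HtmlToText(sample))
    End Sub
End Module
```

Keeping the tag list in one constant means only that line has to change once the proper list of inline elements is settled.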
Cheers
Jack
