Re: Convert HTML to Text

Posted by Jim Higson on 03/14/06 13:52

cawoodm@gmail.com wrote:

> Thank you all for the helpful feedback.
> It is true that regex is a bit of a dark art, but I am writing a crawler
> in VB.NET and not Perl or Python.
> I am not sure if the .NET framework supports HTML parsing in the way I
> want it so I've been applying RegEx.
> Basically I want to strip all tags and then remove excess whitespace so
> that I have "pure" text.
> My current strategy is to replace inline tags with an empty string and
> then replace all other tags with a space:
> HTML = Regex.Replace(HTML, "</?(b|i|u|strong|etc)\b[^>]*>", "")
> HTML = Regex.Replace(HTML, "</?[^>]*>", " ")
> Then I remove excess whitespace:
> HTMLText = Regex.Replace(HTML, "\s+", " ")
> It's the authoritative list (b|u|i|strong|...) that I'm looking for so
> I'll take a look at the DTD recommended.
> Cheers
> Jack
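
For illustration, the three-step strategy quoted above (drop inline tags, turn every other tag into a space, collapse whitespace) can be sketched roughly as follows. This is Python rather than VB.NET, purely to keep it short; the function name html_to_text and the INLINE_TAGS tuple are made up for the example, and the tag list is only the few names mentioned in the post, not an authoritative list. Like any regex approach, it does not handle comments, scripts, or entities.

```python
import re

# Assumed sample only: the post leaves the full inline-tag list open.
INLINE_TAGS = ("b", "i", "u", "strong")


def html_to_text(html: str) -> str:
    """Strip tags with regexes, following the strategy described in the post."""
    inline = "|".join(INLINE_TAGS)
    # Step 1: delete inline tags outright, so "<b>wor</b>ld" becomes "world".
    # The \b stops "b" from matching longer names such as <br> or <body>.
    text = re.sub(rf"</?(?:{inline})\b[^>]*>", "", html, flags=re.IGNORECASE)
    # Step 2: replace every remaining tag with a space so that text from
    # adjacent block elements does not run together.
    text = re.sub(r"</?[^>]*>", " ", text)
    # Step 3: collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

For example, html_to_text("<p>Hello <b>wor</b>ld</p><p>Next</p>") gives "Hello world Next".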

The program I recommended (http://www.mbayer.de/html2text/) is a simple
command line app. You should be able to call it from just about any
language with one line of code. I don't know how you call commands in .NET,
but it shouldn't be difficult.
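
As a rough sketch of what "one line of code" looks like in practice, here is the idea in Python (in .NET the equivalent would go through System.Diagnostics.Process). The function name run_converter is made up for the example. It assumes the program is installed as html2text on the PATH and that it reads HTML from standard input and writes text to standard output when given no file argument; check the program's own documentation before relying on that.

```python
import subprocess


def run_converter(html: str, command=("html2text",)) -> str:
    """Pipe HTML through an external command and return what it prints.

    The default command name is an assumption about how the tool is installed.
    """
    result = subprocess.run(
        list(command),
        input=html,           # sent to the child process on stdin
        capture_output=True,  # collect stdout and stderr
        text=True,            # work with str instead of bytes
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Passing the command as a sequence rather than a shell string avoids quoting problems if arguments are added later.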

--
Jim
