Re: Convert HTML to Text — HTML — IT news, forums, messages

Posted by Jim Higson on 03/14/06 13:52

cawoodm@gmail.com wrote:

> Thank you all for the helpful feedback.
> It is true that RegEx is a bit of a dark art but I am writing a Crawler
> in VB Dot Net and not Perl or Python.
> I am not sure if the .NET framework supports HTML parsing in the way I
> want it so I've been applying RegEx.
> Basically I want to strip all tags and then remove excess whitespace so
> that I have "pure" text.
> My current strategy is to replace inline tags with an empty string and
> then replace all other tags with a space:
> HTML = Regex.Replace(HTML, "</?(b|i|u|strong|etc)\b[^>]*>", "")
> HTML = Regex.Replace(HTML, "</?[^>]*>", " ")
> Then I remove excess whitespace:
> HTMLText = Regex.Replace(HTML, "\s+", " ")
> It's the authoritative list (b|u|i|strong|...) that I'm looking for so
> I'll take a look at the DTD recommended.
> Cheers
> Jack

The program I recommended (http://www.mbayer.de/html2text/) is a simple
command-line app. You should be able to call it from just about any
language with one line of code. I don't know how you call commands in .NET,
but it shouldn't be difficult.
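
To give an idea of what I mean, here is a rough sketch in Python, since that's
what I know; the .NET version should follow the same pattern. It assumes the
html2text binary is on your PATH and that it reads HTML from standard input
when given no file argument. The function name html_to_text and the command
parameter are just names I made up for this example.

```python
import subprocess

def html_to_text(html, command=("html2text",)):
    """Pipe an HTML string through an external converter and return its output.

    `command` is the program to run (assumed here to be html2text on the PATH,
    reading HTML on stdin and writing plain text to stdout).
    """
    result = subprocess.run(
        list(command),
        input=html,            # sent to the child process on stdin
        capture_output=True,   # collect stdout and stderr
        text=True,             # work with str rather than bytes
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

You would then call html_to_text(page_source) for each page the crawler
fetches. Starting a process per page is slower than doing it in-process, but
you get a real HTML parser instead of a pile of regexes.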

--
Jim
