Re: Convert HTML to Text

Posted by mbstevens on 03/09/06 18:20

cawoodm@gmail.com wrote:
> I have written a simple RegEx which strips all tags from an HTML file
> and replaces them with spaces.
>
> This was fine until I noticed that some tags should not be replaced
> with spaces. For example, in the HTML:
> <b>H</b>ello World
> My program will generate "H ello World", effectively breaking a word
> apart.
>
> Where could I get an "authoritative" list of tags which should result
> in a space and which shouldn't? I presume these are mostly block
> elements like div, br, hr, table, etc.
>

I don't have a specific answer to your last paragraph, but:

Have a look at Perl's HTML::Parser and related modules.

In Python, sgmllib will be useful (it was removed in Python 3, where
html.parser fills the same role).

Parsing HTML with simple regexes is more error-prone than using
libraries that have been exercised by many users. Of course, you might
have a good reason to reinvent the wheel in another language, but even
then, a look at the source of these modules might be helpful.
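
As a rough illustration of the library approach, here is a minimal sketch
using Python 3's html.parser. The BREAKING_TAGS set, the TextExtractor
class, and the html_to_text function are all names made up for this
example, and the tag list is an assumption, not an authoritative one:
adjust it to taste. The sketch also does not skip the contents of script
or style elements.

```python
from html.parser import HTMLParser

# Assumed, non-authoritative set of tags that should separate words.
# Inline tags such as <b>, <i>, <span> are deliberately left out.
BREAKING_TAGS = {
    "address", "blockquote", "br", "dd", "div", "dl", "dt",
    "h1", "h2", "h3", "h4", "h5", "h6", "hr", "li", "ol", "p",
    "pre", "table", "td", "th", "tr", "ul",
}


class TextExtractor(HTMLParser):
    """Collects text, emitting a space only around BREAKING_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _maybe_break(self, tag):
        # The parser lowercases tag names before calling the handlers.
        if tag in BREAKING_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._maybe_break(tag)

    def handle_endtag(self, tag):
        self._maybe_break(tag)

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flushes any text still buffered by the parser
    # Collapse runs of whitespace and trim the ends.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(html_to_text("<b>H</b>ello World"))
    print(html_to_text("<div>one</div><div>two</div>"))
```

With this approach the inline <b> tag no longer splits "Hello", while the
two div elements still end up separated by a space.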
--
mbstevens
http://www.mbstevens.com/
