Reply to Re: Convert HTML to Text

Posted by mbstevens on 03/09/06 18:20

cawoodm@gmail.com wrote:
> I have written a simple RegEx which strips all tags from an HTML file
> and replaces them with spaces.
>
> This was fine until I noticed that some tags should not be replaced
> with spaces. For example in the HTML:
> <b>H</b>ello World
> My program will generate "H ello World" effectively breaking a word
> apart.
>
> Where could I get an "authoritative" list of tags which should result
> in a space and which shouldn't? I presume these are mostly block
> elements like div, br, hr, table, etc.
>

I don't have a specific answer to your last paragraph, but:

Have a look at Perl's HTML::Parser and related modules.

In Python, sgmllib will be useful.

Parsing HTML with simple regexes is more error-prone than using
libraries that have been exercised by many users. Of course, you might
have a good reason to reinvent the wheel for another language, but even
then, a look at the source of these modules might be helpful.
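
To make the parser-based approach concrete, here is a minimal sketch in Python. It uses html.parser from the standard library, since sgmllib was removed in Python 3. The BLOCK_TAGS set, the TextExtractor class and the html_to_text function are names made up for this illustration, and the tag list is only a guess at common block-level elements, not an authoritative one. It also makes no attempt to skip the contents of script or style elements.

```python
from html.parser import HTMLParser

# Assumption: an illustrative, NOT authoritative, set of tags that
# should act as word separators. Extend or trim it as needed.
BLOCK_TAGS = {
    "p", "div", "br", "hr", "table", "tr", "td", "th",
    "ul", "ol", "li", "h1", "h2", "h3", "h4", "h5", "h6",
    "blockquote", "pre",
}


class TextExtractor(HTMLParser):
    """Collect text, emitting a space only around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser.
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Hypothetical helper: strip tags and collapse runs of whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


print(html_to_text("<b>H</b>ello World"))    # prints: Hello World
print(html_to_text("<p>One</p><p>Two</p>"))  # prints: One Two
```

Inline tags such as b are dropped without inserting anything, so "H" and "ello" stay joined, while the block-level tags still separate words.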
--
mbstevens
http://www.mbstevens.com/
