Posted by Jim Higson on 03/10/06 16:48
cawoodm@gmail.com wrote:
> I have written a simple RegEx which strips all tags from an HTML file
> and replaces them with spaces.
>
> This was fine until I noticed that some tags should not be replaced
> with spaces. For example in the HTML:
> <b>H</b>ello World
> My program will generate "H ello World", effectively breaking a word
> apart.
>
> Where could I get an "authoritative" list of tags which should result
> in a space and which shouldn't? I presume these are mostly block
> elements like div, br, hr, table, etc.
How about using this?
http://www.mbayer.de/html2text/
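
If you would rather keep it in your own script, here is a rough sketch of the idea in Python, using the standard library's html.parser instead of a regex. The BLOCK_TAGS set, the TagStripper class and the strip_tags function are names I made up for illustration, and the tag set is only my guess at a reasonable starting point, not an authoritative list. Tags in the set become a space; every other tag is dropped without one.

```python
from html.parser import HTMLParser

# Assumed, non-exhaustive set of tags treated as word boundaries.
# This is an illustrative guess, not an authoritative list.
BLOCK_TAGS = {
    "address", "blockquote", "br", "dd", "div", "dl", "dt",
    "h1", "h2", "h3", "h4", "h5", "h6", "hr", "li", "ol",
    "p", "pre", "table", "td", "th", "tr", "ul",
}


class TagStripper(HTMLParser):
    """Collects text, emitting a space only for tags in BLOCK_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _boundary(self, tag):
        # The parser lowercases tag names before calling the handlers.
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace and trim the ends.
    return " ".join("".join(parser.parts).split())


print(strip_tags("<b>H</b>ello World"))
print(strip_tags("<p>one</p><p>two</p>"))
```

With this, inline tags such as b no longer split a word, while p, br and the like still separate the text on either side.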
--
Jim