Posted by mbstevens on 03/09/06 18:20
cawoodm@gmail.com wrote:
> I have written a simple RegEx which strips all tags from an HTML file
> and replaces them with spaces.
>
> This was fine until I noticed that some tags should not be replaced
> with spaces. For example in the HTML:
> <b>H</b>ello World
> My program will generate "H ello World" effectively breaking a word
> apart.
>
> Where could I get an "authoritative" list of tags which should result
> in a space and which shouldn't? I presume these are mostly block
> elements like div, br, hr, table, etc.
>
I don't have a specific answer to your last paragraph, but
have a look at Perl's HTML::Parser and related modules.
In Python, sgmllib will be useful.
Parsing HTML with simple regexes is more error-prone than using
libraries that have been exercised by many users. Of course, you might
have a good reason to reinvent the wheel for another language, but even
then a look at the source of these modules might be helpful.
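
For what it's worth, here is a rough sketch of the parser-based approach
in Python. sgmllib is no longer in the Python 3 standard library, so this
uses html.parser instead. The names TextExtractor, strip_tags and
BREAKING_TAGS are made up for illustration, and the tag set is only my
guess at a reasonable starting point, not an authoritative list. Tags in
the set become a space; all other tags (b, i, span, ...) are dropped
without one.

```python
from html.parser import HTMLParser

# Assumption: an illustrative, non-authoritative set of tags that
# interrupt the flow of text. Adjust it to suit your documents.
BREAKING_TAGS = {
    "address", "blockquote", "br", "dd", "div", "dl", "dt",
    "h1", "h2", "h3", "h4", "h5", "h6", "hr", "li", "ol",
    "p", "pre", "table", "td", "th", "tr", "ul",
}


class TextExtractor(HTMLParser):
    """Collect text, emitting a space only around tags in BREAKING_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _maybe_space(self, tag):
        # HTMLParser passes tag names in lower case.
        if tag in BREAKING_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._maybe_space(tag)

    def handle_endtag(self, tag):
        self._maybe_space(tag)

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text of `html` with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


print(strip_tags("<b>H</b>ello World"))            # Hello World
print(strip_tags("<div>one</div><div>two</div>"))  # one two
```

Note that this collapses all whitespace, so it is not suitable as-is if
you need to preserve the layout inside pre elements.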
--
mbstevens
http://www.mbstevens.com/