Posted by shimmyshack on 11/25/07 22:00
On Nov 25, 9:48 pm, suzanne.bo...@gmail.com wrote:
> Hi
>
> I have an html file with headings followed by one or more paragraphs
> like this
>
> <h2>blah blah 1</h2>
> <p>more blah blah blah</p>
>
> <h2>blah blah 2</h2>
> <p>more blah blah blah</p>
> <p>even more blah blah blah</p>
>
> I'd like to extract the text of the headings and the related
> paragraphs and insert them into a database. So far I've managed to
> get the heading text but can't figure out how to get the associated
> paragraphs. I've been using regular expressions, here is the
> expression I have so far <h2[.]*>(.+?)</h2>(.+?). This gets the text
> of the headings but not the paragraphs and now I'm basically stumped.
>
> Any help would be appreciated.
You could do this another way, although regular expressions are a great tool.
Have you thought about using XML to do this?
Since you are obviously starting with something which is basically
XML, why not just load the string as XML (topping and tailing it if
needed) and then extract what you want using XPath?
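
Here is a rough sketch of that idea in Python, just to show the shape of it. It assumes the HTML is well-formed enough to parse as XML (every tag closed, no HTML-only entities such as &nbsp;). The names `fragment` and `extract_sections` and the `<root>` wrapper are made up for illustration. The standard-library ElementTree only supports a small subset of XPath, so instead of a sibling query this walks the top-level elements in document order and attaches each paragraph to the heading before it.

```python
import xml.etree.ElementTree as ET

# Sample input copied from the question; in practice this would be the file contents.
fragment = """
<h2>blah blah 1</h2>
<p>more blah blah blah</p>

<h2>blah blah 2</h2>
<p>more blah blah blah</p>
<p>even more blah blah blah</p>
"""


def extract_sections(html_fragment):
    """Return a list of {"heading": str, "paragraphs": [str, ...]} dicts."""
    # "Top and tail" the fragment: XML needs exactly one root element.
    root = ET.fromstring("<root>" + html_fragment + "</root>")

    sections = []
    for child in root:  # direct children, in document order
        # itertext() also picks up text inside nested tags such as <b> or <a>.
        text = "".join(child.itertext()).strip()
        if child.tag == "h2":
            # A heading starts a new section.
            sections.append({"heading": text, "paragraphs": []})
        elif child.tag == "p" and sections:
            # A paragraph belongs to the most recent heading.
            sections[-1]["paragraphs"].append(text)
    return sections


sections = extract_sections(fragment)
for section in sections:
    print(section["heading"], section["paragraphs"])
```

Each entry in `sections` then maps onto one heading row plus its paragraph rows for the database insert. If the real file is not valid XML, the parser will raise an error, and a more forgiving HTML parser would be needed instead.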