Posted by trihanhcie on 07/05/06 14:41
Thanks for your advice :D The 'ungreedy' solution works for the
moment ;)
I will try the library later :)
Tim Martin wrote:
> trihanhcie@gmail.com wrote:
> > Hi,
> >
> > I would like to extract the text in an HTML file.
> > For the moment, I'm trying to get all the text between <td> and </td>. I
> > used a regular expression because I don't know the format between
> > <td> and </td>.
> >
> > It can be :
> > <td> text1 </td>
> > or
> > <td>
> > text1
> > </td>
> > or anything else
> >
> > eregi("<td(.*)>(.*)(</td>?)",$text,$regtext);
> >
> > The problem is that, if I have
> > <td> text</td>
> > <td>text2</td>
> >
> > regtext will return text</td><td>text2.
> >
> > How can I change the expression so that it stops at the first occurrence
> > of </td>?
>
> If that's all you want to change, then you can just add the '?' (minimal
> match) qualifier to the '.*' within your regexp. By default, the '*'
> operator is "greedy" (that is, matches as much data as possible). If you
> replace that with '.*?' it will find the minimum amount of text that
> satisfies your requirements.
>
> If you want heavier-duty HTML parsing, you're probably better off looking
> for a library rather than trying to do it all by hand anyway, as the
> other poster suggested.
>
> Tim
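
For anyone landing on this thread later, here is a minimal sketch of the non-greedy approach Tim describes. It is not the exact code from the thread. It assumes a switch from eregi() to preg_match_all(), because '.*?' is a PCRE feature and POSIX-style patterns do not define it. The sample $text and the $cells variable are made up for illustration.

```php
<?php
// Sample input (hypothetical): two cells, one of them spread over several lines.
$text = "<td> text</td>\n<td>\ntext2\n</td>";

// '.*?' is the non-greedy form, so each match stops at the first </td>.
// The 's' modifier lets '.' match newlines, 'i' keeps the case-insensitivity
// of the original eregi() call, and '\b[^>]*' skips any attributes on <td>.
preg_match_all('#<td\b[^>]*>(.*?)</td>#is', $text, $matches);

// $matches[1] holds the captured contents of each cell; trim the whitespace.
$cells = array_map('trim', $matches[1]);

print_r($cells);
```

With the greedy '.*' the single match would run from the first <td> to the last </td>. The non-greedy version returns one entry per cell. Nested tables will still confuse it, which is where a real HTML parsing library comes in.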