 Posted by Richard Levasseur on 07/06/06 18:48 
crescent_au@yahoo.com wrote: 
> Hi all, 
> 
> I've been trying unsuccessfully to get the text from html page. Html 
> tag that I'm interested in looks like this: 
> 
> <a class=link 
> href="http://www.something.com/_something.php?type=cart">Shopping 
> Cart</a> 
> <div><em class=newentry><a href=http://nothing.com>New 
> Age</a></em></div> 
> 
> From the above tag, I want to extract "Shopping Cart". I'm not very 
> good with RE. I tried this: 
> $lines = file_get_contents("http://theabovetag.com/page.html"); 
> preg_match_all("/(<a\ class\=link\ href\=(.*)>)(<\/a>)/", $lines, 
> $matches1); 
> 
> The above RE gives me "Shopping Cart" plus "New Age" as well. I just 
> want "Shopping Cart". What am I doing wrong? My RE is somehow ignoring 
> </a> tag right after Shopping Cart and instead accepting </a> after New 
> Age. Please help! 
 
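This is most likely due to the greediness of *. By default, quantifiers 
such as * and + are greedy: they match as much text as they can while 
still letting the overall pattern succeed. To make a quantifier lazy, so 
that it matches as little as possible, put '?' right after it. 
Given the string: "<a>text</a>more</a>" 
<a>.*</a> matches "<a>text</a>more</a>" 
<a>.*?</a> matches "<a>text</a>"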
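
To see the difference in PHP, here is a rough sketch rather than a 
drop-in fix. It assumes the markup sits on a single line and is held in 
a string ($html) instead of being fetched with file_get_contents(). The 
pattern is my own simplification with a separate capture group for the 
link text, not your original expression. If the link text can span 
lines, you would probably also need the 's' modifier so that '.' 
matches newlines.

```php
<?php
// The toy string from the explanation above.
$subject = '<a>text</a>more</a>';

// Greedy: .* extends to the last </a> in the subject.
preg_match('/<a>.*<\/a>/', $subject, $greedy);

// Lazy: .*? stops at the first </a> that lets the match succeed.
preg_match('/<a>.*?<\/a>/', $subject, $lazy);

// Markup from the question, inlined on one line (an assumption; the
// real page would come from file_get_contents()).
$html = '<a class=link href="http://www.something.com/_something.php?type=cart">Shopping Cart</a>'
      . '<div><em class=newentry><a href=http://nothing.com>New Age</a></em></div>';

// Lazy quantifiers keep the match inside the first class=link anchor;
// group 1 captures only the link text.
preg_match_all('/<a class=link href=.*?>(.*?)<\/a>/', $html, $matches);

echo $greedy[0], "\n";     // <a>text</a>more</a>
echo $lazy[0], "\n";       // <a>text</a>
echo $matches[1][0], "\n"; // Shopping Cart
```

$matches[1] holds the captured link text for every class=link anchor 
found, so with the sample markup it contains just "Shopping Cart".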
 