 Posted by Tim Martin on 07/05/06 14:35 
trihanhcie@gmail.com wrote: 
> Hi, 
>  
> I would like to extract the text in an HTML file. 
> For the moment, I'm trying to get all text between <td> and </td>. I 
> used a regular expression because I don't know the format between 
> <td> and </td>. 
>  
> It can be : 
> <td> text1 </td> 
> or 
> <td> 
> text1 
> </td> 
> or anything else 
>  
> eregi("<td(.*)>(.*)(</td>?)",$text,$regtext); 
>  
> The problem is that, if I have 
> <td> text</td> 
> <td>text2</td> 
>  
> regtext will return text</td><td>text2. 
>  
> How can I change the expression so that it stops at the first occurrence 
> of </td>? 
 
If that's all you want to change, you can add the '?' (minimal match) 
quantifier after the '.*' in your regexp. By default, the '*' operator 
is "greedy" (that is, it matches as much text as possible). If you 
replace it with '.*?', it will match the minimum amount of text that 
satisfies the pattern. Note that the POSIX syntax used by eregi() has 
no minimal-match form, so you will also need to switch to the PCRE 
functions (preg_match() with the 'i' modifier). 
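
As a rough sketch of the minimal-match approach, assuming the PCRE 
functions are available: the sample string, the '[^>]*' attribute 
match, and the $cells variable are illustrative choices, not part of 
the original code. 

```php
<?php
// Sample input modelled on the question: one cell on a single line,
// one cell spread over several lines.
$text = "<td> text</td>\n<td>\ntext2\n</td>";

// '(.*?)' is the minimal (lazy) match, so each capture stops at the
// first </td>. The 's' modifier lets '.' match newlines and 'i' keeps
// the case-insensitivity of eregi().
preg_match_all('/<td[^>]*>(.*?)<\/td>/is', $text, $regtext);

// $regtext[1] holds the captured cell contents; trim the whitespace.
$cells = array_map('trim', $regtext[1]);
print_r($cells);
```

preg_match_all() is used here rather than preg_match() so that every 
cell is collected, not just the first one. 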
 
If you want heavier-duty HTML parsing, you're probably better off looking 
for a library rather than trying to do it all by hand anyway, as the 
other poster suggested. 
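
One possible library route, sketched with PHP's bundled DOM extension 
(an assumption on my part; the other poster may have had a different 
library in mind). The sample markup and the $cells variable are 
illustrative only. 

```php
<?php
// Sample markup modelled on the question, wrapped in a table.
$html = "<table><tr><td> text</td>\n<td>\ntext2\n</td></tr></table>";

$doc = new DOMDocument();
// '@' silences the warnings loadHTML() emits for sloppy markup.
@$doc->loadHTML($html);

// Collect the text of every <td> element, whatever its layout.
$cells = array();
foreach ($doc->getElementsByTagName('td') as $td) {
    $cells[] = trim($td->textContent);
}
print_r($cells);
```

The parser copes with attributes, nested tags and line breaks inside a 
cell, which is where a hand-written regexp tends to go wrong. 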
 
Tim
 
  