Re: web harvesting

Posted by Rik on 06/24/06 11:09

McHenry wrote:
> I have formulated the follow regex... (first regex ever) and it seems
> to work when I test it using http://www.regexlib.com/RETester.aspx
> however when i try to implement it into my php code it fails:
>
> <div class=\"Overview\">((?s).*)(<div
> class=\"header\">((?s).*)</div>)((?s).*)(<div
> class=\"content\">((?s).*)</div>)((?s).*)<div class=\"break\">
>
>
> When I try to run the code I receive the following error:
> PHP Warning: Unknown modifier '(' in
> /var/www/html/research/processweb.php on line 98

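The first character of the pattern is taken as the delimiter. Yours starts
with <, which PHP pairs with the first >, so your regex stops right after
\"Overview\" and everything that follows is treated as modifiers. That is
why it complains about the '('.
I assume your '***SNIP***'s are the actual content you'd like to obtain?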
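
A minimal sketch of the delimiter problem, assuming a made-up $html string and a cut-down pattern (neither is from the original post). The first call fails the same way as the quoted warning, the second wraps the same regex in % delimiters:

```php
<?php
// Hypothetical sample input, not taken from the original page.
$html = '<div class="Overview"><div class="header">Title</div></div>';

// No explicit delimiter: PHP pairs the leading < with the first >,
// so the regex is only: div class="Overview"
// and the ( that follows is read as a modifier.
$broken = '<div class="Overview">(.*)</div>';
$result = @preg_match($broken, $html); // @ only hides the warning here

// Same regex wrapped in % delimiters, with the s modifier added.
$fixed = '%<div class="Overview">(.*)</div>%s';
$ok = preg_match($fixed, $html, $m);

var_dump($result); // bool(false)
var_dump($ok);     // int(1)
echo $m[1], "\n";  // <div class="header">Title</div>
```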

The Society for Understandable Regular Expressions brings you:
$pattern = '%<div[^>]*?class="overview"[^>]*?>       #start of overview
.*?                                                  #allow random content between starting overview and header
<div[^>]*?class="header"[^>]*?>                      #start of header
(?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*)          #get a named match from the header
</div>                                               #end of header
.*?                                                  #once again allow random content
<div[^>]*?class="content"[^>]*?>                     #start of content
(?P<content>.*?(?:<div[^>]*?>.*?</div>.*?)*)         #get a named match from the content
</div>                                               #end of content
.*?                                                  #I am not sure whether you need the code from this point on
<div[^>]*?class="break"[^>]*?></div>                 #check for break
.*?                                                  #some random content
</div>                                               #end of overview
%six';
// Every comment has to stay on one line: with the x modifier a comment ends
// at the line break, so wrapped comment text would become part of the regex.
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
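
To show what ends up in $matches with PREG_SET_ORDER, here is a small sketch. It assumes a cut-down version of the pattern (header and content only) and invented sample HTML, so treat it as an illustration rather than the full solution:

```php
<?php
// Hypothetical input: two overview blocks on one page.
$page = '<div class="Overview"><div class="header">One</div>'
      . '<div class="content">Alpha</div><div class="break"></div></div>' . "\n"
      . '<div class="Overview"><div class="header">Two</div>'
      . '<div class="content">Beta</div><div class="break"></div></div>';

// Cut-down pattern: only the two named groups, no nested-div handling.
$pattern = '%<div[^>]*?class="header"[^>]*?>   #start of header
(?P<header>.*?)                                #named match for the header
</div>                                         #end of header
.*?                                            #anything in between
<div[^>]*?class="content"[^>]*?>               #start of content
(?P<content>.*?)                               #named match for the content
</div>                                         #end of content
%six';

$count = preg_match_all($pattern, $page, $matches, PREG_SET_ORDER);

// With PREG_SET_ORDER each element of $matches is one complete match,
// and the named groups are available under their names.
foreach ($matches as $set) {
    echo $set['header'], ' => ', $set['content'], "\n";
}
```

Without PREG_SET_ORDER the array is grouped the other way round: $matches['header'] would hold all headers and $matches['content'] all contents.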

Some items explained:
% is chosen as delimiter of the regex here. Usually / is chosen, but as this
is HTML it would constantly have to be escaped. Choosing another delimiter
saves work.
[^>]*? allows a div to have other attributes besides the class name, so it
will still be picked up.
(?:<div[^>]*?>.*?</div>.*?)* allows div's to be nested (one level deep) in the
header/content div, so the whole div is still matched, not just up to where
the first child div closes. (?: here means it's a non-capturing group: we
won't see it back in $matches, because we don't need it separately as it is
already contained in the named match.
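
A rough sketch of what these two pieces buy you, assuming an invented snippet of HTML with an extra attribute and one child div. The naive pattern stops at the first </div>, the one with the non-capturing group takes the child div along:

```php
<?php
// Hypothetical input: extra id attribute, one nested child div.
$html = '<div id="x" class="content">a <div>b</div> c</div>';

// Plain lazy match: ends at the first closing div it meets.
$naive = '%<div[^>]*?class="content"[^>]*?>(?P<content>.*?)</div>%si';

// Same, but child divs are consumed as whole units by the (?: ... )* group.
$nested = '%<div[^>]*?class="content"[^>]*?>'
        . '(?P<content>.*?(?:<div[^>]*?>.*?</div>.*?)*)</div>%si';

preg_match($naive, $html, $a);
preg_match($nested, $html, $b);

echo $a['content'], "\n"; // a <div>b
echo $b['content'], "\n"; // a <div>b</div> c

// The non-capturing group adds no entry of its own to the result array.
var_dump(isset($b[2]));   // bool(false)
```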
Modifiers:
s = . also matches \n
i = case-insensitive
x = whitespace in the pattern is ignored, so we can use line breaks &
comments in our regex to keep it clear
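
Each modifier can be checked on its own with a tiny test. The patterns and subjects below are made up for the purpose; the variable names are just labels:

```php
<?php
// s: without it the dot stops at a newline.
$dot_plain = preg_match('%a.b%', "a\nb");   // 0
$dot_s     = preg_match('%a.b%s', "a\nb");  // 1

// i: ignore upper/lower case.
$case_plain = preg_match('%overview%', 'Overview');   // 0
$case_i     = preg_match('%overview%i', 'Overview');  // 1

// x: whitespace and # comments inside the pattern are skipped.
$spaced    = "%over   # first half\n view  # second half\n%";
$ext_plain = preg_match($spaced, 'overview');        // 0
$ext_x     = preg_match($spaced . 'x', 'overview');  // 1

var_dump($dot_plain, $dot_s, $case_plain, $case_i, $ext_plain, $ext_x);
```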

Grtz,
--
Rik Wasmus
