Re: web harvesting

Posted by McHenry on 06/24/06 14:06

"Rik" <luiheidsgoeroe@hotmail.com> wrote in message
news:c538e$449d1d18$8259c69c$3417@news2.tudelft.nl...
> McHenry wrote:
>> I have formulated the following regex... (first regex ever) and it seems
>> to work when I test it using http://www.regexlib.com/RETester.aspx
>> however when I try to implement it in my PHP code it fails:
>>
>> <div class=\"Overview\">((?s).*)(<div
>> class=\"header\">((?s).*)</div>)((?s).*)(<div
>> class=\"content\">((?s).*)</div>)((?s).*)<div class=\"break\">
>>
>>
>> When I try to run the code I receive the following error:
>> PHP Warning: Unknown modifier '(' in
>> /var/www/html/research/processweb.php on line 98
>
> The first character is taken as delimiter, so your regex stops after
> \"Overview\">, and then treats everything as a modifier.
> I assume your '***SNIP***'s are the actual content you'd like to obtain?
>
> The Society for Understandable Regular Expressions brings you:
> $pattern = '%<div[^>]*?class="overview"[^>]*?> #start of overview
> .*? #allow random content between starting overview and header
> <div[^>]*?class="header"[^>]*?> #start of header
> (?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*) #get a named match from the header
> </div> #end of header
> .*? #once again allow random content
> <div[^>]*?class="content"[^>]*?> #start of content
> (?P<content>.*?(?:<div[^>]*?>.*?</div>.*?)*) #get a named match from the content
> </div> #end of content
> .*? #I am not sure whether you need the code from this point on
> <div[^>]*?class="break"[^>]*?></div> #check for break
> .*? # some random content
> </div> #end of overview
> %six';
> preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
>
> Some items explained:
> % is chosen as the delimiter of the regex here. Usually / is chosen, but
> as this is HTML it would constantly have to be escaped. Choosing another
> delimiter saves work.
> [^>]*? allows a div to have other attributes besides the class name, so
> it will still be picked up.
> (?:<div[^>]*?>.*?</div>.*?)* allows divs to be nested in the
> header/content div, so the whole div is still matched, not just up to
> where the first child div closes. (?: means it's a non-capturing group:
> we won't see it back in $matches, because we don't need it for the match,
> as it is already contained in the named match.
> Modifiers:
> s = . matches \n
> i = case-insensitive
> x = we can use line breaks & comments in our regex to keep it clear
>
> Grtz,
> --
> Rik Wasmus
>
>
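
The delimiter point quoted above can be reproduced in a few lines. This is only a minimal sketch: the sample markup and the variable names ($html, $broken, $fixed) are made up for illustration and are not from the original processweb.php.

```php
<?php
// Hypothetical sample markup, invented for this illustration.
$html = '<div class="Overview"><div class="header">Title</div></div>';

// No explicit delimiter: PHP takes the leading "<" as the opening delimiter
// and the first ">" as the closing one, then reads the rest as modifiers.
$broken = '<div class="Overview">(.*)</div>';
$result = @preg_match($broken, $html); // @ hides the "Unknown modifier '('" warning
var_dump($result);                     // false means the pattern did not compile

// Same pattern wrapped in % ... %, with the s modifier so . also matches \n.
$fixed = '%<div class="Overview">(.*)</div>%s';
preg_match($fixed, $html, $m);
echo $m[1], "\n";
```

Any character that does not occur in the pattern would do as the delimiter; % is used here only because the quoted reply uses it.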


WOW Rik... it's a little different from my attempt :)

Thank you very much as this would have taken me a few... YEARS !

Not to question your answer, but I am trying to understand what you have
provided, and I am unable to get the pattern to work here for learning purposes:
http://www.regexlib.com/RETester.aspx

Should I not rely on this tool, or am I missing something?

Thanks once again...
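
For anyone following along, here is a rough sketch of running the quoted pattern directly in PHP and reading the named matches. The sample markup in $content is an assumption made up for this example; the pattern itself is the one from the quoted reply, with each comment kept on a single line, because in x mode a comment ends at the line break and wrapped comment text would be read as part of the pattern.

```php
<?php
// Hypothetical sample markup, invented for this illustration.
$content = <<<'HTML'
<div class="Overview">
  <div class="header">Title</div>
  <div class="content">Body <div>nested</div> text</div>
  <div class="break"></div>
</div>
HTML;

// Pattern from the quoted reply; % is the delimiter, "six" are the modifiers.
$pattern = '%<div[^>]*?class="overview"[^>]*?>       # start of overview
    .*?                                              # anything up to the header
    <div[^>]*?class="header"[^>]*?>                  # start of header
    (?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*)      # named match: header
    </div>                                           # end of header
    .*?                                              # anything up to the content
    <div[^>]*?class="content"[^>]*?>                 # start of content
    (?P<content>.*?(?:<div[^>]*?>.*?</div>.*?)*)     # named match: content
    </div>                                           # end of content
    .*?
    <div[^>]*?class="break"[^>]*?></div>             # check for break
    .*?
    </div>                                           # end of overview
    %six';

$count = preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

// With PREG_SET_ORDER each element of $matches is one complete match.
foreach ($matches as $set) {
    echo 'header:  ', $set['header'], "\n";
    echo 'content: ', $set['content'], "\n";
}
```

A possible reason the online tester rejects the same pattern, offered as an assumption rather than something confirmed in this thread: that page appears to be a .NET tool, and .NET regular expressions write named groups as (?<header>...) rather than (?P<header>...), and take options separately instead of the %...%six wrapper that PHP expects.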

 
