Re: web harvesting

Posted by McHenry on 06/25/06 06:02

"McHenry" <mchenry@mchenry.com> wrote in message
news:449de5f2$0$6645$5a62ac22@per-qv1-newsreader-01.iinet.net.au...
>
> "Rik" <luiheidsgoeroe@hotmail.com> wrote in message
> news:c538e$449d1d18$8259c69c$3417@news2.tudelft.nl...
>> McHenry wrote:
>>> I have formulated the following regex... (first regex ever) and it seems
>>> to work when I test it using http://www.regexlib.com/RETester.aspx
>>> however when I try to implement it in my PHP code it fails:
>>>
>>> <div class=\"Overview\">((?s).*)(<div
>>> class=\"header\">((?s).*)</div>)((?s).*)(<div
>>> class=\"content\">((?s).*)</div>)((?s).*)<div class=\"break\">
>>>
>>>
>>> When I try to run the code I receive the following error:
>>> PHP Warning: Unknown modifier '(' in
>>> /var/www/html/research/processweb.php on line 98
>>
>> The first character is taken as delimiter, so your regex stops after
>> \"Overview\">, and then treats everything as a modifier.
>> I assume your '***SNIP***'s are the actual content you'd like to obtain?
>>
>> The Society for Understandable Regular Expressions brings you:
>> $pattern = '%<div[^>]*?class="overview"[^>]*?> #start of overview
>> .*? #allow random content between starting overview and header
>> <div[^>]*?class="header"[^>]*?> #start of header
>> (?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*) #get a named match from the header
>> </div> #end of header
>> .*? #once again allow random content
>> <div[^>]*?class="content"[^>]*?> #start of content
>> (?P<content>.*?(?:<div[^>]*?>.*?</div>.*?)*) #get a named match from the content
>> </div> #end of content
>> .*? #I am not sure whether you need the code from this point on
>> <div[^>]*?class="break"[^>]*?></div> #check for break
>> .*? # some random content
>> </div> #end of overview
>> %six';
>> preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
>>
>> Some items explained:
>> % is chosen as the delimiter of the regex here. Usually / is chosen, but as
>> this is HTML it would constantly have to be escaped. Choosing another
>> delimiter saves work.
>> [^>]*? allows a div to have other attributes besides the class name, so it
>> will still be picked up.
>> (?:<div[^>]*?>.*?</div>.*?)* allows divs to be nested in the header/content
>> div, so the whole div is still matched, not just up to where the first
>> child div closes. (?: here means it's a non-capturing group: we won't see
>> it back in $matches, because we don't need it for the match, as it is
>> already contained in the named match.
>> Modifiers:
>> s = . matches \n
>> i = case-insensitive
>> x = we can use line breaks & comments in our regex to keep it clear
>>
>> Grtz,
>> --
>> Rik Wasmus
>>
>>
>
> Rik,
>
> This works great however when I try to view the contents of the array I am
> only presented with a single element:
>
> Array
> (
> [0] => Array
> (
> [0] => <div class="overview">
> )
>
> )
>
>
>
> Here is the code I am using:
>
> //Extract the content from the page
> $pattern='%<div[^>]*?class="overview"[^>]*?> #start of overview ';
> $pattern=$pattern.'.*? #allow random content between starting overview and header ';
> $pattern=$pattern.'<div[^>]*?class="header"[^>]*?> #start of header ';
> $pattern=$pattern.'(?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*) #get a named match from the header ';
> $pattern=$pattern.'</div> #end of header ';
> $pattern=$pattern.'.*? #once again allow random content ';
> $pattern=$pattern.'<div[^>]*?class="content"[^>]*?> #start of content ';
> $pattern=$pattern.'(?P<content>.*?(?:<div[^>]*?>.*?</div>.*?)*) #get a named match from the content ';
> $pattern=$pattern.'</div> #end of content ';
> $pattern=$pattern.'.*? #I am not sure whether you need the code from this point on ';
> $pattern=$pattern.'<div[^>]*?class="break"[^>]*?></div> #check for break ';
> $pattern=$pattern.'.*? #some random content ';
> $pattern=$pattern.'</div> #end of overview ';
> $pattern=$pattern.'%six';
>
> if (preg_match_all($pattern, $content, $matches, PREG_PATTERN_ORDER)) {
>     print_r($matches);
> }
>

Maybe it should have been obvious, but I missed it. Anyway, I removed the
comments from inside the pattern string and it now works.
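
For anyone hitting the same thing, the likely cause is that with the x modifier a # comment runs until the next newline. When the pattern is concatenated from single-line strings there is no newline, so everything after the first # is treated as comment, which would explain why only the opening overview tag was matched. The sketch below is only an illustration: the sample $html and the shortened pattern are assumptions, not the real page or the full pattern.

```php
<?php
// Hypothetical sample input, not taken from the original page.
$html = '<div class="overview"><div class="header">Title</div></div>';

// Built without newlines: under the x modifier the first # comments out
// the rest of the pattern, so only the opening tag is matched.
$broken  = '%<div[^>]*?class="overview"[^>]*?> #start of overview ';
$broken .= '.*?<div[^>]*?class="header"[^>]*?> #start of header ';
$broken .= '(?P<header>.*?)</div>%six';

// Same pattern, but each comment is ended by an explicit newline.
$fixed  = '%<div[^>]*?class="overview"[^>]*?> #start of overview' . "\n";
$fixed .= '.*?<div[^>]*?class="header"[^>]*?> #start of header' . "\n";
$fixed .= '(?P<header>.*?)</div> #named match from the header' . "\n";
$fixed .= '%six';

preg_match($broken, $html, $m1);
preg_match($fixed, $html, $m2);

print_r($m1); // only element 0, no 'header' key
print_r($m2); // includes the 'header' key
```

Writing the pattern as one multi-line single-quoted string, as in Rik's post, has the same effect as the explicit "\n" here, so the comments can stay.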

I love the concept of the named match, which makes it very easy to reference
in an array. Very powerful.
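
As a small sketch of how the named match shows up in the array, here is the difference between PREG_SET_ORDER (used in Rik's post) and PREG_PATTERN_ORDER (used in my code). The sample $html and the simplified pattern are made up for illustration.

```php
<?php
// Hypothetical sample input for illustration only.
$html = '<div class="header">First</div><div class="header">Second</div>';
$pattern = '%<div[^>]*?class="header"[^>]*?>(?P<header>.*?)</div>%si';

// PREG_SET_ORDER: one array per match, each with its own 'header' key.
preg_match_all($pattern, $html, $sets, PREG_SET_ORDER);
foreach ($sets as $set) {
    echo $set['header'], "\n";
}

// PREG_PATTERN_ORDER: one array per group, holding that group for all matches.
preg_match_all($pattern, $html, $groups, PREG_PATTERN_ORDER);
echo implode(', ', $groups['header']), "\n";
```

With PREG_SET_ORDER the name is the second index ($sets[0]['header']); with PREG_PATTERN_ORDER it is the first ($groups['header'][0]).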

Within the header I have a field I would like to capture between
<h1>field_here</h1>. I suspected I could achieve this by replacing:
(?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*)

with

(?P<header>.*?(?:<h2[^>]*?>.*?</h2>.*?)*)

however nothing changed when I printed the array value of 'header'?
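
One possible explanation, as far as I can tell: the (?:...)* part is non-capturing and optional, so swapping div for h2 inside it does not add anything to $matches, and 'header' still holds everything up to the closing </div>. A hedged alternative is to run a second, smaller match on the captured header. In the sketch below the $header string, the group name 'field' and the choice to accept any heading level (the text above mentions both h1 and h2) are assumptions for illustration.

```php
<?php
// Hypothetical header content; the real page markup is not shown in the thread.
$header = '<h1 class="title">field_here</h1><p>other text</p>';

// Accept any heading level h1-h6 and require the same level on the closing tag.
$pattern = '%<h([1-6])[^>]*>(?P<field>.*?)</h\1>%si';

if (preg_match($pattern, $header, $m)) {
    echo $m['field'], "\n";
}
```

In the real code, $header would be the value taken from the 'header' entry of $matches.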

 
