Re: web harvesting — All PHP — IT news, forums, messages

Posted by McHenry on 06/26/06 03:33

"Rik" <luiheidsgoeroe@hotmail.com> wrote in message
news:8b14a$449ebfe2$8259c69c$19227@news2.tudelft.nl...
> McHenry wrote:
>>> The comment is between # and a newline. As you concat everything
>>> instead of just newlining it inside the quotes, the expression
>>> breaks. Why do you concat by the way?
>>
>> I thought this was the way I had to do it... (new to php, new to
>> Linux, new to many things)
>> Now I understand, I thought the comments were part of the regex and
>> couldn't understand how it worked... :)
>
> Hehe, yeah, then it gets tricky :-).
>
>>> That's correct behaviour, (:? means a NON capturing pattern.
>>
>> Your original solution used (?: not (:? is there a difference or is
>> this a typo ?
>
> Typo, should be (?:, (:? would mean 'capture a ":" zero or one time' :-)
>
>>> If you only want the <h1> field from the header-div:
>>>
>>> <div[^>]*?class="header"[^>]*>
>>> .*?(:?<div[^>]*>.*?</div>.*?)*?
>>> <h1>(?P<header>.*?)</h1>
>>> .*?(:?<div[^>]*>.*?</div>.*?)*?
>>> </div>
>>
>> Why do you use a ? after a * I would have thought the usage of these
>> would be mutually exclusive, for example my understanding of *? is:
>>
>> * match 0 or more of the previous expression
>> ? match 0 or 1 of the previous expression
>
> Nope, a ? after a * makes it non-greedy. It will give you back the
> shortest match possible, instead of the longest.
>
> To illustrate, say we want to capture the contents of the following divs:
> $string = '<div>something</div><div>something else</div>';
>
> preg_match_all('%<div>(.*)</div>%',$string,$match1);
> preg_match_all('%<div>(.*?)</div>%',$string,$match2);
>
> print_r($match1);
> print_r($match2);
>
> Will give:
> Array
> (
> [0] => Array
> (
> [0] => <div>something</div><div>something else</div>
> )
>
> [1] => Array
> (
> [0] => something</div><div>something else
> )
>
> )
> Array
> (
> [0] => Array
> (
> [0] => <div>something</div>
> [1] => <div>something else</div>
> )
>
> [1] => Array
> (
> [0] => something
> [1] => something else
> )
>
> )
>
>
> --
> Rik Wasmus
>
>
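
To make the (?: versus (:? distinction from the quoted exchange concrete, here is a minimal sketch. The sample string is hypothetical and was picked only because it contains a colon; it is not taken from the page being harvested.

```php
<?php
// Hypothetical sample input, chosen only to illustrate the two spellings.
$subject = 'ab:cd';

// (?:...) groups without capturing: only the full match is stored.
preg_match('%b(?::)c%', $subject, $nonCapturing);

// (:?) is an ordinary capturing group holding an optional colon.
preg_match('%b(:?)c%', $subject, $capturing);

print_r($nonCapturing); // one entry: the full match
print_r($capturing);    // two entries: the full match and the captured ":"
```

Both patterns match the same text here; the only difference is whether the group shows up in the result array.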

Rik,

When I implement either of the two options above, the regex stops working.

$pattern='%<div[^>]*?class="overview"[^>]*?>
#start of overview
.*?
#allow random content between starting overview and header

<div[^>]*?class="header"[^>]*>
.*?
(?:<div[^>]*>.*?</div>.*?)*?
<h1>(?P<header>.*?)</h1>
.*?
(?:<div[^>]*>.*?</div>.*?)*?
</div>

.*?
#once again allow random content
<div[^>]*?class="content"[^>]*?>
#start of content
(?P<content>.*?(?:<div[^>]*?>.*?</div>.*?)*)
#get a named match from the content
</div>
#end of content
.*?
#I am not sure whether you need the code from this point on
<div[^>]*?class="break"[^>]*?></div>
#check for break
.*?
#some random content
</div>
#end of overview
%six';
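
When a pattern like this "stops working", it can help to tell a plain non-match apart from an engine error. The sketch below is only an illustration: the HTML is a hypothetical stand-in (the real page markup is not shown in this thread), and the pattern is a shortened version of the one above in the same %...%six style.

```php
<?php
// Hypothetical input; the real page markup is not shown in the thread.
$html = '<div class="overview"><div class="header"><h1>Title</h1></div></div>';

// Shortened stand-in for the full pattern, same delimiters and flags.
$pattern = '%<div[^>]*?class="header"[^>]*>
    #start of header
    .*?
    <h1>(?P<header>.*?)</h1>
    #named match for the heading text
%six';

$result = preg_match($pattern, $html, $match);

if ($result === false) {
    // preg_match() returns false when the engine itself reports an error.
    echo 'PCRE error code: ' . preg_last_error() . "\n";
} elseif ($result === 0) {
    echo "Pattern is valid but did not match.\n";
} else {
    echo 'Header: ' . $match['header'] . "\n";
}
```

Note that with the x flag, unescaped whitespace in the pattern is ignored outside character classes, so a literal space such as the one in `<div class=` would have to be written as `\s` or `[ ]`.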

I am trying to comprehend these expressions so I can solve them myself and
not trouble you; however, either these are very complex regexes or I am a
very slow learner... most likely the second :)

My breakdown and understanding of the regex above is:

<div[^>]*?class="overview"[^>]*?> #Match the start of the overview
========================================
match the string: <div
match any character other than >
match 0 or more of the prev expressions only until the first occurrence of
the next match is found (non greedy)
match the string: class="overview"
match any character other than >
match 0 or more of the prev expressions only until the first occurrence of
the next match is found (non greedy)
match the string: >

.*? #Match any content between the overview and header
========================================
match any character
match 0 or more of the prev expressions only until the first occurrence of
the next match is found (non greedy)

<div[^>]*?class="header"[^>]*> #Match the header
========================================
match the string: <div
match any character other than >
match 0 or more of the prev expressions only until the first occurrence of
the next match is found (non greedy)
match the string: class="header"
match any character other than >
match 0 or more of the prev expressions until the last occurrence of the
next match is found (greedy)
match the string: >
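
As a side note on the greedy and non-greedy forms of [^>]* in the breakdowns above: because [^>] can never match a >, both forms end up stopping at the same closing bracket. A minimal sketch, using a hypothetical tag that is not from the real page:

```php
<?php
// Hypothetical tag with extra attributes around the class.
$tag = '<div id="top" class="header" title="x">rest';

// Greedy [^>]* before the closing bracket.
preg_match('%<div[^>]*?class="header"[^>]*>%', $tag, $greedy);

// Non-greedy [^>]*? before the closing bracket.
preg_match('%<div[^>]*?class="header"[^>]*?>%', $tag, $lazy);

// Both variants match exactly the opening tag.
var_dump($greedy[0] === $lazy[0]);
```

The difference between greedy and non-greedy matters much more for .* with the s flag, which is free to run past tags.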

(?:<div[^>]*>.*?</div>.*?)*? #Does this eliminate nested divs within the header div?
========================================
Non capturing pattern
match the string: <div
match any character other than >
match 0 or more of the prev expressions until the last occurrence of the
next match is found (greedy)
match the string: >
match any character
match 0 or more of the prev expressions only until the first occurrence of
the next match is found (non greedy)
match the string: </div>
match any character
match 0 or more of the prev expressions only until the first occurrence of
the next match is found (non greedy)
match 0 or more of the prev expression in brackets only until the first
occurrence of the next match is found (non greedy)
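
One way to probe the nested-div question is to run the header part with and without the optional group on a small input. This is only a sketch: the HTML below is hypothetical, and the two patterns are cut-down versions of the header section of the full expression.

```php
<?php
// Hypothetical header block with one nested div before the <h1>.
$html = '<div class="header"><div class="logo">Logo</div><h1>Title</h1></div>';

// Header part including the optional non-capturing div group.
$withGroup = '%<div[^>]*?class="header"[^>]*>
    .*?
    (?:<div[^>]*>.*?</div>.*?)*?
    <h1>(?P<header>.*?)</h1>
%six';

// Same header part with the group removed.
$withoutGroup = '%<div[^>]*?class="header"[^>]*>
    .*?
    <h1>(?P<header>.*?)</h1>
%six';

preg_match($withGroup, $html, $with);
preg_match($withoutGroup, $html, $without);

echo $with['header'] . "\n";
echo $without['header'] . "\n";
```

On this input both variants find the same heading, so the group does not appear to be what steps over the nested div here; the .*? with the s flag can do that on its own. Other markup may behave differently.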

<h1>(?P<header>.*?)</h1> #Match the contents between the h1 tags
========================================
match the string: <h1>
capture all chars only until the first occurrence of the next match is
found (non greedy) and name the subpattern
match the string: </h1>
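
The named subpattern can be tried in isolation. A minimal sketch with a hypothetical heading; the name "header" comes from the pattern above, everything else is made up for illustration:

```php
<?php
// Hypothetical heading, not taken from the real page.
$html = '<h1>Latest news</h1>';

preg_match('%<h1>(?P<header>.*?)</h1>%si', $html, $m);

// A named group is available under its name and under its number.
echo $m['header'] . "\n";
echo $m[1] . "\n";
```

Both lines print the text between the h1 tags, so the name is simply a more readable key for the same capture.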

Thanks for all your help so far and I think I'm getting there...
