Reply to Re: web harvesting


Posted by McHenry on 06/26/06 03:33

"Rik" <luiheidsgoeroe@hotmail.com> wrote in message
news:8b14a$449ebfe2$8259c69c$19227@news2.tudelft.nl...
> McHenry wrote:
>>> The comment is between # and a newline. As you concat everything
>>> instead of just newlining it inside the quotes, the expression
>>> breaks. Why do you concat by the way?
>>
>> I thought this was the way I had to do it... (new to php, new to
>> Linux, new to many things)
>> Now I understand, I thought the comments were part of the regex and
>> couldn't understand how it worked... :)
>
> Hehe, yeah, then it gets tricky :-).
>
>>> That's correct behaviour, (:? means a NON capturing pattern.
>>
>> Your original solution used (?: not (:? is there a difference or is
>> this a typo ?
>
> Typo, should be (?:, (:? would mean 'capture a ":" zero or one time' :-)
>
>>> If you only want the <h1> field from the header-div:
>>>
>>> <div[^>]*?class="header"[^>]*>
>>> .*?(:?<div[^>]*>.*?</div>.*?)*?
>>> <h1>(?P<header>.*?)</h1>
>>> .*?(:?<div[^>]*>.*?</div>.*?)*?
>>> </div>
>>
>> Why do you use a ? after a * ? I would have thought the usage of these
>> would be mutually exclusive, for example my understanding of *? is:
>>
>> * : match 0 or more of the previous expression
>> ? : match 0 or 1 of the previous expression
>
>
> Nope, a ? after a * makes it non-greedy. It will give you back the
> shortest match possible, instead of the longest.
>
> To illustrate, say we want to capture the contents of the following divs:
> $string = '<div>something</div><div>something else</div>';
>
> preg_match_all('%<div>(.*)</div>%',$string,$match1);
> preg_match_all('%<div>(.*?)</div>%',$string,$match2);
>
> print_r($match1);
> print_r($match2);
>
> Will give:
> Array
> (
> [0] => Array
> (
> [0] => <div>something</div><div>something else</div>
> )
>
> [1] => Array
> (
> [0] => something</div><div>something else
> )
>
> )
> Array
> (
> [0] => Array
> (
> [0] => <div>something</div>
> [1] => <div>something else</div>
> )
>
> [1] => Array
> (
> [0] => something
> [1] => something else
> )
>
> )
>
>
> --
> Rik Wasmus
>
>


Rik,

When I implement either of the two options above, the regex stops working:

$pattern='%<div[^>]*?class="overview"[^>]*?>
#start of overview
.*?
#allow random content between starting overview and header

<div[^>]*?class="header"[^>]*>
.*?
(?:<div[^>]*>.*?</div>.*?)*?
<h1>(?P<header>.*?)</h1>
.*?
(?:<div[^>]*>.*?</div>.*?)*?
</div>

.*?
#once again allow random content
<div[^>]*?class="content"[^>]*?>
#start of content
(?P<content>.*?(?:<div[^>]*?>.*?</div>.*?)*)
#get a named match from the content
</div>
#end of content
.*?
#I am not sure whether you need the code from this point on
<div[^>]*?class="break"[^>]*?></div>
#check for break
.*?
#some random content
</div>
#end of overview
%six';
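
The three modifiers after the closing delimiter do a lot of the work in the pattern above. As a rough sketch (the sample string and the shortened pattern are invented for illustration, not taken from the page being harvested): x makes PCRE ignore whitespace and treat everything from # to the end of the line as a comment, s lets the dot match newlines, and i ignores case.

```php
<?php
// Sketch only: a tiny sample string and pattern made up for illustration.
$html = "<DIV class=\"header\">\n<h1>Title</h1>\n</DIV>";

// With x: spaces, newlines and "# ..." comments are not part of the regex.
// With s: the dot also matches newlines. With i: <div matches <DIV.
$pattern = '%<div[^>]*>   # opening div, this comment is ignored
            .*?           # any content, newlines included
            <h1>(.*?)</h1>
           %six';
$found = preg_match($pattern, $html, $m);

// Without x, the space and the "# opening div" text are matched literally,
// so the same idea fails on the same input.
$foundNoX = preg_match('%<div[^>]*> # opening div
.*?<h1>(.*?)</h1>%si', $html, $m2);

var_dump($found, $m[1], $foundNoX);
```

One thing to watch for in commented patterns like this: the delimiter (here %) must not appear unescaped inside a comment, because it would end the pattern early.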


I am trying to comprehend these expressions so I can solve them myself and
not trouble you, however either these are very complex regexes or I am a
very slow learner... most likely the second :)

My breakdown and understanding of the regex above is:

<div[^>]*?class="overview"[^>]*?> #Match the start of the overview
========================================
match the string: <div
match any character other than >
match 0 or more of the prev expressions only until the first occurrence of
the next match is found (non greedy)
match the string: class="overview"
match any character other than >
match 0 or more of the prev expressions only until the first occurrence of
the next match is found (non greedy)
match the string: >
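
The breakdown above can be checked against a few opening tags. This is only a sketch: the tags in $tags are hypothetical examples, not copied from the real page.

```php
<?php
// Hypothetical sample tags, used only to try out the first line of the pattern.
$tags = array(
    '<div class="overview">',
    '<div id="x" class="overview" style="color:red">',
    '<div class="header">',
);

$results = array();
foreach ($tags as $tag) {
    // [^>]*? means: zero or more characters that are not ">", as few as possible
    $results[] = preg_match('%<div[^>]*?class="overview"[^>]*?>%i', $tag);
}

print_r($results);
```

The first two tags should match (other attributes before or after the class are absorbed by [^>]*?), while the third should not, because its class is "header".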

.*? #Match any content between the overview and header
========================================
match any character
match 0 or more of the prev expressions only until the first occurrence of
the next match is found (non greedy)

<div[^>]*?class="header"[^>]*> #Match the header
========================================
match the string: <div
match any character other than >
match 0 or more of the prev expressions only until the first occurrence of
the next match is found (non greedy)
match the string: class="header"
match any character other than >
match 0 or more of the prev expressions until the last occurrence of the
next match is found (greedy)
match the string: >

(?:<div[^>]*>.*?</div>.*?)*? #Does this eliminate nested divs within the header div ?
========================================
Non capturing pattern
match the string: <div
match any character other than >
match 0 or more of the prev expressions until the last occurrence of the
next match is found (greedy)
match the string: >
match any character
match 0 or more of the prev expressions only until the first occurrence of
the next match is found (non greedy)
match the string: </div>
match any character
match 0 or more of the prev expressions only until the first occurrence of
the next match is found (non greedy)
match 0 or more of the prev expression in brackets only until the first
occurrence of the next match is found (non greedy)
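
To see what "non capturing" changes in practice, and why the (:? typo discussed earlier behaves differently, here is a small sketch. The strings are invented for illustration and the patterns are simplified versions of the one above.

```php
<?php
$string = '<div>a</div><div>b</div>';

// (?: ... ) groups without capturing: only the whole match is stored.
preg_match('%(?:<div>.*?</div>)+%', $string, $nonCapturing);

// ( ... ) captures: index 1 holds the last repetition of the group.
preg_match('%(<div>.*?</div>)+%', $string, $capturing);

// (:? ... ) is an ordinary capturing group that starts with an optional ":".
preg_match('%(:?abc)%', 'x:abc', $typo);

print_r($nonCapturing);
print_r($capturing);
print_r($typo);
```

Both of the first two patterns match the same text. The difference is only in what ends up in the result array, which matters once named or numbered subpatterns are read back out.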

<h1>(?P<header>.*?)</h1> #Match the contents between the h1 tags
========================================
match the string: <h1>
capture all chars only until the first occurrence of the next match is
found (non greedy) and name the subpattern "header"
match the string: </h1>
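
A short sketch of how the named subpattern shows up in the result, assuming a made-up header div as input:

```php
<?php
// Invented sample markup, only to show where the named match ends up.
$html = '<div class="header"><h1>Latest news</h1></div>';

preg_match('%<h1>(?P<header>.*?)</h1>%si', $html, $match);

// The captured text is available under the name and under its number.
echo $match['header'], "\n";
echo $match[1], "\n";
```

The name is a convenience: the same text is still stored at the numeric index, so adding or removing other capturing groups later does not break code that reads $match['header'].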


Thanks for all your help so far and I think I'm getting there...
