Posted by McHenry on 06/26/06 03:33
"Rik" <luiheidsgoeroe@hotmail.com> wrote in message
news:8b14a$449ebfe2$8259c69c$19227@news2.tudelft.nl...
> McHenry wrote:
>>> The comment is between # and a newline. As you concat everything instead
>>> of just newlining it inside the quotes, the expression breaks. Why do
>>> you concat by the way?
>>
>> I thought this was the way I had to do it... (new to php, new to
>> Linux, new to many things)
>> Now I understand, I thought the comments were part of the regex and
>> couldn't understand how it worked... :)
>
> Hehe, yeah, then it gets tricky :-).
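
To make the point about comments concrete, here is a minimal sketch (not the pattern from this thread) of what happens under the x modifier. The sample markup and variable names are made up for illustration. In the first pattern the real newlines inside the quotes end each # comment; in the concatenated one there are no newlines, so everything after the first # is treated as a comment.

```php
<?php
// One single-quoted string: each real newline ends the # comment before it.
$pattern = '%<h1>      # opening tag
            (.*?)      # lazily capture the heading text
            </h1>      # closing tag
           %sx';

// Concatenated pieces with no newlines: the first # comments out the rest,
// so the effective pattern is just <h1> with no capturing group at all.
$broken = '%<h1> # opening tag ' . '(.*?) # heading text ' . '</h1> %sx';

$html = '<h1>Hello</h1>';

$ok  = preg_match($pattern, $html, $m);
$bad = preg_match($broken, $html, $m2);

print_r($m);   // whole match plus the captured heading text
print_r($m2);  // only the whole match "<h1>", nothing captured
```

The second call still reports a match, which is why this kind of mistake is easy to miss: the pattern is valid, it just no longer contains the parts you wrote after the first #.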
>
>>> That's correct behaviour, (:? means a NON capturing pattern.
>>
>> Your original solution used (?: not (:? is there a difference or is
>> this a typo ?
>
> Typo, should be (?:, (:? would mean 'capture a ":" zero or one time' :-)
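
A small sketch of the difference between the two spellings, using a made-up subject string: (?: opens a group that is not captured, while (:? opens an ordinary capturing group whose content begins with an optional colon.

```php
<?php
$subject = 'ab:cd';

// (?:...) groups without capturing, so only the whole match is returned.
preg_match('/(?:ab):cd/', $subject, $nonCapturing);

// (:?cd) is a normal capturing group: an optional ":" followed by "cd".
preg_match('/ab(:?cd)/', $subject, $capturing);

print_r($nonCapturing); // one element: the whole match
print_r($capturing);    // whole match plus group 1, which includes the ":"
```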
>
>>> If you only want the <h1> field from the header-div:
>>>
>>> <div[^>]*?class="header"[^>]*>
>>> .*?(:?<div[^>]*>.*?</div>.*?)*?
>>> <h1>(?P<header>.*?)</h1>
>>> .*?(:?<div[^>]*>.*?</div>.*?)*?
>>> </div>
>>
>> Why do you use a ? after a * ? I would have thought the usage of these
>> would be mutually exclusive, for example my understanding of
>> * and ? is:
>> * match 0 or more of the previous expression
>> ? match 0 or 1 of the previous expression
>
> Nope, a ? after a * makes it non-greedy. It will give you back the
> shortest match possible, instead of the longest.
>
> To illustrate, say we want to capture the contents of the following divs:
> $string = '<div>something</div><div>something else</div>';
>
> preg_match_all('%<div>(.*)</div>%',$string,$match1);
> preg_match_all('%<div>(.*?)</div>%',$string,$match2);
>
> print_r($match1);
> print_r($match2);
>
> Will give:
> Array
> (
>     [0] => Array
>         (
>             [0] => <div>something</div><div>something else</div>
>         )
>
>     [1] => Array
>         (
>             [0] => something</div><div>something else
>         )
>
> )
> Array
> (
>     [0] => Array
>         (
>             [0] => <div>something</div>
>             [1] => <div>something else</div>
>         )
>
>     [1] => Array
>         (
>             [0] => something
>             [1] => something else
>         )
>
> )
>
>
> --
> Rik Wasmus
>
>
Rik,
When I implement either of the two options above, the regex stops working:
$pattern='%<div[^>]*?class="overview"[^>]*?>
#start of overview
.*?
#allow random content between starting overview and header
<div[^>]*?class="header"[^>]*>
.*?
(?:<div[^>]*>.*?</div>.*?)*?
<h1>(?P<header>.*?)</h1>
.*?
(?:<div[^>]*>.*?</div>.*?)*?
</div>
.*?
#once again allow random content
<div[^>]*?class="content"[^>]*?>
#start of content
(?P<content>.*?(?:<div[^>]*?>.*?</div>.*?)*)
#get a named match from the content
</div>
#end of content
.*?
#I am not sure whether you need the code from this point on
<div[^>]*?class="break"[^>]*?></div>
#check for break
.*?
#some random content
</div>
#end of overview
%six';
I am trying to comprehend these expressions so I can solve them myself and
not trouble you; however, these are either very complex regexes or I am a
very slow learner... most likely the second :)
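
One way to narrow down a pattern that "stops working" is to tell apart "no match" from "the regex engine gave up". This is only a sketch under assumptions: the helper name describe_preg_result and the sample markup are hypothetical, and the pattern is a cut-down version of the one above, not a diagnosis of it. preg_match() returns 1 on a match, 0 on no match and false on an error, and preg_last_error() reports the error code.

```php
<?php
// Hypothetical helper: turns the return value of preg_match() into a label.
function describe_preg_result($result)
{
    if ($result === false) {
        // false means an error (e.g. the backtrack limit), not "no match".
        return preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR
            ? 'error: backtrack limit reached'
            : 'error: code ' . preg_last_error();
    }
    return $result === 1 ? 'match' : 'no match';
}

// Cut-down pattern: header div with a named capture for the <h1> text.
$pattern = '%<div[^>]*?class="header"[^>]*>   # start of header
            .*?
            <h1>(?P<header>.*?)</h1>          # named capture
            .*?
            </div>                            # end of header
           %six';

// Made-up sample markup for the test run.
$html = "<div class=\"overview\">\n<div class=\"header\"><h1>Title</h1></div>\n</div>";

$result = preg_match($pattern, $html, $m);
echo describe_preg_result($result), "\n";
```

Testing the pattern piece by piece like this, adding one part back at a time, usually shows which part is responsible when the full expression fails.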
My breakdown and understanding of the regex above is:
<div[^>]*?class="overview"[^>]*?> #Match the start of the overview
========================================
match the string: <div
match any character other than >
match 0 or more of the prev expression only until the first occurrence of
the next match is found (non greedy)
match the string: class="overview"
match any character other than >
match 0 or more of the prev expression only until the first occurrence of
the next match is found (non greedy)
match the string: >
.*? #Match any content between the overview and header
========================================
match any character
match 0 or more of the prev expression only until the first occurrence of
the next match is found (non greedy)
<div[^>]*?class="header"[^>]*> #Match the header
========================================
match the string: <div
match any character other than >
match 0 or more of the prev expression only until the first occurrence of
the next match is found (non greedy)
match the string: class="header"
match any character other than >
match 0 or more of the prev expression until the last occurrence of the
next match is found (greedy)
match the string: >
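
A side note on the greedy and non-greedy forms of [^>]* in this breakdown, as a small sketch with a made-up tag: because [^>] can never match a > itself, both forms have to stop at the first >, so inside a tag they end up matching the same text. The difference only matters for something like .* which can run past the next match.

```php
<?php
// Made-up input: a tag followed by text that contains another ">".
$tag = '<div id="top" class="header">rest > of page';

preg_match('%<div[^>]*>%', $tag, $greedy);  // greedy quantifier
preg_match('%<div[^>]*?>%', $tag, $lazy);   // non-greedy quantifier

var_dump($greedy[0] === $lazy[0]); // both stop at the first ">"
echo $greedy[0], "\n";
```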
(?:<div[^>]*>.*?</div>.*?)*? #Does this eliminate nested divs within the header div ?
========================================
Non capturing pattern
match the string: <div
match any character other than >
match 0 or more of the prev expression until the last occurrence of the
next match is found (greedy)
match the string: >
match any character
match 0 or more of the prev expression only until the first occurrence of
the next match is found (non greedy)
match the string: </div>
match any character
match 0 or more of the prev expression only until the first occurrence of
the next match is found (non greedy)
match 0 or more of the prev expression in brackets only until the first
occurrence of the next match is found (non greedy)
<h1>(?P<header>.*?)</h1> #Match the contents between the h1 tags
========================================
match the string: <h1>
capture all chars only until the first occurrence of the next match is
found (non greedy) and name the subpattern
match the string: </h1>
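
For the named subpattern, a minimal sketch with made-up markup: the text captured by (?P<header>...) is available in the match array both under the key 'header' and under its numeric index.

```php
<?php
// Made-up markup containing a single <h1>.
$html = '<div class="header"><h1>Weekly overview</h1></div>';

if (preg_match('%<h1>(?P<header>.*?)</h1>%si', $html, $m)) {
    echo $m['header'], "\n"; // the named key
    echo $m[1], "\n";        // the same text under the numeric index
}
```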
Thanks for all your help so far and I think I'm getting there...