Re: web harvesting

Posted by McHenry on 06/25/06 08:05

"Rik" <luiheidsgoeroe@hotmail.com> wrote in message
news:aac61$449e3a5b$8259c69c$14679@news2.tudelft.nl...
> McHenry wrote:
>>> This works great however when I try to view the contents of the
>>> array I am only presented with a single element:
>
>>> Here is the code I am using:
>>>
>>> $pattern='%<div[^>]*?class="overview"[^>]*?> #start
>>> of overview ';
>>> $pattern=$pattern.'.*?
>
> The comment is between # and a newline. As you concatenate everything
> instead of just putting newlines inside the quotes, the expression breaks.
> Why do you concatenate, by the way?

I thought this was the way I had to do it... (new to PHP, new to Linux, new
to many things)
Now I understand. I thought the comments were part of the regex and couldn't
understand how it worked... :)
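
For anyone following along, here is a minimal sketch of what went wrong, assuming the pattern is run with the x (extended) modifier as Rik's original appears to be. The sample HTML, the group name "body", and the variable names are invented for illustration and are not from the thread.

```php
<?php
// Sketch only: the HTML and names below are made up for illustration.
$html = '<div class="overview">hello</div>';

// With the x modifier, "#" starts a comment that runs to the next newline.
// Built by concatenation with no newline, the comment swallows everything
// that follows, including the named group.
$broken = '%<div[^>]*?class="overview"[^>]*?> #start of overview ';
$broken = $broken . '(?P<body>.*?)</div>%sx';

// Written as one multi-line string, each comment ends at its own line break.
$working = '%<div[^>]*?class="overview"[^>]*?>  # start of overview
            (?P<body>.*?)                       # contents, as little as possible
            </div>                              # end of overview
            %sx';

preg_match($broken, $html, $m1);   // only element 0, the opening tag
preg_match($working, $html, $m2);  // element 0 plus the "body" group

print_r($m1);
print_r($m2);
```

The broken version still matches, which is probably why the array came back with a single element instead of an error.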

>
>> Maybe it should have been obvious, but I missed it. Anyway, I removed the
>> comments from inside the pattern string and it now works.
>>
>> I love the concept of the named match, which makes it very easy to
>> reference in an array. Very powerful.
>>
>> Within the header I have a field I would like to capture between
>> <h1>field_here</h1> I suspected I could achieve this by replacing:
>> (?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*)
>>
>> with
>>
>> (?P<header>.*?(?:<h2[^>]*?>.*?</h2>.*?)*)
>>
>> however nothing changed when I printed the array value of 'header'?
>
> That's correct behaviour, (:? means a NON capturing pattern.

Your original solution used (?: rather than (:? so is there a difference, or
is this a typo?
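
A quick check suggests the two are not the same thing in PCRE. The sketch below uses a made-up input string; the variable names are only for illustration.

```php
<?php
// Illustrative input only; not taken from the thread.
$text = 'abab';

// (?:...) groups without capturing, so only the whole match is returned.
preg_match('/(?:ab)+/', $text, $nonCapturing);

// (:?...) is an ordinary capturing group whose content begins with an
// optional literal colon, so it adds a numbered element to the result.
preg_match('/(:?ab)+/', $text, $capturing);

print_r($nonCapturing);
print_r($capturing);
```

So a pattern written with (:? will usually still match, but it captures an extra group and would also accept a stray colon at that position.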

>
> If you only want the <h1> field from the header-div:
>
> <div[^>]*?class="header"[^>]*>
> .*?(:?<div[^>]*>.*?</div>.*?)*?
> <h1>(?P<header>.*?)</h1>
> .*?(:?<div[^>]*>.*?</div>.*?)*?
> </div>

Why do you use a ? after a * in these patterns? I would have thought the two
would be mutually exclusive. For example, my understanding of
<div[^>]*?class="header"[^>]*> is:

match the pattern <div
match any character other than >
match 0 or more of the previous expression
match 0 or 1 of the previous expression
match the pattern class="header"
match any character other than >
match 0 or more of the previous expression
match the pattern >

I appreciate your assistance...
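
As far as I can tell from the PCRE documentation, a ? placed directly after a quantifier does not mean "0 or 1". It makes that quantifier lazy, so it matches as little as possible. Also, [^>]* is a single unit: zero or more characters that are not >. A small sketch with invented HTML shows the difference between greedy and lazy:

```php
<?php
// Invented sample markup with two sibling divs.
$html = '<div id="a" class="header">one</div><div class="header">two</div>';

// Greedy: .* takes as much as it can, so the match runs to the last </div>.
preg_match('%<div[^>]*>(.*)</div>%s', $html, $greedy);

// Lazy: .*? takes as little as it can, so the match stops at the first </div>.
preg_match('%<div[^>]*>(.*?)</div>%s', $html, $lazy);

echo $greedy[1], "\n";
echo $lazy[1], "\n";
```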

>
>
> If you want the whole header-div and the h2-field again in a separate div:
> <div[^>]*?class="header"[^>]*>
> (?P<header>.*?(:?<div[^>]*>.*?</div>.*?)*?
> <h1>(?P<h1>.*?)</h1>
> .*?(:?<div[^>]*>.*?</div>.*?)*?)
> </div>
>
> Grtz,
> --
> Rik Wasmus
>
>
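
To try the last pattern out, here is a sketch that assumes the non-capturing form (?: was intended and that the s and x modifiers are in use. The sample markup is hypothetical and only there to exercise the pattern.

```php
<?php
// Hypothetical markup, invented to exercise the pattern.
$html = '<div class="header"><div>logo</div><h1>Title</h1><div>nav</div></div>';

// The outer named group keeps the whole header body,
// the inner named group keeps only the <h1> text.
$pattern = '%<div[^>]*?class="header"[^>]*>
            (?P<header>.*?(?:<div[^>]*>.*?</div>.*?)*?
            <h1>(?P<h1>.*?)</h1>
            .*?(?:<div[^>]*>.*?</div>.*?)*?)
            </div>%sx';

preg_match($pattern, $html, $m);

echo $m['header'], "\n";
echo $m['h1'], "\n";
```

Both values are then available by name in the same result array, which is the convenience of named matches mentioned earlier.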

 
