Posted by Rik on 08/17/06 22:12
Shuan wrote:
> I am trying to grab sites like craigslist, parse with regular
> expression and put some content into database.
>
> $request -> fetch( $region_link );
>
> if( !$request -> error ){
> $pageContent = $request -> results;
>
> $regionpattern =
> "/<a[^>]*href=\"(\/s\/SL\/sg_maY.*)\".*>.*<img.*alt=\"(.*)\".*id=\"btn.*\">/siU";
>
> if(preg_match_all( $regionpattern, $pageContent, $categorylinks ))
I was almost tempted to say it was a greediness issue, before I spotted the /U
modifier. Dodged a bullet there :-).
If I interpret your regex correctly, try this rewrite. I use dots very
sparingly and prefer negated character classes, where greedy matching is
actually useful because the class itself stops the match at the next delimiter.
I'm not really sure it will gain much on resource consumption, but we can try:
'|<a[^>]*?href="(/s/SL/sg_maY[^"]*)"[^>]*>.*?<img[^>]*?alt="([^"]*)"[^>]*?id="btn[^"]*"[^>]*>|si'
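As a rough sketch of how the rewritten pattern behaves, here it is run against a made-up fragment. The markup, link paths and alt texts below are invented for illustration and are not taken from the actual site:

```php
<?php
// Hypothetical sample markup; the real page may be structured differently.
$pageContent = <<<'HTML'
<a class="x" href="/s/SL/sg_maY_cars" title="t"><img src="a.gif" alt="Cars" id="btn1"></a>
<a href="/other/link"><img src="b.gif" alt="Other" id="btn2"></a>
<a href="/s/SL/sg_maY_jobs">
  <img src="c.gif" alt="Jobs" id="btn3">
</a>
HTML;

// Negated classes ([^>], [^"]) keep each piece inside its own tag or attribute,
// so the /U modifier is no longer needed.
$regionpattern = '|<a[^>]*?href="(/s/SL/sg_maY[^"]*)"[^>]*>.*?<img[^>]*?alt="([^"]*)"[^>]*?id="btn[^"]*"[^>]*>|si';

$count = preg_match_all($regionpattern, $pageContent, $categorylinks);

// $categorylinks[1] holds the captured hrefs, $categorylinks[2] the alt texts.
print_r($categorylinks[1]);
print_r($categorylinks[2]);
```

The second link is skipped because its href does not start with /s/SL/sg_maY. Note that the `.*?` between the anchor and the image can still run on to a later `<img>` if a matching anchor has no image of its own.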
I'd also suggest a foreach loop instead of your for loop:
foreach($categorylinks[1] as $link){
    $category_link = "http://www.mysite.com".$link;
    include( "pagecrawler.php" ); // I'm still curious what this does....
}
Or if you do use capture 2:
if(preg_match_all( $regionpattern, $pageContent, $categorylinks, PREG_SET_ORDER )){
    foreach($categorylinks as $link){
        $category_link = "http://www.mysite.com".$link[1];
        include( "pagecrawler.php" ); // I'm still curious what this does....
    }
}
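To show why the loop changes shape with PREG_SET_ORDER, here is a small sketch comparing the two result layouts. The HTML string and the simplified pattern are invented for the example:

```php
<?php
// Hypothetical input and a deliberately simple pattern, for illustration only.
$html    = '<a href="/a">A</a><a href="/b">B</a>';
$pattern = '|<a href="([^"]*)">([^<]*)</a>|';

// Default (PREG_PATTERN_ORDER): results are grouped per capture.
// $byGroup[1] holds every first capture, $byGroup[2] every second capture.
preg_match_all($pattern, $html, $byGroup);

// PREG_SET_ORDER: results are grouped per match.
// $byMatch[0] is the first match: [whole match, capture 1, capture 2].
preg_match_all($pattern, $html, $byMatch, PREG_SET_ORDER);

foreach ($byMatch as $match) {
    echo $match[1], ' => ', $match[2], "\n";
}
```

With the default order you loop over one capture at a time; with PREG_SET_ORDER each iteration gives you the link and its label together.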
If you still have issues, I'd like to see the actual site you're leeching right
now :-). (If you're fetching a whole page at once, be sure to unset() variables
you no longer need.) I don't know what your pagecrawler.php actually does, but
if it doesn't use capture 2 you might as well not capture it.
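A minimal sketch of both suggestions, assuming the alt text really is unused. The sample markup is invented, and whether this saves anything noticeable depends on the page size:

```php
<?php
// Hypothetical sample markup, invented for illustration.
$pageContent = '<a href="/s/SL/sg_maY_cars"><img src="a.gif" alt="Cars" id="btn1"></a>';

// Same rewrite as earlier in this post, but without parentheses around the
// alt text, so only the link is captured.
$regionpattern = '|<a[^>]*?href="(/s/SL/sg_maY[^"]*)"[^>]*>.*?<img[^>]*?alt="[^"]*"[^>]*?id="btn[^"]*"[^>]*>|si';

preg_match_all($regionpattern, $pageContent, $categorylinks);

// The page source is no longer needed once the links are extracted.
unset($pageContent);

$links = $categorylinks[1];
print_r($links);
```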
Grtz,
--
Rik Wasmus