Posted by tihu on 05/06/06 13:14
aka_eu wrote:
> Hi,
>
> recently I got a project to get info from different websites and to put
> the info into a DB.
> Now, I was wondering what is the best technique to implement something
> like that.
>
> How should I open the pages from other websites? With fopen, through a
> socket, or with curl?
Either way works; it depends on which websites you are accessing and
what you need to do. If your answer to any of these questions is yes,
then use curl.
Will your script need to auto-submit any forms to these websites? Do
any of the sites use cookies? If a page is inaccessible, do you need to
know why?
file_get_contents is the easiest way, but it tells you very little when
the webpage is inaccessible, and without a custom stream context it only
performs simple GET requests.
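A minimal sketch of the file_get_contents route, to show how little you
learn on failure. The helper name fetch_simple is made up for this
example, and a data:// URI stands in for a real URL so it runs offline.

```php
<?php
// Hypothetical helper: fetch a page, or return null if it could not be read.
function fetch_simple(string $url): ?string
{
    // '@' hides the PHP warning; on failure we only get false back.
    $html = @file_get_contents($url);
    if ($html === false) {
        // We know it failed, but not whether it was DNS, a 404, a timeout...
        return null;
    }
    return $html;
}

// Stand-in for a real page so the sketch needs no network access.
$page = fetch_simple('data://text/plain,hello');
echo $page === null ? "fetch failed\n" : $page . "\n";
```

For a real site you would pass an http:// URL instead, which needs
allow_url_fopen enabled in php.ini.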
Curl has comprehensive error reporting, and you can post forms using
curl_setopt with CURLOPT_POST and CURLOPT_POSTFIELDS. It can also deal
with cookie-based websites, pretend it's a browser/bot, and has plenty
of other useful stuff.
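Roughly what that looks like, as a sketch rather than a finished
scraper. The function name post_form, the user agent string, and the
timeout are all my own assumptions; only the curl options themselves
are standard.

```php
<?php
// Hypothetical helper: POST a form and report the body, status and any error.
function post_form(string $url, array $fields, string $cookieFile): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_POST           => true,  // send a POST request
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_COOKIEFILE     => $cookieFile,  // send cookies saved earlier
        CURLOPT_COOKIEJAR      => $cookieFile,  // save cookies when the handle is freed
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleBot/1.0)',
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects after the POST
        CURLOPT_TIMEOUT        => 15,    // seconds; an arbitrary choice
    ]);

    $body = curl_exec($ch);  // false on failure

    // The handle is released when $ch goes out of scope at return.
    return [
        'body'   => $body === false ? null : $body,
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'error'  => $body === false ? curl_error($ch) : null,
    ];
}

// Usage (placeholder URL and field names, needs network access):
// $r = post_form('http://www.example.com/login', ['user' => 'me'], '/tmp/cookies.txt');
// if ($r['error'] !== null) { echo "Failed: {$r['error']}\n"; }
```

The 'error' entry is the part file_get_contents can't give you: curl
tells you whether it was a timeout, a DNS failure, a refused connection
and so on.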
You could do all this yourself using sockets, but it's already been done
with curl and doing it by hand is sooo tedious.
>
> After that, what is the fastest way to parse a whole page for info, and
> of course to parse it a few times to get different info from the same
> page?
Best use DOM.
I've seen some people use regular expressions to do it, but the regexes
soon end up being a nightmare to maintain or change when the website
inevitably changes. But if you're only looking for a few pieces of
information from a few sites, preg_match could work.
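For the small-job case, something like this sketch would do. The HTML
snippet and the "price" class are invented for the example.

```php
<?php
// Made-up HTML standing in for a fetched page.
$html = '<html><head><title>Example Shop</title></head>'
      . '<body><span class="price">19.99</span></body></html>';

$title = null;
$price = null;

// Page title: /i ignores case, /s lets . match across newlines.
if (preg_match('~<title>(.*?)</title>~is', $html, $m)) {
    $title = trim($m[1]);
}

// Price: tied to the exact markup, so it breaks if the site changes it.
if (preg_match('~<span class="price">([\d.]+)</span>~', $html, $m)) {
    $price = (float) $m[1];
}

echo $title, ' ', $price, "\n";
```

Note how the second pattern hard-codes the tag and the attribute. That
is exactly the part that becomes painful to maintain.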
With DOM you parse the page into a DOM tree using
DOMDocument->loadHTML(), then use the DOM methods and XPath to get what
you want. Especially XPath....
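Here's a small sketch of that: parse once, then run as many XPath
queries as you like against the same tree. The table markup and class
names are invented for the example.

```php
<?php
// Made-up HTML standing in for a fetched page.
$html = '<html><body><table id="prices">'
      . '<tr><td class="name">Widget</td><td class="price">19.99</td></tr>'
      . '<tr><td class="name">Gadget</td><td class="price">5.50</td></tr>'
      . '</table></body></html>';

$doc = new DOMDocument();
// Real pages are rarely valid HTML; collect parser warnings quietly.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$items = [];

// '//tr' rather than '/tr' so it matches with or without a <tbody>.
foreach ($xpath->query('//table[@id="prices"]//tr') as $row) {
    // Queries relative to the current row; string() returns the text value.
    $name  = $xpath->evaluate('string(td[@class="name"])', $row);
    $price = $xpath->evaluate('string(td[@class="price"])', $row);
    $items[$name] = (float) $price;
}

print_r($items);
```

The $xpath object can be reused for every other piece of info you want
from the same page, so the page is only parsed once.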
Don't know if it's the fastest to execute at runtime, but if anyone
knows a more flexible, useful way of data mining, I need to know.
The DOM method getElementById doesn't work unless the page has a proper
doctype (which rules out most webpages).
http://blog.bitflux.ch/wiki/GetElementById_Pitfalls explains the
problem and the solutions, and has a straightforward example of using
XPath as well.
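One way around it, as a sketch: match on the id attribute with XPath
instead of calling getElementById. The HTML and the id "content" are
made up for the example.

```php
<?php
// Made-up HTML with no doctype at all.
$html = '<html><body><div id="content">Hello</div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Compare the attribute value directly; no DTD or doctype needed.
$node = $xpath->query('//*[@id="content"]')->item(0);
$text = $node !== null ? $node->textContent : null;

echo $text, "\n";
```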
http://www.zvon.org/xxl/XPathTutorial/General/examples.html is a good
XPath tutorial. The site is ugly, but there are plenty of good examples
to learn from and an interactive lab.
Seeya
Tim