Posted by Nikita the Spider on 10/07/07 03:19
In article <1191684966.878248.236520@57g2000hsv.googlegroups.com>,
Math <mathieu.lory@gmail.com> wrote:
> Hi,
>
> There is something I really don't understand, and I would like your
> advice...
>
> 1. Some websites (for instance news.google.fr) contain a
> syndication feed (like http://news.google.fr/nwshp?topic=po&output=atom).
>
> 2. These websites have a robots.txt file preventing some robots
> (identified by user-agent) from indexing them.
> For example : http://news.google.fr/robots.txt contains (extract) :
> User-agent: *
> Disallow: /nwshp
>
> 3. I've developed a syndication aggregator, and I would like to
> respect these robots.txt files. But as far as I can see and understand,
> my user-agent isn't authorized to access /nwshp?topic=po&output=atom
> because of this robots.txt...
>
> So, is this normal? Are robots.txt files only for indexing robots?
> To sum up, should my syndication aggregator respect these files or
> not?
Hi Math,
It's hard to say, but if they prefer to keep this content from being
copied to other sites, robots.txt is the way to do it. In other words,
you can't assume they just want to keep indexing bots out; they might
want to keep all bots out.
If your aggregator is only being used by you and a few friends, then
Google et al. probably wouldn't care if your bot visits them once per
hour or so. But if you want this aggregator to be used by lots of
people, then I'd say you need to respect robots.txt.
BTW the closest thing there is to a standard for robots.txt is here:
http://www.robotstxt.org/wc/norobots-rfc.html
When describing robots, it focuses on indexing bots. But it was written
at a time when Web robots were less varied than they are now, so the
author may not have considered your case.
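By the way, if you do decide to honor robots.txt, you don't have to parse
it by hand. Here's a minimal sketch using Python's standard-library
robots.txt parser (urllib.robotparser in current Python), fed with the
extract from your post; "MyAggregator" is a made-up user-agent name, so
it falls under the "User-agent: *" record:

```python
from urllib.robotparser import RobotFileParser

# The relevant extract from http://news.google.fr/robots.txt:
robots_txt = """\
User-agent: *
Disallow: /nwshp
"""

rp = RobotFileParser()
# parse() takes the file's lines; in a real bot you'd call
# rp.set_url("http://news.google.fr/robots.txt") and rp.read() instead.
rp.parse(robots_txt.splitlines())

feed_url = "http://news.google.fr/nwshp?topic=po&output=atom"
# can_fetch() matches the URL's path against the Disallow rules
# for the record that applies to the given user-agent.
print(rp.can_fetch("MyAggregator", feed_url))  # False: /nwshp is disallowed
```

So a parser following the draft spec agrees with your reading: the feed
URL is off-limits to any bot not given its own record in that file.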
Good luck
--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more