Posted by Andrew Haylett on 01/22/57 11:39
Phil Earnhardt <pae@dim.com> wrote:
> On Tue, 07 Feb 2006 20:28:58 -0500, Barry Margolin
> <barmar@alum.mit.edu> wrote:
> >> I can't imagine how you would categorically block them. OTOH, the
> >> Robots Exclusion Protocol can be used to tell anyone who honors such
> >> things that you don't want your website copied.
> >
> >I wouldn't expect a manual download application to honor it. That
> >mechanism is intended to control automated web crawlers, like the ones
> >that Google uses to index all of the web.
> wget respects the Robot Exclusion Protocol; curl does not.
Hmm. wget's man page certainly says that it respects robots.txt - but
when I use its '-m' option to mirror my own site, it seems quite happy
to recurse into directories that have been explicitly disallowed in my
robots.txt.
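For reference, the behavior under discussion can be reproduced roughly like this (example.com and the /private/ path are placeholders, not the poster's actual site):

```shell
# Hypothetical robots.txt served at http://example.com/robots.txt:
#
#   User-agent: *
#   Disallow: /private/
#
# Mirror the site. wget is documented to fetch robots.txt first and to
# skip disallowed directories while recursing:
wget -m http://example.com/

# To deliberately ignore robots.txt (the behavior the poster observed
# unintentionally), wget's handling can be switched off explicitly:
wget -m -e robots=off http://example.com/
```

One caveat worth checking when wget appears to ignore robots.txt: it only applies the exclusion rules to links it discovers during recursion, not to URLs given directly on the command line, and the robots.txt must be served from the site root.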