Posted by Nikita the Spider on 09/01/06 13:54
In article <1157102156.694620.114720@m73g2000cwd.googlegroups.com>,
"Paul" <desotuatail@aol.com> wrote:
> nice.guy.nige wrote:
> > While the city slept, Nikita the Spider (NikitaTheSpider@gmail.com)
> > feverishly typed...
> >
> > [...]
> > > But I'm wondering what you mean by saying your Web
> > > site is "under attack". Yahoo! Slurp and Googlebot try to be
> > > reasonably polite when spidering a site.
> >
> > Indeed... Last time around, Googlebot even made me a cup of tea! ;-)
> >
>
> Can I get a web robot to only see one or 2 files as they link to areas
> of my site that I do want indexed. LIKE
>
> User-agent: *
> Allow: /history.php
> Disallow: /
Paul,
My guess is that this will work with most bots, but it isn't a
sure thing. Oddly enough, robots.txt is not as clearly standardized as
HTML or HTTP. The closest thing to an authoritative reference is the
few pages on robotstxt.org -- there's no RFC that defines the format.
The original description of robots.txt in 1994 didn't include "Allow:"
fields. An updated proposal from 1996 defines them, but that proposal
never made it beyond draft stage:
http://www.robotstxt.org/wc/norobots-rfc.html
Since it was a draft proposal, does that make it more or less
authoritative than the 1994 document? It's up to robot authors to
decide. My spider (see my sig) obeys all of the 1994 and 1996
specifications (except for one small part where the 1996 spec
contradicts the 1994 document), so my spider would understand Allow:
fields in your robots.txt.
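
One quick way to see how an Allow-aware parser reads the file quoted
above is Python's standard-library urllib.robotparser. This is only a
sketch: the bot name "ExampleBot" and the example.com URLs are made up
for illustration, and it says nothing about how any particular search
engine's bot resolves the same rules.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the question, inlined so nothing is fetched.
ROBOTS_TXT = """\
User-agent: *
Allow: /history.php
Disallow: /
"""

def build_parser(text):
    """Return a RobotFileParser loaded from a robots.txt string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# "ExampleBot" and example.com are hypothetical placeholders.
history_ok = parser.can_fetch("ExampleBot", "http://example.com/history.php")
other_ok = parser.can_fetch("ExampleBot", "http://example.com/index.php")

print("history.php allowed:", history_ok)
print("index.php allowed:  ", other_ok)
```

This parser applies the first rule that matches, so the order of the
Allow and Disallow lines matters: with Allow listed first, history.php
is let through and everything else is blocked.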
The documentation for Yahoo! Slurp and MSNBot makes no mention of
Allow: and states clearly that both follow the 1994 version of the
spec:
http://help.yahoo.com/help/us/ysearch/slurp/index.html
http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_FAQ_MSNBotIndexing.htm
I don't know if Googlebot will be as nice to you as it is to Nige (who
made me laugh). Google's documentation says the same as Yahoo's and
MSNBot's, but it also uses "Allow" fields in its examples, so Googlebot
clearly supports them.
My guess is that all of the big-name bots support Allow:, simply
because it isn't hard to support; robots.txt isn't hard to parse in the
first place. But I can't back up that guess with anything other than
warm fuzzies, which sound nice but are no substitute for hard facts (or
even documentation!), and those I can't provide.
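
To show how little parsing is involved, and why the Allow line isn't a
sure thing, here is a deliberately simplified sketch in Python. The
function name is_allowed and the honor_allow switch are my own
inventions for illustration, not anything from either document. It
looks only at the "User-agent: *" record and uses first-match
semantics; a real bot would also handle named user agents, multiple
User-agent lines per record, and URL escaping.

```python
# The robots.txt from the question.
ROBOTS_TXT = """\
User-agent: *
Allow: /history.php
Disallow: /
"""

def is_allowed(robots_txt, path, honor_allow=True):
    """First-match check against the 'User-agent: *' record only.

    honor_allow=False imitates a parser that knows only the 1994
    fields and therefore skips Allow lines as unrecognised.
    """
    in_star_record = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            in_star_record = (value == "*")
        elif in_star_record and field in ("allow", "disallow"):
            if field == "allow" and not honor_allow:
                continue  # an Allow-unaware parser ignores this line
            if value and path.startswith(value):
                return field == "allow"
    return True  # no rule matched, so the path is allowed

print(is_allowed(ROBOTS_TXT, "/history.php"))
print(is_allowed(ROBOTS_TXT, "/history.php", honor_allow=False))
print(is_allowed(ROBOTS_TXT, "/index.php"))
```

With Allow honored, /history.php gets through; with Allow ignored, the
same file falls under "Disallow: /" and is blocked along with
everything else. That is the risk with a bot that implements only the
1994 fields.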
HTH
--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more