Posted by Nikita the Spider on 09/13/07 15:27
In article <fc8hq8.2g0.1@dylanparry.com>,
Dylan Parry <usenet@dylanparry.com> wrote:
> Jukka K. Korpela wrote:
>
> >> 1. Someone posts the URL to a newsgroup.
> >> 2. You forget to turn off the webserver's AutoIndex or similar, so the
> >> spider can just navigate its way to the URL by going through
> >> auto-generated directory indexes.
> >>
> > 3. The page _was_ linked to from another page.
> >
> > 4. An indexing robot generates URLs automatically, more or less at random,
> > and tries them. It might for example try servers known to exist and append
> > to the server name some strings that are known to be common for web pages,
> > like /help.htm, /news.html....
>
> 5. Someone visits your page[1] and has the Google Toolbar (or other
> similar tools) installed and reporting back to Google about the sites
> they are visiting, thus allowing Google to add the site to their index.
6. Someone sends the URL in an email through a mail service (like Gmail)
that is run by a company that also operates a search engine.
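
For what it's worth, the guessing described in point 4 can be sketched in
a few lines of Python. This is only an illustration under my own
assumptions: the host list, the path list and the function name
candidate_urls are made up, and I am not claiming any real robot works
exactly this way. It only builds the list of URLs a robot might try; it
makes no requests.

```python
from itertools import product

# Hypothetical inputs: hosts a robot already knows to exist, and path
# names that are common for web pages. Neither list comes from any
# real search engine.
KNOWN_HOSTS = ["www.example.com", "example.org"]
COMMON_PATHS = ["/help.htm", "/news.html", "/index.html"]


def candidate_urls(hosts, paths, scheme="http"):
    """Return every host/path combination as a URL a robot could try."""
    return [f"{scheme}://{host}{path}" for host, path in product(hosts, paths)]


urls = candidate_urls(KNOWN_HOSTS, COMMON_PATHS)
for url in urls:
    print(url)
```

The point is that a page with a predictable name on a known server can
turn up in such a list even if nothing links to it.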
--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more