Posted by Jukka K. Korpela on 09/12/07 10:12
Scripsit Ben C:
> 1. Someone posts the URL to a newsgroup.
> 2. You forget to turn off the webserver's AutoIndex or similar, so the
> spider can just navigate its way to the url going through auto
> generated directory indexes.
>
> What are the other 8?
Here are two more scenarios in which a page gets indexed without having been
linked to from any other web page*): one relatively obvious, and one that is
imaginary though realistic (we know such things are being done with
email addresses for spamming purposes):
3. The page _was_ linked to from another page.
4. An indexing robot generates URLs automatically, more or less at random,
and tries them. It might for example try servers known to exist and append
to the server name some strings that are known to be common for web pages,
such as /help.htm or /news.html.
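To make scenario 4 concrete, here is a rough sketch of how such guessing might work. It is only an illustration, not a description of any actual robot: the function name candidate_urls and the example.com / example.org host names are made up, and the two paths are the ones mentioned above. The code only builds the candidate URLs; a real robot would then request each one and index whatever answers.

```python
from itertools import product


def candidate_urls(hosts, paths, scheme="http"):
    """Yield one guessed URL for every (host, path) combination."""
    for host, path in product(hosts, paths):
        yield f"{scheme}://{host}{path}"


# Hypothetical inputs: servers known to exist and path names
# that are common for web pages.
hosts = ["www.example.com", "www.example.org"]
common_paths = ["/help.htm", "/news.html"]

guesses = list(candidate_urls(hosts, common_paths))
for url in guesses:
    print(url)
```

The number of guesses is simply the number of hosts times the number of paths, so even a short list of common names yields many candidates once the list of known servers is long.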
*) Of course an author cannot prevent linking by others. You tell the URL to
your friend, who tells it to his pal, who sets up a link. But this common
way of getting indexed against your will falls outside the current exercise.
--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/