Posted by sally on 03/12/07 08:29
In article <12v70rmtpklsiee@corp.supernews.com>,
gordonb.qm9ec@burditt.org says...
> >I coded up a hit counter, then extended it to see who was reading my
> >blog, by matching IP. The problem is that I am swamped by crawlers.
>
> Nice crawlers for search engines identify themselves in the user
> agent string. Also, nice crawlers obey robots.txt, so you can
> exclude portions of your site if you want. Of course, that part
> won't be indexed. Evil bots fake user agent strings of ordinary
> users.
>
> >How can I detect a human, or a crawler? If I can handle one, I can
> >negate it for the other.
>
> Unfortunately, evil bots can hire humans to work for them, if you
> had in mind such things as CAPTCHAs (decoding warped text in images).
>
> >Should I somehow use $_SERVER['USER_AGENT']? Or something else?
>
> The user agent string is one thing you can use, mostly to detect nice
> crawlers.
>
>
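
For what it's worth, about as far as the user-agent check goes is
something like this (note the key is actually $_SERVER['HTTP_USER_AGENT'],
and the bot strings below are just examples - it only catches crawlers
that identify themselves honestly):

<?php
// Very rough crawler check based on the user agent string.
// Only catches crawlers that identify themselves; evil bots faking a
// browser UA will slip straight through.
function looks_like_crawler()
{
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $bots = array('Googlebot', 'Slurp', 'msnbot', 'spider', 'crawler', 'bot');
    foreach ($bots as $bot) {
        if (stripos($ua, $bot) !== false) {
            return true;
        }
    }
    return false;
}

// Only count the hit if it doesn't look like a crawler.
if (!looks_like_crawler()) {
    // update the hit counter / log the visitor's IP here
}
?>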
You might also want to set up what I call a "junk pot", as opposed to a
honey pot (as far as I know I invented this one...). The idea is to get
the bots to waste their time somewhere away from your "real site". Take
all the free space you have on your server and fill it with junk files:
HTML, zips, jpgs, anything at all. Use as much space and as many files
as you can. If you can put up a dummy forum with guest write permissions,
that's cool too. Use lots of sub-directories with very long names as well.
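
If you want to script the junk generation, a quick sketch along these
lines would do it (the path, sizes and counts are all just examples):

<?php
// Fill a "junk pot" directory with nested sub-directories (long random
// names) and throwaway HTML pages for bots to chew on.
$base = '/var/www/html/junkpot';   // example path only

function junk_name($len = 60)
{
    $chars = 'abcdefghijklmnopqrstuvwxyz0123456789';
    $name = '';
    for ($i = 0; $i < $len; $i++) {
        $name .= $chars[mt_rand(0, strlen($chars) - 1)];
    }
    return $name;
}

for ($d = 0; $d < 50; $d++) {              // 50 sub-directories
    $dir = $base . '/' . junk_name();
    mkdir($dir, 0755, true);
    for ($f = 0; $f < 20; $f++) {          // 20 junk pages in each
        $page = '<html><body>' . str_repeat(junk_name(10) . ' ', 500) . '</body></html>';
        file_put_contents($dir . '/' . junk_name(20) . '.html', $page);
    }
}
?>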
Block access to it all in your robots.txt so genuine crawlers don't
waste their time there. Put your real site behind a single entry point
one step below the normal level, then sit back and watch the crawlers
waste their time in your junk bin.
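
Something like this in robots.txt keeps the well-behaved crawlers out of
the junk area (the /junkpot/ path is just the example name from above):

User-agent: *
Disallow: /junkpot/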
Actually I redirect Google into mine too. It's fun to watch their
thieving archive fill up with junk instead of the copyrighted material
they normally steal from everyone. (Google just proves that if you're
big enough you can ignore any law.)
I'm sure the idea can be fine-tuned by the good people here, or perhaps
somewhere more appropriate.