Posted by Jerry Stuckle on 03/12/07 14:46
sally@mander.com wrote:
> In article <12v70rmtpklsiee@corp.supernews.com>,
> gordonb.qm9ec@burditt.org says...
>>> I coded up a hit counter, then extended it to see who was reading my
>>> blog, by matching IP. The problem is that I am swamped by crawlers.
>> Nice crawlers for search engines identify themselves in the user
>> agent string. Also, nice crawlers obey robots.txt, so you can
>> exclude portions of your site if you want. Of course, that part
>> won't be indexed. Evil bots fake user agent strings of ordinary
>> users.
>>
>>> How can I detect a human, or a crawler? If I can handle one, I can
>>> negate it for the other.
>> Unfortunately, evil bots can hire humans to work for them, if you
>> had in mind such things as CAPTCHAs (decoding warped text in images).
>>
>>> Should I somehow use $_SERVER['HTTP_USER_AGENT']? Or something else?
>> The user agent string is one thing you can use, mostly to detect nice
>> crawlers.
>>
>>
>
> You might also want to set up what I call a "junk pot", as opposed to a
> honey pot (as far as I know, I invented this one). The idea
> is to get them to waste their time somewhere away from your "real site".
>
> Take all the free space you have on your server and fill it with junk
> files: HTML, zips, JPGs, anything at all. Just use as much and as many files
> as you can. If you can put up a dummy forum with guest write permissions,
> that's cool too. Use lots of sub-directories with very long names too.
> Block access to it in your robots.txt so genuine crawlers don't waste
> their time. Put your main site under a single entry point 1 step below
> normal.
>
> Sit back and watch the crawlers waste their time in your junk bin.
>
> Actually I redirect google into mine too.
> It's fun to watch their thieving archive fill up with junk instead of
> the copyright material they normally steal from everyone.
> (Google just proves that if you're big enough you can ignore any law.)
>
> I'm sure the idea can be fine tuned by the good people here or
> perhaps somewhere more appropriate.
>
Let's see. Sit back and watch all those spiders use up all of your
bandwidth. So your hosting company shuts you down until you pay for
more bandwidth. Not a good idea, IMHO.
And if you think Google is stealing copyrighted material, I suggest you
talk to an attorney about 'fair use' rather than spouting unfounded
claims. That could get YOU in trouble.
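
On the original question of telling crawlers from people for a hit counter:
the user agent check mentioned earlier in the thread can be sketched roughly
as below. This is only a sketch in Python, not anyone's production code. The
token list and the function names are my own assumptions for illustration,
and the list is far from complete. In PHP the same string would come from
$_SERVER['HTTP_USER_AGENT']. As already noted, this only catches crawlers
that identify themselves honestly; a bot that fakes a browser user agent
will still be counted as a human.

```python
# Illustrative, incomplete list of substrings that well-behaved crawlers
# include in their User-Agent header. The list is an assumption for this
# sketch, not something taken from the thread.
CRAWLER_TOKENS = ("googlebot", "bingbot", "msnbot", "slurp", "spider", "crawler")


def is_declared_crawler(user_agent):
    """Return True if the User-Agent string declares itself as a crawler."""
    if not user_agent:
        # Assumption: a missing or empty User-Agent is not a normal browser,
        # so it is not counted as a human visitor.
        return True
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def count_human_hits(hits):
    """Count hits whose user agent does not declare a crawler.

    `hits` is a hypothetical list of (ip, user_agent) pairs, such as a
    hit counter might log.
    """
    return sum(1 for _ip, ua in hits if not is_declared_crawler(ua))
```

For example, count_human_hits(log) over the logged (ip, user_agent) pairs
gives a hit count with the self-identified crawlers left out.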
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================