How to keep bad robots, spiders and web crawlers away

So-called webbots or web spiders are used for many different purposes on the Internet. Examples include search engines that use them to catalog the web, email marketers that harvest email addresses, and many more. For a description of such robots, check out The Web Robots FAQ.

Some of those robots are welcome, others are not. This page will show you how I catch the bad ones, and how I stop them from bothering me again.

Definition of a bad robot

I do not like robots that have one or more of the following features:

In every case I try to find out what the robot is used for and, if I decide that I no longer want it, I block access either for that particular robot or for a particular site.

How to identify bad robots

Most of the methods below rely on your having access to the web server's access logs. Check them regularly for suspicious requests.
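One way to check the logs regularly is to tally requests per user agent and look for names you do not recognize. The sketch below assumes your server writes logs in the common Apache "combined" format; the sample lines and agent names are only illustrative.

```python
import re
from collections import Counter

# Matches the Apache "combined" log format; the user agent is the last
# quoted field. Adjust the pattern if your server logs a different format.
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

def count_user_agents(lines):
    """Return a Counter of user-agent strings seen in the log lines."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            counts[m.group(2)] += 1
    return counts

# Illustrative log lines, not real traffic.
sample = [
    '1.2.3.4 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326 "-" "Slurp/2.0"',
    '1.2.3.4 - - [10/Oct/2000:13:55:37 -0700] "GET /a HTTP/1.0" 200 100 "-" "Slurp/2.0"',
    '5.6.7.8 - - [10/Oct/2000:13:55:38 -0700] "GET / HTTP/1.0" 200 2326 "-" "Mozilla/4.0"',
]
for agent, n in count_user_agents(sample).most_common():
    print(agent, n)
```

An agent that requests hundreds of pages in a few seconds, or ignores robots.txt, is worth a closer look.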

Banning bad robots from the site

This is done with a few lines in the .htaccess file. This file contains directives for the web server and is used in this case to redirect all accesses from bad robots to a single page, which contains a short explanation of why the robot has been banned from the site.

There are two ways to ban a robot: by banning all accesses from a particular site, or by banning all accesses that use a specific ID to access the server. Most browsers and robots identify themselves whenever they request a page. Internet Explorer, for example, uses "Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)", which must be interpreted as "I'm a Netscape browser - well, actually I'm just a compatible browser named MSIE 4.01, running on Windows 98" (a Netscape browser identifies itself with "Mozilla"). In both cases the following lines are used at the beginning of the .htaccess file (note: this works with recent Apache web servers; other servers may need other commands):

     RewriteEngine on
     Options +FollowSymlinks
     RewriteBase /
To ban all accesses from IP numbers 209.133.111.* (this is the Imagelock company) use
     RewriteCond %{REMOTE_HOST} ^209\.133\.111\.
     RewriteRule ^.*$ X.html [L]
which means: if the remote host has an IP number that starts with 209.133.111., rewrite the file name to X.html and stop rewriting. Note the backslashes: in a regular expression an unescaped dot matches any character, so the dots must be escaped to match literal dots.
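RewriteCond patterns are ordinary regular expressions, so the escaping above can be checked with any regex engine. A quick sketch in Python:

```python
import re

# Same pattern as in the RewriteCond above: the dots are escaped so they
# match literal dots, not arbitrary characters.
pattern = re.compile(r"^209\.133\.111\.")

print(bool(pattern.match("209.133.111.42")))   # True: a host in the banned range
print(bool(pattern.match("209x133x111x42")))   # False: escaped dots match only dots
```

With the unescaped pattern `^209.133.111.` the second string would also match, which is why the backslashes matter.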

If you want to ban a particular robot or spider, you need its name (check your access log). To ban the Inktomi spider (called Slurp), you can use

     RewriteCond %{HTTP_USER_AGENT} Slurp
     RewriteRule ^.*$ X.html [L]
To ban several hosts and/or spiders at once, use
     RewriteCond %{REMOTE_HOST} ^209\.133\.111\. [OR]
     RewriteCond %{HTTP_USER_AGENT} Spider [OR]
     RewriteCond %{HTTP_USER_AGENT} Slurp
     RewriteRule ^.*$ X.html [L]
Note the "[OR]" after each but the last RewriteCond.

The Robot Trap

Three traps are set on this web site:
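A common design for such a trap (this is a generic sketch, not a description of the actual traps on this site) is a URL that no legitimate visitor ever requests - for example one that is disallowed in robots.txt and linked only from an invisible link - so any client that fetches it can be recorded. The helper below builds a deny rule in the same format as the RewriteCond examples above; the function names and the banned.conf file are assumptions for illustration.

```python
def deny_rule(remote_addr):
    """Build an .htaccess rewrite rule blocking one IP address.

    The escaping mirrors the RewriteCond examples above: literal dots
    must be backslash-escaped in the regular expression.
    """
    escaped = remote_addr.replace(".", r"\.")
    return (f"RewriteCond %{{REMOTE_HOST}} ^{escaped}$\n"
            "RewriteRule ^.*$ X.html [L]\n")

def record_offender(remote_addr, path="banned.conf"):
    """Append a deny rule for a client that walked into the trap."""
    with open(path, "a") as f:
        f.write(deny_rule(remote_addr))

print(deny_rule("209.133.111.42"))
```

The generated file can then be included from the server configuration, so robots that trip the trap are banned automatically on their next request.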

Download the traps

If you want to install the traps, you can download them here. You will need to customize them, as they will not work correctly right out of the box. They do work with the servers I am using, but depending on your server software and configuration they may or may not work for you. Be sure to test everything before leaving it in place.
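One way to test a user-agent ban is to request a page while pretending to be the banned robot and check that you get the ban page back. The sketch below uses Python's urllib; the URL is a placeholder for a page on your own server, and "Slurp" is assumed to be one of the banned agents.

```python
from urllib.request import Request, urlopen

# Pretend to be the banned Slurp robot; replace the URL with a page on
# your own server before running this against it.
req = Request("http://www.example.com/somepage.html",
              headers={"User-Agent": "Slurp/2.0 (test)"})
print(req.get_header("User-agent"))

# Uncomment to actually send the request; if the RewriteCond is working,
# the response body should be your X.html ban page:
# print(urlopen(req).read()[:200])
```

Repeat the test with your normal browser's user agent to make sure ordinary visitors still get through.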