Creating a robots.txt file
By Sumantra Roy
Some people believe
that they should create different pages for different search
engines, each page optimized for one keyword and for one search
engine. Now, while I don't recommend that people create different
pages for different search engines, if you do decide to create
such pages, there is one issue that you need to be aware of.
These pages,
although optimized for different search engines, often turn
out to be pretty similar to each other. The search engines
now have the ability to detect when a site has created such
similar looking pages and are penalizing or even banning such
sites. To prevent your site from being penalized
for spamming, you need to stop each search engine's spider
from indexing pages which are not meant for it, i.e. you need
to prevent AltaVista from indexing pages meant for Google
and vice versa. The best way to do that is to use a robots.txt
file.
You should create
a robots.txt file using a text editor like Windows Notepad.
Don't use your word processor to create such a file.
Here is the basic
syntax of the robots.txt file:
User-Agent: [Spider Name]
Disallow: [File Name]
For instance,
to tell AltaVista's spider, Scooter, not to spider the file
named myfile1.html residing in the root directory of the server,
you would write
User-Agent: Scooter
Disallow: /myfile1.html
To tell Google's
spider, called Googlebot, not to spider the files myfile2.html
and myfile3.html, you would write
User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
You can, of course,
put multiple User-Agent statements in the same robots.txt
file. Hence, to tell AltaVista not to spider the file named
myfile1.html, and to tell Google not to spider the files myfile2.html
and myfile3.html, you would write
User-Agent: Scooter
Disallow: /myfile1.html
User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
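One way to sanity-check a file like this before uploading it is to feed it to a parser and ask what each spider would be allowed to fetch. The following is a minimal sketch using Python's standard-library urllib.robotparser module and the rules from the example above. Python's parser is only one interpretation of the format, so treat its answers as a guide rather than a guarantee of how a given spider will behave.

```python
from urllib.robotparser import RobotFileParser

# The combined example from the text, one directive per line.
ROBOTS_TXT = """\
User-Agent: Scooter
Disallow: /myfile1.html
User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(spider, path) is true when the rules let that spider fetch the path.
for spider in ("Scooter", "Googlebot"):
    for path in ("/myfile1.html", "/myfile2.html", "/myfile3.html"):
        verdict = "allowed" if parser.can_fetch(spider, path) else "blocked"
        print(f"{spider:10} {path:15} {verdict}")
```

With these rules, Scooter should be blocked only from myfile1.html, and Googlebot only from myfile2.html and myfile3.html.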
If you want to
prevent all robots from spidering the file named myfile4.html,
you can use the * wildcard character in the User-Agent line,
i.e. you would write
User-Agent: *
Disallow: /myfile4.html
However, the original robots.txt standard does not let you
use the wildcard character in the Disallow line.
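To illustrate the two points above, here is a small sketch, again using Python's standard-library urllib.robotparser as a stand-in for a real spider. It shows that a * record applies to any spider name, and that under the original standard a Disallow value is matched as a prefix of the path, which is why no wildcard is needed to block a whole directory. The spider name SomeOtherBot and the /drafts/ directory are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# A record with "User-Agent: *" applies to every spider.
wildcard_rules = RobotFileParser()
wildcard_rules.parse([
    "User-Agent: *",
    "Disallow: /myfile4.html",
])
# "SomeOtherBot" is a hypothetical spider name used only for illustration.
print(wildcard_rules.can_fetch("Scooter", "/myfile4.html"))       # blocked
print(wildcard_rules.can_fetch("SomeOtherBot", "/myfile4.html"))  # blocked
print(wildcard_rules.can_fetch("SomeOtherBot", "/myfile1.html"))  # allowed

# A Disallow value is treated as a path prefix, so a directory
# (hypothetical here) can be blocked without any wildcard.
prefix_rules = RobotFileParser()
prefix_rules.parse([
    "User-Agent: *",
    "Disallow: /drafts/",
])
print(prefix_rules.can_fetch("Googlebot", "/drafts/page1.html"))  # blocked
print(prefix_rules.can_fetch("Googlebot", "/index.html"))         # allowed
```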
Once you have
created the robots.txt file, you should upload it to the root
directory of your domain. Uploading it to any sub-directory
won't work - the robots.txt file needs to be in the root directory.
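As a quick illustration of where spiders look for the file, the sketch below derives the robots.txt address from the address of any page on the same site. The function name robots_txt_url and the domain www.example.com are placeholders chosen for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# www.example.com is a placeholder domain.
print(robots_txt_url("http://www.example.com/travel/tourism-in-australia-go.html"))
# prints http://www.example.com/robots.txt
```

Whatever sub-directory the page lives in, the result always points at the root of the domain.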
I won't discuss
the syntax and structure of the robots.txt file any further
- you can get the complete specifications from http://www.robotstxt.org/wc/norobots.html
Now we come to
how the robots.txt file can be used to prevent your site from
being penalized for spamming in case you are creating different
pages for different search engines. What you need to do is
to prevent each search engine from spidering pages which are
not meant for it.
For simplicity,
let's assume that you are targeting only two keywords: "tourism
in Australia" and "travel to Australia". Also, let's assume
that you are targeting only three of the major search engines:
AltaVista, HotBot and Google.
Now, suppose
you have used this convention for naming the
files: each page is named by joining the individual words
of the keyword for which the page is being optimized with hyphens.
To this is added another hyphen and the first two letters of the
name of the search engine for which the page is being optimized.
Hence, the files
for AltaVista are
tourism-in-australia-al.html
travel-to-australia-al.html
The files for
HotBot are
tourism-in-australia-ho.html
travel-to-australia-ho.html
The files for
Google are
tourism-in-australia-go.html
travel-to-australia-go.html
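The naming convention above can be sketched as a small Python function. The function name page_filename is an assumption made for this example, and the lower-casing step is inferred from the file names listed above.

```python
def page_filename(keyword: str, engine: str) -> str:
    """Build a file name from a keyword and a search engine name."""
    # Join the keyword's words with hyphens, in lower case.
    slug = "-".join(keyword.lower().split())
    # Append the first two letters of the engine name.
    suffix = engine[:2].lower()
    return f"{slug}-{suffix}.html"


KEYWORDS = ["tourism in Australia", "travel to Australia"]
ENGINES = ["AltaVista", "HotBot", "Google"]

for engine in ENGINES:
    for keyword in KEYWORDS:
        print(page_filename(keyword, engine))
```

Running this prints the same six file names listed above, in the same order.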
As I noted earlier,
AltaVista's spider is called Scooter and Google's spider is
called Googlebot.
A list of spiders
for the major search engines can be found at http://www.jafsoft.com/searchengines/webbots.html
Now, we know
that HotBot uses Inktomi, and from this list
we find that Inktomi's spider is called Slurp. Using this
knowledge, here's what the robots.txt file should contain:
User-Agent: Scooter
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html
User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html
User-Agent: Googlebot
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
When you put the above lines in
the robots.txt file, you instruct each search engine not to
spider the files meant for the other search engines.
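Writing these records by hand becomes error-prone as the number of keywords and search engines grows. As a rough sketch, the file above could instead be generated from the naming convention. The function name build_robots_txt and the two tables are assumptions made for this example; the spider names and suffixes are the ones used in this article.

```python
# Hyphenated keywords, as used in the file names.
KEYWORD_SLUGS = ["tourism-in-australia", "travel-to-australia"]

# Spider name -> file-name suffix of the engine it crawls for.
SPIDER_SUFFIX = {"Scooter": "al", "Slurp": "ho", "Googlebot": "go"}


def build_robots_txt(spider_suffix: dict, slugs: list) -> str:
    """Block each spider from the pages written for the other engines."""
    lines = []
    for spider, own_suffix in spider_suffix.items():
        lines.append(f"User-Agent: {spider}")
        for suffix in spider_suffix.values():
            if suffix == own_suffix:
                continue  # a spider may fetch its own engine's pages
            for slug in slugs:
                lines.append(f"Disallow: /{slug}-{suffix}.html")
    return "\n".join(lines) + "\n"


print(build_robots_txt(SPIDER_SUFFIX, KEYWORD_SLUGS), end="")
```

The output should match the fifteen lines shown above. Adding a keyword or an engine then means changing one table rather than editing every record.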
When you have
finished creating the robots.txt file, double-check to ensure
that you have not made any errors anywhere in it. A small
error can have disastrous consequences - a search engine may
spider files which are not meant for it, in which case it
can penalize your site for spamming, or it may not spider
any files at all, in which case you won't get top rankings
in that search engine.
A useful tool
to check the syntax of your robots.txt file can be found at
http://www.tardis.ed.ac.uk/~sxw/robots/check/.
While it will help you correct syntactical errors in the robots.txt
file, it won't help you correct any logical errors, for which
you will still need to go through the robots.txt thoroughly,
as mentioned above.
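Part of that logical check can be automated. The sketch below, which is only one possible approach, uses Python's standard-library urllib.robotparser to confirm that each spider can fetch its own pages and none of the others. The function name find_logic_errors and the deliberately broken BAD_RULES example are assumptions made for illustration, and Python's parser may not behave exactly like every real spider.

```python
from urllib.robotparser import RobotFileParser


def find_logic_errors(robots_lines, pages_by_spider):
    """Return (spider, page, problem) tuples where the rules contradict the plan."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    errors = []
    for spider in pages_by_spider:
        for owner, pages in pages_by_spider.items():
            for page in pages:
                allowed = parser.can_fetch(spider, page)
                if owner == spider and not allowed:
                    errors.append((spider, page, "own page is blocked"))
                elif owner != spider and allowed:
                    errors.append((spider, page, "other engine's page is allowed"))
    return errors


# The pages each spider is supposed to index.
PAGES = {
    "Scooter": ["/tourism-in-australia-al.html"],
    "Googlebot": ["/tourism-in-australia-go.html"],
}

GOOD_RULES = [
    "User-Agent: Scooter",
    "Disallow: /tourism-in-australia-go.html",
    "User-Agent: Googlebot",
    "Disallow: /tourism-in-australia-al.html",
]

# Hypothetical typo: "g0" instead of "go" in Scooter's record.
BAD_RULES = [
    "User-Agent: Scooter",
    "Disallow: /tourism-in-australia-g0.html",
    "User-Agent: Googlebot",
    "Disallow: /tourism-in-australia-al.html",
]

print(find_logic_errors(GOOD_RULES, PAGES))
print(find_logic_errors(BAD_RULES, PAGES))
```

The first call should report no problems, while the second should flag that Scooter is allowed to fetch the page meant for Google. A syntax checker would accept both files, since the typo is valid syntax.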
Article by Sumantra
Roy. Sumantra is one of the most respected search engine positioning
specialists on the Internet.