How To Protect Your Files From Robots
Category: Internet Tips | Date: 2003-05-06
Optimising website pages for the search engines without running into trouble at the very least causes most of us webmasters to keep our brain cells finely honed, and at worst induces massive migraines!
One of the most common challenges for us all is how to present "clean", relevant and original content to a wide range of visitors.
You may find that you want to exclude search engine and other robots from all or part of your website for a number of reasons, including:
- you want to write similar pages for different types of visitors, but don't want to be penalized for duplication.
- you want to prepare pages or files that you don't want viewed.
It's very easy to achieve this by one of two means: you can use either a robots.txt file or a meta tag.
Let's de-mystify the process of writing these files and tags!
WHAT IS A ROBOTS.TXT FILE?
A robots.txt file is a set of instructions to the robots that travel the web, spidering the pages they find there. Such instructions can take several forms - telling robots how often to traverse the site, whether to visit it at all, and which parts to read.
The robots.txt file we're considering here is an exclusion instruction - think of it as a "no entry" sign to robots.
You can write a file to exclude ("disallow") robots from all, or just part of your site.
Before you begin, you need to know how to write the .txt file.
Prepare it in a plain-text editor such as Notepad. Don't attempt it in Word or an HTML editor such as FrontPage. When you're finished, save it as "robots.txt".
WHAT TO PUT IN YOUR ROBOTS.TXT FILE.
If you want to disallow all robots, you'd write:
User-agent: *
Disallow: /
And that's all. Nothing else.
What about if you only want to exclude part of your site?
Let's pretend you're running a website which advises on raising children. Your material will be relevant to surfers who live in many countries, but if you want them to really sit up and look, especially if you want them to buy from you, you'll need to make sure that your content is region-specific, including references, idiom and spelling.
This situation is an ideal candidate for a robots exclusion .txt file.
You've written all the pages you want to show to surfers in Canada, the UK and Australia in three separate directories, which surfers will access by clicking on an appropriate link on your main pages.
The directories are:
/ca/
/uk/
/au/
To disallow robots from these directories, write the following .txt file:
User-agent: *
Disallow: /ca/
Disallow: /uk/
Disallow: /au/
It may be that you want to allow some robots and disallow others.
In our example, it may be that you want to disallow just one robot from one directory, in which case you'd write:
User-agent: NastyBot
Disallow: /ca/
Or, to exclude all robots except one which you want to traverse all of your site:
User-agent: NiceBot
Disallow:
User-agent: *
Disallow: /ca/
Note that if you don't enter a slash, the robots are permitted to read the whole site, and "*" means all known robots. So in the last .txt file example, all robots are excluded from your Canadian directory except NiceBot, which can read the whole site.
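You can sanity-check a file like this before uploading it. Python's standard library ships urllib.robotparser, which applies the same matching rules a well-behaved robot would. This is a minimal sketch using the example file above; the bot names are just the placeholders from the example:

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above: NiceBot may read everything,
# all other robots are excluded from /ca/.
robots_txt = """\
User-agent: NiceBot
Disallow:

User-agent: *
Disallow: /ca/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# NiceBot is allowed everywhere, including the Canadian directory.
print(parser.can_fetch("NiceBot", "/ca/page.html"))
# Any other robot is excluded from /ca/ ...
print(parser.can_fetch("OtherBot", "/ca/page.html"))
# ... but may still read the rest of the site.
print(parser.can_fetch("OtherBot", "/index.html"))
```

Running a quick check like this catches typos (a missing slash, a misspelled directory) before a real robot ever sees the file.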
Easy, isn't it?
WHERE TO PUT YOUR ROBOTS.TXT FILE
Once created, your file needs to go into your root directory. This is the same directory which contains your home page. Don't put it anywhere else, because the robots won't see it.
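One way to see why the root directory matters: a robot always requests the file from the top of your domain, no matter how deep the page it started from. A quick sketch with Python's standard urljoin shows the resolution (example.com is a placeholder domain):

```python
from urllib.parse import urljoin

# However deep the page a robot finds, it looks for robots.txt
# at the root of that domain (example.com is a placeholder).
page = "http://example.com/articles/parenting/page1.html"
print(urljoin(page, "/robots.txt"))  # http://example.com/robots.txt
```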
Note that you can only have ONE robots.txt file per site, so any modifications will need to be integrated into
your original file.
Note also that writing an exclusion robots.txt file means the disallowed pages won't be crawled or indexed, but that won't matter if you've optimized your indexed pages properly.
In our CA/UK/AU example above, your traffic will find your indexed global/US pages via the search engines, and will make the link to their "nationality" page from the point of entry to your site - we've all seen the little flag links on other sites - just put up a flag graphic and say, for example, "UK Visitors Click Here".
If you want to learn more about exclusion robots.txt files, visit:
http://www.robotstxt.org/wc/exclusion-admin.html
If you prefer or need to exclude individual pages from being viewed by robots, you can do this using a robots.txt file, but you can also achieve it using a meta tag on your web page, between the <head> and </head> tags. The universal exclusion is as follows:

<meta name="robots" content="noindex, nofollow">
It may be that you want robots to index your pages, but not to archive them. There may be a range of reasons why you don't want search engines to keep copies of old pages - the most prevalent one among webmasters is because they are cloaking pages and don't want it known that the page served to search engines is different from the one seen by surfers, but it's also possible to have perfectly "legitimate" reasons for wanting to exclude parts of your site from public scrutiny.
Whatever your reason, if you want to avoid your page being archived, the universal tag is:

<meta name="robots" content="noarchive">
For Google (the search engine whose cache feature you are most likely to want to keep your pages out of), the tag is:

<meta name="googlebot" content="noarchive">
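A robot discovers these directives by parsing the page's HTML. The sketch below uses Python's standard html.parser to pull the robots directives out of a page, much as a crawler would; the sample page is invented for illustration:

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collects the content of any <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.directives.extend(
                d.strip().lower() for d in attrs.get("content", "").split(","))

# A made-up page carrying the universal exclusion tag.
page_html = """<html><head>
<meta name="robots" content="noindex, nofollow">
</head><body>Private page</body></html>"""

finder = RobotsMetaFinder()
finder.feed(page_html)
print(finder.directives)  # ['noindex', 'nofollow']
```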
To learn more about exclusion meta tags, visit:
http://www.robotstxt.org/wc/exclusion.html#meta
Don't be put off by the jargon; writing these files and tags is one of the easiest and most useful technical tasks you can undertake as a webmaster - write a file today and save yourself hundreds of hours!
(Erika Lawal copyright 2003)
About the author.
Erika Lawal writes Daily Internet Marketing Tips for webmasters desperately in search of cutting edge site optimization and marketing advice that produces results.
Get a FREE series of our Tips by visiting:
robotstxt@demandmail.com
http://www.dailyinternetmarketingtips.net