Controlling Web crawlers (search engine spiders) with robots.txt and meta tags

Web crawlers are programs that search engines use to traverse the Web and build an index of the pages they collect. The index lets a search engine take a query from a user and return the matching pages. Web crawlers are also known as search engine spiders or robots.

You can ask search engines to keep some parts of your site out of their index. You can control how Web crawlers index your site at different levels - the entire site, specific directories, and individual pages. The two tools for this are:

robots.txt - a simple text file that specifies instructions for a large number of pages on your site.
robots meta tag - a command that can be used on individual web pages.

Normally, if you want the search engines to be able to spider all the pages of your site, you don't need a robots.txt file or robots meta tags at all. However, you may need to prevent the search engines from indexing the printer-friendly versions of your pages, which can be treated as duplicate content. Or you may want to block the engines from spidering pages that contain meta refresh tags or other risky techniques that are often used for spamming, because the search engines may penalize your site for them.

robots.txt file

It's a simple text file that resides on the server and asks Web crawlers not to access some pages of your site. When a search engine spider visits a site, it first looks for a robots.txt file. If it can't find one, it will spider all the pages. If it does find one, it reads the file to find out which pages you don't want it to spider, and then spiders only the pages you haven't disallowed.
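The lookup described above can be sketched with Python's standard urllib.robotparser module. This is only an illustration, not how any particular search engine is implemented: the rules, the crawler name "ExampleBot", and the www.example.com URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would first download
# this file from the top-level directory of the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def filter_allowed(robots_txt, user_agent, urls):
    """Return only the URLs that the rules do not disallow for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]

pages = [
    "http://www.example.com/index.html",
    "http://www.example.com/private/report.html",
]

# Only the page outside /private/ is left to crawl.
print(filter_allowed(ROBOTS_TXT, "ExampleBot", pages))

# An empty robots.txt disallows nothing, so every page stays crawlable.
print(filter_allowed("", "ExampleBot", pages))
```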

This file must be uploaded to the top-level directory of your site, not a subdirectory. Remember to use all lower case for the filename: "robots.txt", not "Robots.txt". Some sites recommend using robots.txt to block paid content you don't want people to see, but this isn't a good idea. The robots.txt file is publicly available: anyone can see which sections of your site you want to hide by typing "http://www.YourSite.com/robots.txt". Requiring a username and password to access the premium content is much more effective.
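Because the file always sits in the top-level directory, its address can be derived from the address of any page on the same site. A minimal sketch in Python follows; the helper name robots_txt_url and the sample page path are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Build the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.YourSite.com/products/list.html?page=2"))
# prints http://www.YourSite.com/robots.txt
```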

Here are some examples of robots.txt...

To exclude all Web crawlers from the entire site:

User-agent: *
Disallow: /

All search engine spiders (indicated by "*") are instructed to not index any of your pages (indicated by "/").

To allow all crawlers full access:

User-agent: *
Disallow:

(or just create an empty robots.txt file, or don't use it at all).

To disallow all robots from crawling some directories and pages:

User-agent: *
Disallow: /cgi-bin/
Disallow: /print-ready/
Disallow: /refresh.htm

You need a separate "Disallow" line for every URL you want to exclude.
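To check a rule set like this before uploading it, the same rules can be fed to Python's standard urllib.robotparser. This is a rough sketch: the crawler name "ExampleBot", the host www.example.com, and the tested paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above.
DIRECTORY_RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /print-ready/
Disallow: /refresh.htm
"""

directory_parser = RobotFileParser()
directory_parser.parse(DIRECTORY_RULES.splitlines())

def is_allowed(path):
    """True if a generic crawler may fetch the given path on the example host."""
    return directory_parser.can_fetch("ExampleBot", "http://www.example.com" + path)

for path in ["/index.htm", "/cgi-bin/search.pl", "/print-ready/index.htm", "/refresh.htm"]:
    print(path, "allowed" if is_allowed(path) else "blocked")
```

Each Disallow value acts as a path prefix, so everything under /cgi-bin/ and /print-ready/ is blocked along with the single page /refresh.htm.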

To exclude a single crawler:

User-agent: Googlebot
Disallow: /

These instructions apply only to Google's crawler. All other crawlers remain unrestricted.

To allow a single robot:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

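As you see, the rules of specificity apply, not inheritance: a crawler obeys the most specific User-agent record that matches it and ignores the others. Googlebot follows its own record, which disallows nothing, rather than inheriting the blanket "Disallow: /" from the "*" record.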
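The effect of the two records can be verified with Python's standard urllib.robotparser, which picks a record the same way. In this sketch "ExampleBot" stands for any crawler other than Googlebot, and the URL is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The "allow a single robot" rules from the example above.
SINGLE_ROBOT_RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""

single_robot_parser = RobotFileParser()
single_robot_parser.parse(SINGLE_ROBOT_RULES.splitlines())

url = "http://www.example.com/page.html"

# Googlebot matches its own record, which disallows nothing.
googlebot_allowed = single_robot_parser.can_fetch("Googlebot", url)

# Any other crawler falls back to the "*" record, which disallows everything.
other_allowed = single_robot_parser.can_fetch("ExampleBot", url)

print("Googlebot:", googlebot_allowed)   # Googlebot: True
print("ExampleBot:", other_allowed)      # ExampleBot: False
```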

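Note also that the original robots.txt standard doesn't support globbing. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you can't rely on lines like "User-agent: *bot" or "Disallow: /print/*.html". Some major crawlers, including Googlebot and Bingbot, do accept '*' and '$' wildcards in Disallow paths as an extension, but other robots may treat them as literal characters.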
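A sketch of the plain prefix matching described here, assuming a crawler that implements no wildcard extensions. The function name is_blocked and the sample paths are made up for illustration.

```python
def is_blocked(path, disallow_rules):
    """Plain prefix matching: a rule blocks every path that starts with it."""
    # An empty rule disallows nothing, so it is skipped.
    return any(rule and path.startswith(rule) for rule in disallow_rules)

# A directory prefix blocks every page below it.
print(is_blocked("/print/page.html", ["/print/"]))        # True

# The '*' is compared literally, so this rule does not match the page.
print(is_blocked("/print/page.html", ["/print/*.html"]))  # False
```

For a crawler that matches this way, the dependable way to block a group of pages is to put them in one directory and disallow the directory prefix.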

robots meta tag

It's an alternative to robots.txt if you can't upload a robots.txt file to the root directory, or if you simply need to keep Web crawlers away from a few pages on your site. The robots meta tag tells search engine spiders whether to index the page and whether to follow the links on it. By default, spiders both index and follow everything unless you tell them not to.

Typically, a website owner submits the main page, and the robots then visit the site and collect all subpages and related links starting from that page. So you generally don't need the robots meta tag.

This tag looks similar to any meta tag, and should be added to the HEAD section of your page. Here are a few examples...

Unnecessary commands:

<meta name="robots" content="index,follow">
<meta name="robots" content="all">

All these commands tell search engine spiders to index the page and follow links found on it. However, all search engines do this by default anyway.

The following commands disallow indexing of the page:

<meta name="robots" content="noindex,follow">
<meta name="robots" content="noindex">

The spiders will still follow the links found on the page.

To instruct crawlers not to follow the links contained within the page:

<meta name="robots" content="index,nofollow">
<meta name="robots" content="nofollow">

The spiders will still index the page.

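To make the page and the subsequent pages to which it links invisible to search engines:

<meta name="robots" content="noindex,nofollow">
<meta name="robots" content="none">

These commands disallow both indexing the page and following its links.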

To instruct search engines not to place your page in their cache:

<meta name="robots" content="noarchive">

Search engines may offer small "cached" links in their results, which lead to the most recent copy of the page that they have stored. This tag is useful if your page is dynamic and you don't want your visitors to have access to its old content.
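How a crawler might read these directives can be sketched with Python's standard html.parser module. This is a simplified illustration, not the logic of any real search engine: the class RobotsMetaParser, the function robots_policy, and the sample page are hypothetical, and only the noindex, nofollow, none, and noarchive values are interpreted.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from every robots meta tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )

def robots_policy(html):
    """Return what a crawler may do with the page; everything is allowed by default."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    return {
        "index": not ({"noindex", "none"} & found),
        "follow": not ({"nofollow", "none"} & found),
        "cache": "noarchive" not in found,
    }

PAGE = '<html><head><meta name="robots" content="noindex,follow"></head><body></body></html>'
print(robots_policy(PAGE))
# prints {'index': False, 'follow': True, 'cache': True}
```

A page without any robots meta tag yields True for all three keys, which matches the default behavior of indexing and following everything.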
