Saturday, December 14, 2024

Controlling Search Engine Spiders

Sometimes you have pages on your website that you don’t want the search engines to see – maybe they’re not optimized yet, or maybe they’re not quite relevant to your site’s theme. In other cases, you want to get rid of an annoying search robot that’s cluttering up your logs. Whatever your reason for wanting to keep the spiders under control, the best way to do so, by far, is to use a “robots.txt” file on your website.

Robots.txt is a simple text file that you upload to the root directory of your website. Spiders read and process this file before they crawl your site. The simplest robots.txt file possible is this:

User-agent: *
Disallow:

That’s it! The first line identifies the user agent – an asterisk means that the following lines apply to all agents. The blank after the “Disallow” means that nothing is off limits. This robots.txt file doesn’t do anything – it allows all user agents to see everything on the site.
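One note on placement: the file has to sit at the top level of your site, not in a subdirectory. If your site lived at www.example.com (a placeholder domain, of course), the spiders would request the file at:

http://www.example.com/robots.txt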

Now, let’s make it a little more complex – this time, we want to keep all spiders out of our /faq directory:

User-agent: *
Disallow: /faq/

See how simple it is? The trailing slash is necessary to indicate that this is a directory. We can also disallow more than one directory:

User-agent: *
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /info/about/

That was easy, but what if we want to disallow access to only one file? It’s just as simple – list the path to the file, starting from the root:

User-agent: *
Disallow: /about.html
Disallow: /faq/faqs.html

Now let’s get specific. So far, we’ve created rules that apply to all spiders, but what about an individual spider? Just use its name.

User-agent: Googlebot
Disallow: /faq/

Now, let’s combine individual spider control with a catch-all:

User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /faq/

This set of commands tells Googlebot to take a hike – the slash character (“/”) by itself means that the entire site is disallowed. For all other user-agents, we’ve just kept them out of the /faq directory.

Each record in a robots.txt file consists of a user-agent line, followed by one or more Disallow directives. The blank line between the two user-agent records is necessary for the file to be processed properly.

If you’d like to add comments, use the “#” character like so:

# keep spiders out of the FAQ directory
User-agent: *
Disallow: /faq/
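Putting these pieces together, a complete robots.txt might look something like this – the spider name, directories, and file here are just examples:

# keep Googlebot out of the entire site
User-agent: Googlebot
Disallow: /

# keep everyone else out of the FAQ directory and the about page
User-agent: *
Disallow: /faq/
Disallow: /about.html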

You can create the file with any text editor that saves plain text. I like Notepad or Unixedit, both of which are free.

If you don’t feel like using a text editor, or just don’t want to deal with creating your own robots.txt by hand, there are free generator programs that will build the file for you.

It’s also worth running the file through a robots.txt validator after you’ve uploaded it, to make sure that it will really work.

The following is a listing of the four major search engine spiders and their associated user-agents.

# Altavista (Altavista search engine only)
User-agent: Scooter

# FAST/AllTheWeb (AllTheWeb search engine)
User-agent: fast

# Google (Google Search Engine)
User-agent: Googlebot

# Inktomi (Anzwers, AOL, Canada.com, Hotbot, MSN, etc.)
User-agent: slurp
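To target one of these spiders, just drop its user-agent name into a record of its own. As a hypothetical example, this file would keep AltaVista’s Scooter out of an images directory while leaving everything open to the other spiders:

# keep Scooter out of the images directory
User-agent: Scooter
Disallow: /images/

# everyone else can see everything
User-agent: *
Disallow: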

My policy is to exclude search engine spiders only from pages that may contain words that aren’t relevant to my site’s theme. More often, we use robots.txt to keep away all of the annoying little spiders that aren’t from the search engines, but that’s a story for another day!

I wish you success…

Dan Thies is a well-known writer and teacher on search engine marketing. He offers consulting, training, and coaching for webmasters, business owners, SEO/SEM consultants, and other marketing professionals through his company, SEO Research Labs. His next online class will be a link building clinic beginning March 22.
