Tuesday, November 5, 2024

Stopping and Directing Web Spiders

Not all agents (also known as crawlers, bots, robots and spiders) that visit your site will be of benefit. Even the “good” spiders, such as the ones Google sends out to index your site, may visit places that you don’t wish them to.

Malicious spiders or web strippers can cause you a great deal of grief by taking up server resources and increasing your bandwidth usage – this can result in excess bandwidth fees. People use web stripper applications (also known as offline browsers) to download your entire site. Sometimes their goal is fairly innocent – to go through your site while offline using a locally stored copy. In other circumstances, there may be a much more devious goal – plagiarism or hacking.

How to ban spiders

If you notice entries such as Teleport Pro and WebStripper in your traffic reporting, there is something you can do about it – either via a robots.txt file or through meta-tags.
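If your host gives you raw access logs rather than a traffic report, a short script can do the spotting for you. The following is a minimal sketch, not a definitive tool: it assumes your log uses the common "combined" format, where the user agent is the last quoted field on each line, and the watch list and sample log lines are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical watch list; extend it with names from your own reports.
SUSPECT_AGENTS = ("Teleport Pro", "WebStripper", "WebZip")

# Assumption: combined log format, user agent is the last quoted field.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_suspect_agents(log_lines):
    """Count requests whose user agent contains a watched name."""
    counts = Counter()
    for line in log_lines:
        match = LAST_QUOTED_FIELD.search(line)
        if match is None:
            continue  # line is not in the expected format
        agent = match.group(1).lower()
        for name in SUSPECT_AGENTS:
            if name.lower() in agent:
                counts[name] += 1
    return counts


# Made-up sample lines, for illustration only.
sample_log = [
    '203.0.113.5 - - [05/Nov/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Teleport Pro/1.29"',
    '203.0.113.5 - - [05/Nov/2024:10:00:01 +0000] "GET /about.html HTTP/1.1" 200 2048 "-" "Teleport Pro/1.29"',
    '198.51.100.7 - - [05/Nov/2024:10:00:02 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]

counts = count_suspect_agents(sample_log)
for name, hits in counts.most_common():
    print(name, hits)
```

Any name that shows up with a large number of hits in a short period is a candidate for the banning methods described below.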

Robots meta-tag

The robots exclusion tag is very simple to implement, but it’s mainly of benefit in keeping search engine spiders out of sensitive areas. Unfortunately, most web stripping applications ignore it.

The following META tags can be used and should be placed between your <head> and </head> tags:

<meta name="robots" content="noindex">

This will prevent most search engine spiders and some web strippers from accessing the page.

Another method:

<meta name="robots" content="nofollow">

The page will still be indexed, but any hyperlinks in that page will not be followed by the spider.

The best method is to combine the two:

<meta name="robots" content="noindex,nofollow">

The page will not be indexed and no links will be followed.
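To see how a well-behaved spider might read these tags, here is a rough sketch using Python's standard html.parser module. It assumes the conventional tag form, a meta element with name="robots" and a comma-separated content value. The class name, function name and sample page are our own inventions for illustration, not part of any spider's actual code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when written without "=value".
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots directives declared in a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Made-up sample page using the combined form of the tag.
page = (
    '<html><head><meta name="robots" content="noindex,nofollow">'
    "</head><body>Private page</body></html>"
)

directives = robots_directives(page)
may_index = "noindex" not in directives
may_follow = "nofollow" not in directives
print(may_index, may_follow)
```

Remember that this only describes what a polite spider does; a web stripper is free to skip the check entirely.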

Robots.txt file

The robots.txt file is a more powerful strategy. It is a plain text file that tells agents and spiders which parts of your site they may and may not visit. These rules are known as the Robots Exclusion Standard.

ThinkHost doesn’t place a default robots.txt file in your web when you open an account, so you’ll need to create one in Notepad and upload it via FTP to your docs directory. If you are using Microsoft FrontPage, save the file to the root directory of your disk-based web and then upload it via FrontPage’s standard HTTP:// publishing function.

Never use a blank robots.txt file as some search engines may see this as an indication that you don’t want your site spidered at all! Have at least one entry in the file and remember to skip a line between entries. Also ensure that the spider/agent that you are banning doesn’t turn out to be a legitimate software browser.

To prevent specific agents and spiders from having any access to your site, put these lines into the robots.txt file:

User-agent: NameOfAgent
Disallow: /

You must record the name of the agent exactly as it appeared in your traffic reports; for example WebZip/4.0.

Skip a line between entries. You could do the same to exclude search engine spiders such as Googlebot. The “/” means disallow access to any directory.
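Before uploading a file like this, you may want to check that it does what you expect. One way, sketched below, is Python's standard urllib.robotparser module, which reads the rules much as a polite spider would. The agent name WebStripper, the example.com URL and the second "allow everyone else" entry are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Two entries separated by a blank line: ban one agent, allow the rest.
# An empty Disallow line means nothing is disallowed.
rules = """\
User-agent: WebStripper
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

url = "http://www.example.com/index.html"  # placeholder URL
stripper_allowed = parser.can_fetch("WebStripper", url)
googlebot_allowed = parser.can_fetch("Googlebot", url)

print("WebStripper allowed:", stripper_allowed)
print("Googlebot allowed:", googlebot_allowed)
```

This only tells you how a spider that obeys robots.txt would interpret the file; it cannot tell you whether a particular stripper will actually obey it.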

You can also prevent access to specific folders:

User-agent: *
Disallow: /cgi-bin/

In this example the * indicates “all”, but please note that the wildcard (*) cannot be used on the Disallow line; use “/” instead.
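If you are unsure which of your pages a folder rule covers, you can test a handful of URLs against it. The sketch below again leans on Python's standard urllib.robotparser. The helper name blocked_urls, the agent name ExampleBot and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The folder rule from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def blocked_urls(robots_txt, agent, urls):
    """Return the URLs that the given agent is told not to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


# Placeholder URLs for illustration.
urls = [
    "http://www.example.com/index.html",
    "http://www.example.com/cgi-bin/form.pl",
    "http://www.example.com/products/cgi-bin.html",
]

blocked = blocked_urls(ROBOTS_TXT, "ExampleBot", urls)
for url in blocked:
    print("Blocked:", url)
```

Note that the rule matches on the start of the path, so a page that merely has cgi-bin somewhere else in its address is not affected.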

Example robots.txt file

If you would like some sort of guide and further examples of a robots.txt file, you can take a look at the one we use on the ThinkHost site. View it here:

http://www.thinkhost.com/robots.txt

Our file is by no means complete, but it does contain a number of “idiot” bots that repeatedly attempt to strip our main site. Please be aware that robots.txt will not stop all web stripping activity, as many strippers can fake agent names, but it will help you save on bandwidth.

Good spiders

If you would like to be able to identify the “good” spiders that may visit your site, you can view a listing of the most popular search engines’ robots in our tutorial, “Understanding your web site traffic”:

http://www.thinkhost.com/services/kb/interpreting-statistics.shtml

Michael Bloch is the Business Operations Manager of ThinkHost, a USA based company that has been providing hosting solutions to the world since 1999.
