Thursday, September 19, 2024

Disabling Google and Other Search Engines From Crawling a Site

Reader question: I have an online database of horror movies, and I have a good Google rank. Over the last month, my traffic logs showed a real growth in bandwidth use. One of the most frequent user agents in the server logs is Googlebot, so this traffic was generated by Google's spider. I have a 20 GB bandwidth limit and I do not want to pay for the excess, so I blocked Google from my Web site. My question is:

If I block Google from my Web site, is it possible that Google.com will erase or drop my Web site from its index?

Many thanks for your time and keep up the good work.

Answer: Many thanks for posting this question, because Web server issues and robot exclusion are very important aspects of search engine marketing (SEM). The reader did not specifically state how he kept Googlebot from spidering his site. I am assuming that the reader used the Robots Exclusion Protocol.

Robots Exclusion Protocol

The Robots Exclusion Protocol is a means of instructing robots (also called spiders) not to crawl a site. With the Robots Exclusion Protocol, Web site owners can instruct search engine spiders not to index individual Web pages, subdirectories, or even an entire site. Instructions can also be tailored for individual search engines.

There are two types of robots exclusion: a meta tag or a text file.

To let Google know that you do not want a page crawled, you can create the following meta tag:

<META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW">

To let all search engine spiders know that you do not want a page crawled, you can create the following meta tag:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

For this tag to be effective across a whole site, you will have to place it on every page of your site. This process can be quite tedious and time-consuming. For that reason, I prefer to use the robots exclusion text file, commonly referred to as robots.txt, because it can easily be applied to an entire site.

The robots.txt file is a plain text file that you place on your server to instruct search engine spiders NOT to record the information in specified areas of your Web site, and not to follow the links on those pages. In other words, the text file lets the search engine spiders know which sections of your site are off limits.
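
For example, a minimal robots.txt (the directory name here is purely illustrative) that keeps every spider out of a /private/ area while leaving the rest of the site crawlable looks like this:

User-agent: *
Disallow: /private/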

I usually create my robots.txt files in NotePad (PC) or SimpleText (Mac). But you can create simple text files in HTML software such as Dreamweaver.

Google will request the robots.txt file before trying to index any page within your site. For example, if you do not want Google to record any of the information on your site, type the following text into a text editor:

User-agent: Googlebot
Disallow: /

Be sure to save the file as robots.txt. Do not use any other file extension. If you save the file as a Word document and call it robots.doc, Google will ignore that file.
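
If you want to exclude every spider, not just Googlebot, the same file can use the wildcard user agent instead:

User-agent: *
Disallow: /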

When search engines crawl too frequently

I understand the reader’s concern about bandwidth. If Google or any search engine crawls a site too frequently, it takes up bandwidth. All of us pay for bandwidth.

However, when you instruct Google (or any search engine) to not crawl your site, you are essentially communicating, “Don’t show my Web pages in your search results.”

I do not believe the reader’s intention was to exclude all of his Web pages from Google search engine results pages (SERPs). He just wants Google not to request pages from his server so often.
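
If the goal is simply to slow a spider down rather than shut it out, some search engines, though not Google, recognize a non-standard Crawl-delay line in robots.txt. As a rough sketch (Slurp is Yahoo!'s spider; the value is generally read as a number of seconds between requests, and support varies by engine):

User-agent: Slurp
Crawl-delay: 10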

Google actually has a Web page with this information and an email address. This is a direct quote from Google’s Webmaster FAQs page:

“Please send an email to googlebot@google.com with the name of your site and a detailed description of the problem. Please also include a portion of the weblog that shows Google accesses, so we can track down the problem more quickly on our end.”

The URL for this information is http://www.google.com/webmasters/faq.html.

When to use the Robots Exclusion Protocol

Some content, such as the items in a CGI-BIN directory, is not important to site visitors or to search engines. When your target audience searches for information, they are not interested in the programs that generate your forms or your drop-down menus. They are not interested in a section of a Web site that is under construction. They are not interested in redundant content, either. Using the Robots Exclusion Protocol ensures that this unnecessary information is not shown in search results pages.
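
Putting those examples together, a robots.txt for this kind of housekeeping might look something like the sketch below (every directory name is illustrative; substitute your own):

User-agent: *
Disallow: /cgi-bin/
Disallow: /under-construction/
Disallow: /print-versions/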

For more details about the Robots Exclusion Protocol, please visit: http://www.robotstxt.org/wc/faq.html.

Shari Thurow is Marketing Director at Grantastic Designs, Inc., a full-service search engine marketing, web and graphic design firm. This article is excerpted from her book, Search Engine Visibility (http://www.searchenginesbook.com) published in January 2003 by New Riders Publishing Co. Shari can be reached at shari@grantasticdesigns.com.
