
Matt Cutts Reins In Googlebot

Ahead of the Search Engine Strategies 2006 conference in San Jose, Google’s best-known engineer suggested some basic approaches to making a visiting Googlebot behave when it drops by your website.

The Bot Obedience session at SES Summer 2006 inspired Cutts to post a useful entry on his blog about the subject.

Sometimes a site publisher may, for whatever reason, wish to keep certain pages from being indexed by Google. One example would be product pages posted online as part of a special promotion, where email subscribers to a company newsletter receive the address of the page. This gives the recipients a sneak peek at those pages before non-subscribers get to see them.

Protecting those pages from Googlebot can be accomplished at the site or directory level. Cutts suggests two approaches. The first one will be familiar to most publishers – the robots.txt file. Publishers who aren’t using Google Sitemaps may want to give it a try, as it offers a robots.txt analysis tool that shows how the file will hold up when Googlebot comes crawling.
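As a sketch, a robots.txt file that keeps Googlebot out of a hypothetical /promotions/ directory (the directory name here is just an illustration) would look like this:

    User-agent: Googlebot
    Disallow: /promotions/

The file has to sit at the root of the site, and well-behaved crawlers fetch it before requesting anything else.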

Pages can also be password protected. By setting up a .htaccess file (Cutts pointed to instructions for doing so), a publisher can make sure that only users with a username and password can view the guarded pages.
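On an Apache server, a basic-auth .htaccess file generally looks something like the following sketch; the realm name and the path to the password file are placeholders:

    AuthType Basic
    AuthName "Newsletter Subscribers Only"
    AuthUserFile /home/example/.htpasswd
    Require valid-user

The password file itself is created with Apache’s htpasswd utility, for example: htpasswd -c /home/example/.htpasswd subscriber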

“I’m not aware of any bot (including Googlebot) that guesses passwords, so this is quite effective at keeping content out of search engines,” he wrote.

Cutts also offered advice on meta tags, which should be placed in the head section at the top of the HTML page. Using a noindex meta tag will keep the page out of Google’s index, and the nofollow meta tag keeps Googlebot from following the outbound links on a page.
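In practice both directives can be combined in a single robots meta tag; the page title below is just an illustration:

    <head>
      <title>Subscriber Preview</title>
      <!-- keep this page out of the index and don't follow its links -->
      <meta name="robots" content="noindex, nofollow">
    </head>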

Nofollow can also be used as an attribute within individual hyperlinks, where it works the same way as the meta tag but only for that link. Cutts doesn’t recommend using it to “sculpt” Googlebot’s visit, since it is easy to miss some links. Using robots.txt or .htaccess along with the meta tags, as he advised, would be a better approach.
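Applied to a single link, the attribute looks like this (the URL is hypothetical):

    <a href="http://www.example.com/preview.html" rel="nofollow">Sneak peek</a>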

If something has been indexed by Google already, a site publisher can request its removal. Cutts closed by noting that it is easier to keep Google from crawling a page than to get it removed later.

That advice would have saved a certain school district some problems in June.


David Utter is a staff writer for murdok covering technology and business.
