Saturday, December 21, 2024

Matt Cutts Teaches Us To Crawl

The Google engineer followed up his WebmasterWorld PubCon Boston discussion of Google’s Bigdaddy infrastructure update and “crawl cache” with a lengthier look at the topic.

Matt Cutts Discusses Cache Crawling
Cutts’ latest blog post reviewed Bigdaddy’s crawl-caching proxy in greater depth. He even provided helpful charts to illustrate the process.

A webmaster may see numerous fetches from multiple Googlebots, each using some bandwidth as it makes its appointed rounds. That makes for a more accurate Google index, but the bandwidth usage has given some webmasters fits.

The proxy used in the Bigdaddy infrastructure works like other proxies. It handles the work of retrieving pages from websites and fulfills requests from the various Google crawlers. Instead of multiple spiders hitting a website, they hit the cache.
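
To make the idea concrete, here is a minimal sketch (our own Python illustration, not Google's code) of how such a shared fetch cache behaves: the first service to request a URL triggers a real download, and later requests from other services are answered from the stored copy, so the site is hit only once. The service names and fetch function are hypothetical.

```python
# Illustrative sketch only -- not Google's implementation.
# Several crawl services share one fetched copy of a page instead of
# each hitting the website separately.

class CrawlCachingProxy:
    def __init__(self, fetch):
        self._fetch = fetch   # function that actually downloads a URL
        self._cache = {}      # url -> page body already retrieved

    def get(self, service, url):
        # "service" is kept because the real system also applies
        # per-service robots.txt rules (discussed below).
        # On a cache hit, the stored copy is returned and the external
        # site sees no additional request.
        if url not in self._cache:
            self._cache[url] = self._fetch(url)
        return self._cache[url]


origin_hits = 0

def fetch_from_site(url):
    # Stand-in for the real download; counts how often the site is hit.
    global origin_hits
    origin_hits += 1
    return f"<html>contents of {url}</html>"

proxy = CrawlCachingProxy(fetch_from_site)
for service in ("webcrawl", "adsense", "blogsearch"):   # hypothetical service names
    proxy.get(service, "http://example.com/page.html")

print(origin_hits)  # 1 -- the site was fetched once for all three services
```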

Cutts breaks down the crawl caching in a summary in his post (spacing added; we like Matt, but we’d really like him to enjoy the Return key a bit more often 🙂):

So the crawl caching proxy works like this: if service X fetches a page, and then later service Y would have fetched the exact same page, Google will sometimes use the page from the caching proxy.

Joining service X (AdSense, blogsearch, News crawl, any Google service that uses a bot) doesn’t queue up pages to be included in our main web index. Also, note that robots.txt rules still apply to each crawl service appropriately. If service X was allowed to fetch a page, but a robots.txt file prevents service Y from fetching the page, service Y wouldn’t get the page from the caching proxy.

Finally, note that the crawl caching proxy is not the same thing as the cached page that you see when clicking on the “Cached” link by web results. Those cached pages are only updated when a new page is added to our index.

It’s more accurate to think of the crawl caching proxy as a system that sits outside of webcrawl, and which can sometimes return pages without putting extra load on external sites.
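
The robots.txt point is worth spelling out. Below is a rough illustration (again our own Python sketch, not Google's implementation) of the per-service check Cutts describes: even when a page already sits in the cache, a service whose user agent is disallowed by robots.txt never receives it. The user agents, URLs, and rules here are made up for the example.

```python
# Sketch of the per-service robots.txt rule described above -- an illustration
# under assumed names, not Google's code. A cached copy is only handed to a
# service whose own user agent is allowed to fetch that URL.

from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: Googlebot
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Page already retrieved by some earlier, permitted service.
cache = {"http://example.com/private/page.html": "<html>cached copy</html>"}

def serve(user_agent, url):
    # Caching never widens what a service may read: its own robots.txt
    # permissions are checked before the stored copy is returned.
    if not rules.can_fetch(user_agent, url):
        return None
    return cache.get(url)

print(serve("Mediapartners-Google", "http://example.com/private/page.html"))  # cached copy
print(serve("Googlebot", "http://example.com/private/page.html"))             # None: disallowed
```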
The proxy’s essential goal, reducing bandwidth, seems to have been achieved to Google’s satisfaction. Cutts wrote that “it was working so smoothly that I didn’t know it was live.”


David Utter is a staff writer for Murdok covering technology and business.
