On sites with more than a few thousand pages, Google fails to index anywhere from ten to seventy percent of the pages it knows about.
These pages show up in Google’s main index as a bare listing of the URL, which means that Googlebot is aware of the page, but they do not show up as indexed pages. When a page is listed but not indexed, the only way to find it in a search is if your search terms hit on words in the URL itself. Even when they do, these listed pages rank so poorly compared to indexed pages that they are almost invisible. This is true even though the listed pages still retain their usual PageRank.
I have been complaining about this since April 2003, and it has become more visible in 2004. There is no method to Google’s madness, which is another way of saying that this phenomenon is not characteristic of any particular type of site. It is happening across the entire landscape of large sites. I find it on www.johnkerry.com, on searchenginewatch.com, and on dozens of other large sites I checked. Our own site, www.namebase.org, is a clean example of this, and I will use it to show how to do searches that expose the phenomenon.
You have to know what to look for and how to look for it. A listing consists of the URL, in blue, in place of the title on Google’s search results pages, and below this, in a smaller font, a “Similar pages” link in blue. That’s all. An indexed page has a real title, almost always has a snippet in black, shows the URL and the size of the page in green, and then has “Cached” and “Similar pages” links in blue. (On NameBase we disallow Google’s cache copy, so the “Cached” link is legitimately missing on all of our pages.) The two types of results are easy to tell apart once you know the difference. You should also set your Google preferences to 100 results per page, because the listed links are buried much deeper in the results.
Before I explain how to isolate the listed links from the indexed links, there are two cases I know of where a listing is normal for Google. These are exceptions to the phenomenon that interests me in this essay. Neither is relevant to NameBase, but I have to mention them in case you want to examine other sites. The first exception is when a site has certain directories disallowed in its robots.txt file. Google will habitually list the URLs in the disallowed directory but not index them. (This itself is an invasion of privacy, because filenames can be very revealing; but that’s a rant for another day.)
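To make this first exception concrete, a robots.txt of the following form is enough to produce it; the directory name is invented for illustration. Google will list, but never fetch or index, any URL under the disallowed directory:

User-agent: *
Disallow: /private/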
The second exception is when there are ID numbers at the end of the URL, particularly if they follow a question mark. Google avoids any URL that looks like it might be a problem. Sometimes this number is a session ID from a shopping-cart site. If Google followed these links, the crawler might end up grabbing thousands of duplicate pages, distinguished only by the session ID.
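A hypothetical example of the kind of URL Google avoids; the domain, script name, and parameter are invented for illustration:

http://www.example.com/cart.cgi?sessionid=8f3a2b91

The same page fetched in two different sessions would show up under two different URLs, which is exactly the duplication the crawler is trying to avoid.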
Now that you know what I’m not talking about, here is how you can investigate a site. First you have to find a word or phrase that is present on nearly every page of the site. On some of the sites we looked at, the word “reserved” from the copyright notice (as in “All rights reserved”) worked fairly well. On NameBase, we have “home page” at the bottom of every page. The “site:” command is used in conjunction with this phrase, and putting “home page” in quotes makes the search more precise:
site:www.namebase.org “home page”
That search asks for all pages from www.namebase.org that include the phrase “home page.” These will be indexed pages; if a page were merely listed, Google wouldn’t know that this phrase appears at the bottom of it. Next you can request all pages that do not contain this phrase, by inserting a minus sign in front of it:
site:www.namebase.org -“home page”
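If you want to repeat this comparison on other large sites, the two queries are easy to generate. Here is a minimal sketch in Python that builds both queries for a given site and marker phrase and opens them in a browser; the site and phrase values are just the NameBase examples from above, and num=100 matches the 100-results-per-page preference mentioned earlier.

import webbrowser
from urllib.parse import quote_plus

# Site to investigate and a marker phrase that appears on nearly every page.
# These are the NameBase examples from the text; substitute your own values.
site = "www.namebase.org"
phrase = "home page"

# First query: pages Google has actually indexed (it knows they contain the phrase).
# Second query: pages that come back without the phrase, mostly the bare URL listings.
indexed_query = f'site:{site} "{phrase}"'
listed_query = f'site:{site} -"{phrase}"'

for query in (indexed_query, listed_query):
    # num=100 shows 100 results per page, like the preference setting.
    url = "http://www.google.com/search?num=100&q=" + quote_plus(query)
    print(url)
    webbrowser.open(url)

The second query will also catch any indexed pages that genuinely lack the phrase, which is why the marker phrase has to be something that appears on nearly every page of the site.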
In the case of NameBase, this became a problem that I first noticed in April 2003. That was the month when Google underwent a massive upheaval, which I describe in my “Google is broken” essay. When that essay was written, two months after the upheaval, it would have been speculative to claim that the listed-URL phenomenon was a symptom of the 4-byte docID problem described there. It was too soon. But sixteen months later, the URL listings are looking very widespread and very suspicious. This is a major fault in Google’s index, it is getting worse, and it is much more than a temporary glitch.
Another curiosity emerged in August 2003, two months after my “Google is broken” essay. Google started showing supplemental results from an entirely separate index. If you run out of regular results, you will often see the label “Supplemental Result” in green on the last page of available links. At that time Google briefly stated on their site that they “augment results for difficult queries by searching a supplemental collection of web pages.” A representative from Google had little to add to this, but did concede that it is an entirely separate index, and then threw out a few words of spin. It sounded like a cover story. I believe that this new index was started because of a capacity problem in the main index and the need to develop new software.
Google is dying. It broke sixteen months ago and hasn’t been fixed. It looks to me as if pages that have been noted by the crawler cannot be indexed until some other indexed page gives up its docID number. Now that Google is a public company, stockholders and analysts should require that Google give a full accounting of its indexing problems and of what it is doing to fix the situation. The SEC should get involved too, because this continuing decline in the quality of Google’s main index is a significant risk factor that should have been mentioned in the prospectus.
Daniel Brandt operates Public Information Research, PO Box 680635, San Antonio TX 78268-0635
Tel: 210-509-3160
Fax: 210-509-3161
Nonprofit publisher of NameBase, http://www.yahoo-watch.org, and http://www.google-watch.org/
namebase@earthlink.net