For a while now, webmasters have fretted over why all of the pages of their websites are not indexed. As usual, there doesn't seem to be a definite answer. But some things are definite, if not automatic, and some things seem like pretty darn good guesses.
So we scoured the forums, blogs, and Google's own guidelines for ways to increase the number of pages Google indexes, and came up with our (and the community's) best guesses. The running consensus is that a webmaster shouldn't expect to get all of their pages crawled and indexed, but there are ways to increase the number.
PageRank
It depends a lot on PageRank. The higher your PageRank, the more pages will be indexed. PageRank isn't a blanket number for all your pages; each page has its own PageRank. A high PageRank gives the Googlebot more of a reason to return. Matt Cutts confirms, too, that a higher PageRank means a deeper crawl.
Links
Give the Googlebot something to follow. Links (especially deep links) from a high PageRank site are golden as the trust is already established.
Internal links can help, too. Link to important pages from your homepage. On content pages link to relevant content on other pages.
Sitemap
A lot of buzz around this one. Some report that a clear, well-structured Sitemap helped get all of their pages indexed. Google's Webmaster guidelines recommend submitting a Sitemap file, too:
· Tell us all about your pages by submitting a Sitemap file; help us learn which pages are most important to you and how often those pages change.
That page has other advice for improving crawlability, like fixing violations and validating robots.txt.
Some recommend having a Sitemap for every category or section of a site.
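For reference, a Sitemap is just an XML file following the sitemaps.org protocol. A minimal one might look like this (the URLs, dates, and frequencies below are placeholders, not recommendations):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page you want Google to know about -->
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2006-05-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/products/</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
```

The optional `priority` and `changefreq` fields are how you "tell Google which pages are most important to you and how often those pages change," per the guideline quoted above.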
Speed
A recent O’Reilly report indicated that page load time, and the ease with which the Googlebot can crawl a page, may affect how many pages are indexed. The logic is that the faster the Googlebot can crawl, the more pages it can index.
This could involve simplifying the structure and/or navigation of the site. The spiders have difficulty with Flash and Ajax, so a text version should be added in those instances.
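If you want a rough sense of how slow your pages are before worrying about this, timing a fetch is trivial. A minimal Python sketch (the two-second threshold is our own arbitrary cutoff, not anything Google has published):

```python
import time

# Hypothetical threshold: pages slower than this are worth a look.
SLOW_THRESHOLD_SECONDS = 2.0

def time_page_load(fetch, url):
    """Time how long fetch(url) takes and flag slow pages.

    `fetch` is any callable that retrieves the page, e.g. a thin
    wrapper around urllib.request.urlopen; passing it in keeps the
    sketch self-contained and testable without hitting the network.
    """
    start = time.perf_counter()
    fetch(url)
    elapsed = time.perf_counter() - start
    return elapsed, elapsed > SLOW_THRESHOLD_SECONDS

# Example with a stand-in fetch that just sleeps briefly:
elapsed, is_slow = time_page_load(lambda url: time.sleep(0.05), "http://www.example.com/")
print(is_slow)  # a 0.05-second "load" is well under the threshold
```

In practice you would average several fetches, since a single measurement is noisy.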
Google’s crawl caching proxy
Matt Cutts provides diagrams of how Google's crawl caching proxy works at his blog. The proxy was part of the Big Daddy update, intended to make the engine faster. Any one of three indexes may crawl a site and send the information to a remote server, which the remaining indexes (like the blog index or the AdSense index) then access instead of sending their own bots to physically visit your site. They all use the mirror instead.
Verify
Verify the site with Google using the Webmaster tools.
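One of Google's verification methods is adding a meta tag to your homepage's `<head>` (the token below is a placeholder; Webmaster tools generates the real value for your site):

```html
<head>
  <meta name="google-site-verification" content="YOUR-TOKEN-HERE" />
</head>
```

The alternative is uploading a verification HTML file to your site's root; either proves to Google that you control the domain.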
Content, content, content
Make sure content is original. If a page is a verbatim copy of another, the Googlebot may skip it. Update frequently to keep the content fresh. Pages with an older timestamp might be viewed as static, outdated, or already indexed.
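Spotting verbatim copies on your own site is easy to automate. A minimal Python sketch that fingerprints page text (hashing only catches near-exact copies; real duplicate detection uses fuzzier matching, such as shingling):

```python
import hashlib
import re

def content_fingerprint(text):
    """Fingerprint a page's text so verbatim copies can be spotted.

    Collapses whitespace and lowercases before hashing, so trivially
    reformatted copies of the same text still produce the same hash.
    """
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

page_a = "Welcome to our   store.\nGreat deals every day."
page_b = "welcome to our store. great deals every day."
print(content_fingerprint(page_a) == content_fingerprint(page_b))  # True: same content
```

Two pages sharing a fingerprint are candidates for consolidation before the Googlebot decides to skip one of them.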
Staggered launch
Launching a huge number of pages at once could send off spam signals. In one forum, it is suggested that a webmaster launch a maximum of 5,000 pages per week.
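Taking that forum figure at face value, planning the rollout is simple arithmetic. A Python sketch:

```python
def launch_schedule(total_pages, per_week=5000):
    """Split a page launch into weekly batches of at most per_week pages.

    The 5,000-per-week cap is the forum suggestion mentioned above,
    not an official Google limit. Returns one batch size per week.
    """
    batches = []
    remaining = total_pages
    while remaining > 0:
        batch = min(per_week, remaining)
        batches.append(batch)
        remaining -= batch
    return batches

print(launch_schedule(12000))  # [5000, 5000, 2000] -- a three-week rollout
```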
Size matters
If you want tens of millions of pages indexed, your site will probably have to be on an Amazon.com or Microsoft.com level.
Know how your site is found, and tell Google
Find the top queries that lead to your site, and remember that anchor text helps in links. Use Google's tools to see which of your pages are indexed and whether there are violations of some kind. Specify your preferred domain so Google knows what to index.
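The preferred-domain setting itself lives in Google's Webmaster tools, but you can reinforce it at the server level with a 301 redirect so only one version of each URL exists. An Apache `.htaccess` sketch, assuming the www version is preferred (swap the pattern and target to prefer non-www):

```apache
# .htaccess -- permanently redirect non-www requests to the www domain
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

This keeps links and PageRank consolidated on a single domain instead of split between two.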