Thursday, September 19, 2024

Link Reform To The Rescue

Okay, I can’t wait any longer. I’ve had this in the works for a couple of months, and now that Andy has started a similar thread I’m forced to release my findings.

There’s a major issue with how search engine spiders crawl sites and which sites they crawl. The robots.txt file is supposed to address this issue but has yet to do so entirely, so in this post I’m proposing a new spec, or an amendment to the W3C robots.txt spec. I’ll set up a site to garner support for the spec shortly.

The current robots.txt standard is failing search engines and visitors. Its limitations waste the bandwidth and resources of search engine crawlers and web site hosting providers, and they result in an unacceptable level of duplicate content in the major search engines’ indexes. This is despite the search engines’ best efforts to detect and eradicate duplicate content. I know this because it’s the number one issue I see each and every day: clients hit with automatic spam penalties from the search engines for content that isn’t spam.

It’s common practice for a web site owner to have multiple domains (aliases) resolving to one main site (primary site). It is even more common for these aliases to not employ an HTTP 301 or 302 redirect. If any redirect is used at all, it is more often than not one of the following:

  1. JavaScript redirect (often flagged as spam)
  2. Meta refresh (again, often flagged as spam). Update: As of mid-Nov. ’04, Yahoo considers a refresh of 0 seconds a 301 redirect and a refresh of 1–15 seconds a 302 redirect.
  3. A redirect within a Flash movie
  4. A programmatic redirect employing JSP, CFM, PHP, ASP, etc.

Worse still, many times no redirect mechanism is used at all. This is especially true in a shared hosting environment. Site owners just have their web host set up a DNS record for each alias domain, point those aliases to the same IP/machine/virtual directory as the main site, and think everything is just super-duper. Doh! That’s when the automatic penalties begin and the rankings drop.

Even when an HTTP 301 or HTTP 302 redirect is used, it is oftentimes merely pointed to another page that employs yet another redirect. Education on the proper use of redirects is anemic at best, and even most search engine marketers don’t fully understand how to employ one properly. Additionally, most of an SEM’s clients aren’t in a position to make the necessary server-side changes (301, 302, etc.) that tell a search engine spider what to index and what not to index.
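
For anyone who wants to see where their own site stands, it’s easy to check what each alias actually returns. The sketch below is just a diagnostic aid for illustration: the domain names are placeholders, and it only inspects the first response rather than following a whole chain of redirects.

=====================================
# check_redirects.py - diagnostic sketch; the domain names are placeholders.
import http.client

PRIMARY = "www.mainurl.com"
ALIASES = ["mainurl.com", "www.mainurl.net", "www.aliasurl.com"]

def check(host):
    """Request '/' on the host and report how (or whether) it redirects."""
    try:
        conn = http.client.HTTPConnection(host, timeout=10)
        conn.request("GET", "/", headers={"Host": host})
        resp = conn.getresponse()
        if resp.status in (301, 302):
            print("%s: HTTP %d -> %s" % (host, resp.status, resp.getheader("Location", "")))
        else:
            print("%s: HTTP %d (no server-side redirect)" % (host, resp.status))
        conn.close()
    except OSError as exc:
        print("%s: request failed (%s)" % (host, exc))

if __name__ == "__main__":
    for host in [PRIMARY] + ALIASES:
        check(host)
=====================================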

Even if a web site owner changes the robots.txt file within the web root of their site, their problems aren’t solved. The robots.txt exclusion standard doesn’t allow them to designate alias domain names (aliases) they don’t want spidered. Additionally, the W3C states there can be only one robots.txt per site, and it must reside in the root directory or it will be ignored.

But wait, there is a META robots tag that can be implemented on individual pages; doesn’t that solve the problem? In short, no.

Even if a web site owner can change the tag and tell a spider not to index a particular page, the directive can’t be applied to a specific domain, since every alias is serving up the exact same content: each aliased domain uses the same files on every site hosted in that virtual directory. This can be changed programmatically (see the sketch below), but again, most site owners don’t have access to the necessary tools or resources to accomplish such a task.
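
For sites that do have some server-side scripting available, the “programmatic” fix boils down to inspecting the requested host name and redirecting when it isn’t the primary domain. Here is a minimal sketch of the idea as a Python WSGI app with a placeholder domain name; the same logic applies in PHP, ASP, JSP, or CFM.

=====================================
# canonical_redirect.py - host-based canonicalization sketch (placeholder domain).
from wsgiref.simple_server import make_server

PRIMARY_HOST = "www.mainurl.com"

def application(environ, start_response):
    host = environ.get("HTTP_HOST", "").split(":")[0].lower()
    if host and host != PRIMARY_HOST:
        # Alias domain: send spiders and visitors to the primary site.
        location = "http://%s%s" % (PRIMARY_HOST, environ.get("PATH_INFO", "/"))
        start_response("301 Moved Permanently", [("Location", location)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html><body>Primary site content.</body></html>"]

if __name__ == "__main__":
    make_server("", 8000, application).serve_forever()
=====================================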

The syntax is <meta name="robots" content="..."> and the allowed list of terms in the content attribute is ALL, INDEX, NOFOLLOW, NOINDEX.

The robots.txt limitations aren’t a new issue at all. They have been brought to the attention of the W3C several times before. But change is needed now! As the internet continues to grow and evolve, there are more and more web pages to be indexed. Some of them are duplicates, but even more are not. In order for the major search engines to find and index these new pages, they’ll need to know which pages or domains they don’t need to index, and the only way for them to know that is for us to tell them. If internet users are going to get the absolute best results for their searches, something must be done, and it must be done now.

In order to quantifiably determine the current impact of this duplicate content issue, we must have solid data to look at.

We’ll need to find out:

  1. The percentage of domains that employ the robots.txt standard (a rough sampling sketch for this follows the list)
  2. The percentage of domains currently displaying similar content
  3. The percentage of URLs (from each search engine) flagged as “duplicate content” that continue to get crawled despite that flag
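
As a starting point for item 1, something as simple as the following could be run against a random sample of domains. The domain list here is just a placeholder; a meaningful survey would need a much larger, randomly drawn sample.

=====================================
# robots_sample.py - rough sampling sketch for item 1; domains are placeholders.
import urllib.request

SAMPLE = ["www.mainurl.com", "www.aliasurl.com", "www.example.org"]

def has_robots_txt(host):
    """Return True if the host serves /robots.txt with an HTTP 200."""
    try:
        with urllib.request.urlopen("http://%s/robots.txt" % host, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    hits = sum(1 for host in SAMPLE if has_robots_txt(host))
    print("%d of %d sampled domains serve a robots.txt (%.0f%%)"
          % (hits, len(SAMPLE), 100.0 * hits / len(SAMPLE)))
=====================================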

Update: I’ve been told by an undisclosed Google source that less than 2% of Google’s index consists of duplicate content. Even that seemingly small percentage equates to as many as 160,000,000 pages of duplicate content in Google’s 8,000,000,000-page index. I think that data alone is enough to raise a few eyebrows.

Once we have accurate data we can show, without a shadow of a doubt, that this change is a vital and necessary step in the evolution of the internet.

One Possible Solution
I propose an additional file be created that resides in the root directory of a web site. The file name is not important (linkreform.txt, anyone?) but its functionality is critical. This file should address the single biggest issue facing search relevance: duplicate content, and the effect it is having on the search industry as a whole.

Possible Syntax A
=====================================
# linkreform.txt for http://www.mainurl.com/
#
# $Id: linkreform.txt,v 1.01a 2004/11/15 01:33:07 jdowdell
#
# Identify Main URLs That Should Be Crawled
#
# Main URL: preferred crawler starting point and the URL to be used in search results
Parent-Domain: www.mainurl.com

# First Alias – non-www version of the URL
Alias-Domain: mainurl.com

# Second Alias – .net version of the URL
Alias-Domain: www.mainurl.net
Alias-Domain: mainurl.net

# Additional Alias – completely different domain name
Alias-Domain: www.aliasurl.com
Alias-Domain: aliasurl.com

# Additional Alias – completely different domain name
Alias-Domain: www.aliasurl.net
Alias-Domain: aliasurl.net

# Additional Alias – completely different domain name
Alias-Domain: www.aliasurl-a.com
Alias-Domain: aliasurl-a.com

# Additional Alias – completely different domain name
Alias-Domain: www.aliasurl-a.net
Alias-Domain: aliasurl-a.net
=====================================
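
To show that the proposal would be cheap to consume on the crawler side, here is a minimal sketch of a parser for the Syntax A format above. The function name and the comment handling are my own assumptions; nothing here belongs to an existing library or spec.

=====================================
# linkreform_parse.py - crawler-side sketch for the proposed Syntax A.

def parse_linkreform(text):
    """Return (parent_domain, [alias_domains]) from a linkreform.txt body."""
    parent = None
    aliases = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip().lower()
        if field == "parent-domain":
            parent = value
        elif field == "alias-domain":
            aliases.append(value)
    return parent, aliases

if __name__ == "__main__":
    sample = """\
Parent-Domain: www.mainurl.com
Alias-Domain: mainurl.com
Alias-Domain: www.aliasurl.com
"""
    parent, aliases = parse_linkreform(sample)
    # A crawler could now crawl the parent and skip (or consolidate) the aliases.
    print("Crawl:", parent)
    print("Skip :", aliases)
=====================================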

Here’s another proposed format that simply states the main URL the spiders should crawl, allowing owners to point several domains to the same place anonymously, without giving their competitors any information about their aliases.

Possible Syntax B
=====================================
# linkreform.txt for http://www.mainurl.com/
#
# $Id: linkreform.txt,v 1.01b 2004/11/15 01:38:11 jdowdell
#
# Identify Main URLs That Should Be Crawled
#
# Main URL: preferred crawler starting point and the URL to be used in search results
Parent-Domain: www.mainurl.com
=====================================

The easiest implementation would be for the W3C to amend the robots.txt specification and allow the following line to be added to the file.

Possible Syntax C
=====================================
# robots.txt for http://www.mainurl.com/
#
# $Id: robots.txt,v 1.01b 2004/11/15 01:41:23 jdowdell
#
# Main URL: preferred crawler starting point and the URL to be used in search results
User-agent: *
URL-to-crawl: www.mainurl.com
=====================================
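
If the directive were grafted onto robots.txt as in Syntax C, crawlers could keep their existing robots.txt handling and add a one-line scan for the new field. The sketch below uses Python’s standard urllib.robotparser for the normal directives and a hand-rolled helper (my own, hypothetical) for URL-to-crawl; the Disallow line in the sample is there purely for illustration.

=====================================
# robots_urltocrawl.py - sketch of honoring both standard robots.txt rules and
# the proposed URL-to-crawl line. No existing parser understands the new field,
# so it is scanned by hand here.
import urllib.robotparser

ROBOTS_TXT = """\
# robots.txt for http://www.mainurl.com/
User-agent: *
Disallow: /cgi-bin/
URL-to-crawl: www.mainurl.com
"""

def preferred_host(robots_body):
    """Return the value of the proposed URL-to-crawl directive, if present."""
    for line in robots_body.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("url-to-crawl:"):
            return line.split(":", 1)[1].strip()
    return None

if __name__ == "__main__":
    # Standard directives still go through a normal robots.txt parser,
    # which simply ignores the unknown URL-to-crawl line.
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())

    print("Preferred host:", preferred_host(ROBOTS_TXT))
    print("May fetch /cgi-bin/env.pl:",
          rp.can_fetch("*", "http://www.mainurl.com/cgi-bin/env.pl"))
=====================================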

By implementing this new standard we could…

  1. Reduce the bandwidth consumed by all major search crawlers
  2. Reduce the resources needed to power major crawlers
  3. Reduce the cost of hosting a web site and the demand on individual web sites’ resources
  4. Reduce the number of pages appearing in a search engine index that carry the same content on different domains
  5. By doing No. 4, allow search engines to focus more finely tuned efforts on thwarting the practice of publishing duplicate content in order to rank higher
  6. Increase end-user satisfaction by decreasing the amount of noise in typical search results
  7. Facilitate more accurate results across all major search engines by reducing the number of duplicate-content pages from non-spammers in their indexes

Possible Side Effects: Financial and Sociological (both good and bad)

  1. A reduction in the PPC revenue generated by search engines, since there would be more relevant results in the natural section
  2. Conversely, it may increase PPC revenue, since results are more accurate
  3. Society isn’t ready for the “less is more” approach just yet, since most internet users don’t know the difference between natural results and paid listings
  4. Search engines save money on overhead by using fewer resources for crawling, and web hosting providers save money on bandwidth since fewer requests would be made
  5. It could completely backfire, and engines that support it could lose face with visitors and advertisers

Tim Bray had previously made some recommendations for changes as well; he points to the related articles below.

Related articles:
Tag Issues
And Another

Some more previously proposed changes are here. This link deals with the problem of having multiple names for the same content. Most sites can be referenced by several names.

To avoid duplication, crawlers usually canonicalize these names by converting them to IP addresses. When presenting the results of a search, it is desirable to use the name instead of the IP address. Sometimes it is obvious which of several names to use (e.g. the one that starts with www), but in many cases it is not. The robots.txt file should have an entry that states the preferred name for the site.

Those recommendations were proposed by Mike Frumkin and Graham Spencer of Excite on 6/20/96 but nothing (that I know of) has been done as of yet.
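
Purely for illustration, here is roughly what that IP-based grouping looks like from a crawler’s point of view. The domain names are placeholders, and a real crawler obviously uses far more signals than a single DNS lookup.

=====================================
# canonicalize_by_ip.py - illustration of grouping site names by resolved IP.
import socket
from collections import defaultdict

NAMES = ["www.mainurl.com", "mainurl.com", "www.aliasurl.com", "www.example.org"]

groups = defaultdict(list)
for name in NAMES:
    try:
        groups[socket.gethostbyname(name)].append(name)
    except socket.gaierror:
        pass  # name doesn't resolve; skip it

for ip, names in groups.items():
    # Without a "preferred name" hint, the crawler has to guess which of these
    # names to show in results (often just the one that starts with www).
    print(ip, "->", names)
=====================================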

Back in 1996, during a breakout session concerning additions to the robots.txt standard, Martijn Koster of AOL made some good points, the gist being that simpler is better and that robots.txt is simple. That report reminds me of many meetings I’ve taken part in where everyone has great ideas and the issues are brought up, but nothing ever comes of it.

I’m proposing a call to action, and I will create linkreform.com as a hub of sorts to garner support and feedback from internet users and developers alike who want to see something happen [if there is enough interest in doing so]. Whether it’s through the W3C or a group effort by a bunch of nobodies isn’t my concern. Getting something accomplished that helps out web site owners who aren’t as technically adept as the average Marketing Shift reader is my goal, and toward that goal I will work.

Suggestions, Thoughts, Comments
If you would like to provide feedback regarding these ideas, please just submit a comment on this post for the time being. If there is enough support, we’ll create a site dedicated to fixing this issue. A few charter members of the W3C have pledged their support for this initiative. When I asked Tim about his thoughts on whether or not I was nuts, he responded with this:

=================================================
Me: Am I nuts with this idea?
Tim: The idea is not nuts.
Tim: Also check out http://www.w3.org/2001/tag/issues.html#siteData-36
=================================================

Update: I had previously stated that Tim Bray had “pledged his support” and I misstated that. My apologies to Tim. Tim said that if he has any good ideas on how to promote this idea he will pass them on. I need to get some sleep.

Jason Dowdell is a technology entrepreneur and operates the Marketing Shift blog.
