The Great Google Filter Fiasco: Mom & Pop Take it on the Chin

On or about November 17, 2003, countless English-language ecommerce sites stopped appearing near the top of the rankings for the search terms their owners considered most important. Four days later, I discovered that adding a nonsense exclusion term shifted the links returned by Google dramatically, back to results very close to what these site owners had come to expect over the previous few months.

A filter was in place, and it could be defeated by using one, or sometimes two, exclusion terms. If an exclusion term consists of characters that would never be found on a web page, then normally the addition of this excluded term to your usual terms will make little or no difference. Under normal circumstances, a search for callback service should return the same links as a search for callback service -qwzxwq because no sane web page has the term qwzxwq on the page.
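
Mechanically, the trick is nothing more than tacking the nonsense exclusion onto the query string. Here is a minimal sketch in Python; it is an illustration, not my actual script, and the num=100 parameter simply asks Google for 100 results at a time:

```python
from urllib.parse import urlencode

# "qwzxwq" stands in for any string that no real page would contain.
NONSENSE = "qwzxwq"

def query_pair(terms):
    """Build the normal query URL and the same query with a nonsense
    exclusion term appended; in late 2003 the second form bypassed the filter."""
    base = "http://www.google.com/search?"
    plain = base + urlencode({"q": terms, "num": 100})
    with_exclusion = base + urlencode({"q": terms + " -" + NONSENSE, "num": 100})
    return plain, with_exclusion

print(query_pair("callback service"))
```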

There have been thousands of posts about the November update on various forums where webmasters trade information. When the exclusion trick was discovered on the evening of November 21, I assumed that we had only the two-day weekend to learn more about this filter. A similar trick, using a hyphen between two keywords, had worked for about three days prior to November 21, at which point it stopped working. It was this earlier trick, discovered by someone else, that made me aware of the filter in the first place. When it stopped working I already knew what to look for, and was in a position to try other tricks. That's how I stumbled onto the exclusion word as a means of turning off the filter. I announced the new trick on a popular forum, and other webmasters confirmed that it worked on their keywords. A day later some webmasters reported that if you use three words in your favorite search term, you often need two different exclusion terms after them to defeat the filter.

By Wednesday of the following week the trick was still working. I hadn't expected this, so I started this Scroogle site, using a script that compares the top 100 results Google returns with exclusion terms against the top 100 results for the same terms without the exclusion. I began recording the terms entered by visitors to the site, along with the "casualty rate" for those terms: the number of links in the unfiltered top 100 that are missing from the filtered top 100.
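
For the record, the casualty rate is just a set difference. The sketch below, with made-up URLs, shows the arithmetic; it is not the Scroogle script itself:

```python
def casualty_rate(filtered_top100, unfiltered_top100):
    """Count the links in the unfiltered top 100 that are missing
    from the filtered top 100."""
    return len(set(unfiltered_top100) - set(filtered_top100))

# Toy illustration: 28 of the unfiltered links vanish under the filter.
unfiltered = ["http://shop%d.example.com/" % i for i in range(100)]
filtered = unfiltered[:72] + ["http://directory%d.example.com/" % i for i in range(28)]
print(casualty_rate(filtered, unfiltered))   # prints 28
```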

This so-called Hit List is a moving window of the most recent 10,000 terms entered by visitors, minus duplicates and some porn. (You should be aware that porn terms are underrepresented on my Hit List of filtered terms, because I delete the most offensive ones.) There is also a cutoff on the low end to keep the file size reasonable. This means that you may not see many terms with casualty rates under five or so. The Hit List is only a sampling of the terms that visitors to this site have entered, mostly in the last 24 hours. It’s not definitive, and all it can do is give you an idea of which terms get filtered.
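
For those curious about the mechanics, the window behaves roughly like the sketch below. The names are invented and this is not the actual Scroogle code, but the 10,000-term window and the cutoff of five mirror the numbers above:

```python
from collections import OrderedDict

WINDOW = 10000    # most recent terms kept
MIN_RATE = 5      # low-end cutoff so the published file stays small

hit_list = OrderedDict()   # term -> casualty rate, oldest first

def record(term, rate, blocked_terms):
    """Add one visitor query: skip offensive terms, collapse duplicates,
    and evict the oldest entry once the window is full."""
    if term in blocked_terms:
        return
    if term in hit_list:
        hit_list.move_to_end(term)    # duplicate: treat it as the newest entry
    hit_list[term] = rate
    while len(hit_list) > WINDOW:
        hit_list.popitem(last=False)  # drop the oldest term

def publish():
    """Only terms at or above the cutoff appear in the published Hit List."""
    return [(term, rate) for term, rate in hit_list.items() if rate >= MIN_RATE]
```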

It’s confusing, that’s for sure

There is no easy answer about which terms to avoid. Two-word terms are often more deadly than either word alone. With three-word terms there are even more variables, and just rearranging the words sometimes makes a big difference. There is some sort of initial threshold that determines whether your search terms will be subjected to the filter; perhaps it is a probability variable. It seems that information sites, such as those on .edu, .org and .gov domains, have been exempted from the filter, either due to their domain name or because the terms used to reach those sites don't show up in the so-called "filter dictionary." Blogs have also been unaffected. The target is ecommerce, and only English-language sites have been hit.

Once your search terms are found in the dictionary (this is an oversimplification, but it will do for now), the pages returned by the search are analyzed for their "over-optimization" on those terms. The use of the terms in the title, in headlines, in links (domain, path and filename), and in anchor text attracts extra attention. Word density and incoming links may also play a role. There is evidence that the over-optimized keywords for ecommerce pages are precomputed, which would mean that even if you clean up your page, you still have to wait for the next run of this computation. All Google would have to do is store a few suspect words along with each web page from dot-com sites, according to some algorithm that calculates various optimization characteristics of that page. They can do this by crawling their own database. Such precomputation is likely, because it would mean that most of the computational overhead is done off-line, and only once per page.
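
To make the two stages concrete, here is a toy model of the kind of pipeline I am guessing at. Every dictionary entry, weight, and threshold below is invented for illustration; nobody outside Google knows the real values:

```python
# Toy model of the two-stage filtering guessed at above.  The dictionary
# entries, weights, and threshold are all invented for illustration.

FILTER_DICTIONARY = {"callback service", "cheap flights"}   # hypothetical
EXEMPT_SUFFIXES = (".edu", ".org", ".gov")                   # apparently exempt

def over_optimization_score(page, terms):
    """Crude score: the more places the terms appear (title, headings,
    URL, anchor text of incoming links), the higher the score."""
    terms = terms.lower()
    score = 0.0
    if terms in page.get("title", "").lower():
        score += 1.0
    if terms in page.get("url", "").lower().replace("-", " "):
        score += 1.0
    score += 0.5 * sum(terms in h.lower() for h in page.get("headings", []))
    score += 0.5 * sum(terms in a.lower() for a in page.get("anchors", []))
    return score

def filtered_out(page, terms, threshold=1.5):
    """Stage one: is the query in the 'filter dictionary'?  Stage two: does
    the page look over-optimized for it?  Informational domains are exempt."""
    host = page["url"].split("/")[2]   # assumes a full URL like http://host/path
    if host.endswith(EXEMPT_SUFFIXES):
        return False
    if terms.lower() not in FILTER_DICTIONARY:
        return False
    return over_optimization_score(page, terms) > threshold
```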

Each of these two thresholds — the dictionary lookup and the page parsing for optimized terms — is more complex than represented here. More than one layer of analysis is involved, and there are no easy answers. Some have suggested that the final level introduces a bit of randomization, solely for the purpose of keeping us all guessing. [ NOTE: Twelve days after this essay was written, I and others began to suspect that the Applied Semantics CIRCA technology, acquired by Google in April 2003, is the best explanation for the peculiar dictionary matching behavior. ]

Why did they do it?

This filter is taking down a lot of innocent sites at the worst possible time of year, the Christmas shopping season. How can Google be so dumb? By now Google's vice president of engineering, Wayne Rosing, has confirmed that this is part of a new algorithm. In other words, it's mainly deliberate. We knew this already, because otherwise Google would have turned it off or rolled back the update. The algorithm may have produced unexpected results, and it can certainly be described as a "screw-up," but it's not merely a bug.

The most plausible scenario for what happened goes back about eight months, to when Google stopped its monthly crawl of the web. Something ugly happened, Google had to throw out an entire crawl, and it reverted to old data. Ever since then, Google has functioned without the old-style, once-per-month calculation of PageRank. I speculated on this in a June essay, Is Google Broken? Today I still think this is a reasonable point of departure for understanding Google.

For the last eight months it has been easy to spam Google using keywords in the anchor text of external links. Such linking overrode PageRank so completely that strange results were showing up on the first page for very competitive searches. It used to be called “Googlebombing” when bloggers started playing with it a year earlier. But that was different. The bloggers would have fun bombing for a few weeks only. Then the next monthly crawl came along, PageRank was recomputed for the entire web, and their cute tricks were buried in the rankings. The first year of Googlebombing was mainly a consequence of the “freshbot,” not the “deepbot.” During the last eight months, however, the same tricks were sticking from one month to the next, as the old-style monthly crawl was discontinued. This opened the door for a lot of ecommerce spam.

The current filter appears to be a rear-guard attack on this ecommerce spam. It was ill-advised. One poster on a webmaster forum speculated that it got approved at Google due to a statistical oversight. Someone may have assumed that the algorithm's false-positive rate was acceptable, but computed that rate across the entire ecommerce sector. Apply it to the actual web and you discover, too late, that the false positives are heavily concentrated at the top of the results. Only the first two pages of links (10 links per page) really matter much to searchers. Indeed, the hit rate for ecommerce terms is very high, even going as deep as 100 links. Many innocent mom-and-pop sites are getting buried, and many of the sites that remain are spammy directories. It's not working well.
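
A toy calculation shows how such an oversight could slip through. Every number below is invented purely to illustrate the skew:

```python
# Invented numbers, purely to illustrate how an acceptable aggregate
# false-positive rate can still wreck the top of the results.
total_legit_pages = 1000000     # legitimate ecommerce pages competing on some term
false_positives = 20000         # of those, wrongly flagged by the filter

print(false_positives / total_legit_pages)   # 0.02 -- looks tolerable in aggregate

# But the legitimate pages that rank in the top 10 are, almost by
# definition, well optimized, which is exactly what the filter targets.
# If 7 of the 10 legitimate front-page results fall among the flagged pages:
print(7 / 10)                                # 0.7 -- the rate searchers actually see
```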

Another observer felt that the entire effort was aimed at affiliate programs, which are concentrated in the travel, real estate, adult, gambling and pharmacy sectors of ecommerce. But sites selling baby products, maternity wear, and bridal accessories, which are often home businesses run by women, have also been hit hard. Innocent site owners such as these are angry with Google. Many feel that they are being deliberately forced to bid on AdWords so as to enhance Google's profit margins in the months before filing an IPO. For its part, Google claims that the department responsible for the main index has nothing to do with the advertising side of Google. Whatever you choose to believe, the fact remains that, deliberate or not, the "dictionary" terms used in the filter overlap very substantially with the terms that fetch the highest AdWords bids.

My feeling is that Google has reached the limits of fast software when it comes to separating and ranking web pages. They cannot merely slap a new algorithm onto their index at this point without butchering innocent site owners. Perhaps it’s time for Google to do some real content analysis and clustering of pages. But that would mean more computational overhead, more hardware, more money, lower profits, and slower speeds.

Short of that, Google could use some sort of structured appeal process for webmasters who have been treated unfairly by new algorithms. Google won’t consider this because it isn’t “scalable” — which means you can’t expand it in your quest to take over the web unless you keep throwing more money and effort into it. Algorithms, on the other hand, are cool because you write them once, and copy them to 10,000 cheap computers.

Google has to do something, and they could afford to hire some ombudsmen if they wanted to. Even a contract employee on minimum wage, with a little training, can tell the difference between a spammy affiliate site and a family niche business. If Google’s Ph.D.s with their clever algorithms can’t do as well as temp employees, then the Ph.D.s should be replaced.

It’s a mess. Google’s integrity is on the line. If they keep this up, all their dreams of riches from stock options will vanish. Who’s in charge at the Googleplex anyway? There isn’t much time.

Daniel Brandt operates Public Information Research, PO Box 680635, San Antonio TX 78268-0635

Tel: 210-509-3160

Fax: 210-509-3161

Nonprofit publisher of NameBase, http://www.yahoo-watch.org, and http://www.google-watch.org/

namebase@earthlink.net
