Wednesday, September 18, 2024

Of Course Size Matters!

I’ve been trying to stay out of the size debate for the last few days while I digest what others have been saying. Now that I’ve done that, I get to react to a few of the things I’ve been reading.

A Rhetorical Question

First of all, many people are suddenly crying “size doesn’t matter!” and that doesn’t smell right. If size really doesn’t matter, then why didn’t anyone jump on Google for having that counter (“Searching 8,168,684,336 web pages”) on their home page for so long? They have one of the sparsest home pages around but seem to believe it’s important enough to waste a few bytes on that number.

We all know that number is [BS] anyway, right? When’s the last time it changed? Oh, right. When MSN Search declared a larger number. Coincidence, I’m sure.

It seems odd to me that size became irrelevant right about the time that Yahoo! came out with a much larger number. It’s almost as if some Google fans are in denial. There’s got to be some reason that our number has evoked such emotional responses.

But, hey… that’s just me.

All that aside, how can you argue that size doesn’t matter? If Google indexed only 100,000 documents, would it be nearly as useful as it is today? Of course not. Without indexing a reasonable amount of the Web, they’d be missing important stuff.

Relevancy

Danny Sullivan says “Screw Size!” and he’s right. Having a big f’ing index doesn’t help if you can’t figure out how to return relevant documents. He’d rather we compare relevancy.

I couldn’t agree more. Relevancy is what matters and it’s a simple test you can do yourself. Try your search on both sites and see which one provides the better results.

Or you can use RustySearch and rate results while you’re at it. The RustyBrick Search Engine Relevancy Challenge aims to quantify which service produces better results. If you look at the results, you’ll see that Yahoo is quite competitive. Last I looked, we were ahead of Google by a small margin.

I know, I know. It’s not a perfect measure. There are flaws in the system. The audience is wrong. The sample isn’t large enough. Etc.

There are a lot of holes you can poke there. But that doesn’t mean it’s not useful.

Speaking of holes…

The NCSA “Study”

I knew it was going to be one of those days when the NCSA results got linked on Slashdot. As I expected, the slashdot herd jumped for joy at the chance to prove Yahoo wrong and hold Google up as the reigning champ of web search and all things non-evil.

But I didn’t see anyone look very closely at the methodology or results, which are all public (as is the source code). As Seth noted, “The methodology is severely flawed, with a sampling-error bias.” In fact, there are so many poor assumptions behind it that I had to laugh when I read about it. It’s really more of a clever hack than a scientific comparison. I see little evidence that anyone looked at the actual results.

Using randomly chosen words doesn’t reflect the real world at all. But even if you suspend logic for a while and look at some of the cases when Google “beat” Yahoo, it gets more interesting. The “extra” results on Google are dominated by pages that are simple large word lists.
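To see why random words skew toward word-list pages, here is a toy simulation. Everything in it is made up for illustration: the vocabulary, the page counts, the page lengths, and the function names are all assumptions, and it has nothing to do with the actual NCSA code or with either engine's index. Real pages don't pick their words uniformly at random either, so treat it as a rough sketch of the sampling bias and nothing more.

```python
import random

def build_corpus(rng, vocab, n_normal, n_wordlists, normal_len, wordlist_len):
    """Return a list of (kind, set_of_words) pages. All sizes are hypothetical."""
    pages = []
    for _ in range(n_normal):
        # A "normal" page uses only a small slice of the vocabulary.
        pages.append(("normal", set(rng.sample(vocab, normal_len))))
    for _ in range(n_wordlists):
        # A "word list" page contains a huge chunk of the vocabulary.
        pages.append(("wordlist", set(rng.sample(vocab, wordlist_len))))
    return pages

def match_counts(rng, vocab, pages, n_queries):
    """Count how many pages of each kind match random two-word queries."""
    counts = {"normal": 0, "wordlist": 0}
    for _ in range(n_queries):
        a, b = rng.sample(vocab, 2)  # two randomly chosen words, as in the test
        for kind, words in pages:
            if a in words and b in words:
                counts[kind] += 1
    return counts

rng = random.Random(0)  # fixed seed so the run is repeatable
vocab = [f"w{i}" for i in range(5000)]
pages = build_corpus(rng, vocab, n_normal=1000, n_wordlists=10,
                     normal_len=50, wordlist_len=2500)
counts = match_counts(rng, vocab, pages, n_queries=200)
print(counts)
```

In this made-up corpus the word lists are only 10 of 1,010 pages, yet they account for the large majority of matches, because a page holding half the dictionary matches almost any pair of random words. An engine that keeps such pages will "win" a random-word count without being any more useful.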

Seth listed one that illustrates the problem quite well. Search for “alkaloid’s observance” on Google and on Yahoo. Guess what. On Yahoo you find no results but Google shows several. Dig a bit deeper and you see that the pages Google found are garbage. This page (the #1 result) no longer contains the target phrase. So you check the cached copy and notice that it’s just a bunch of gibberish words. (Hmm. A freshness problem and a quality problem?)

You know, we index those too. But we filter ’em out because they’re pretty useless. I’m not sure why Google thinks those are good pages to include, but hey–it boosts the numbers! Our algorithms manage to suppress such pages and I doubt anyone misses them.

Believe it or not, coming up with a really good relevancy comparison is quite hard. And it’s even harder to get right when you take the humans out of the loop.

The Bottom Line

So what’s the point?

Index size matters, but it’s not all that matters. A big index is a necessary but not sufficient condition for getting search right. Good algorithms for finding relevant documents do the heavy lifting required to find the right matches for each query.

We’ve got some of those too… 🙂

Kaigene said it best back when we hit the 1 billion mark in images: “Yes, size does matter. But only if you know how to use it. ;-)”

May the best engine win!

Update: Someone just pointed out that Yahoo! Search now returns a few results for that query, both from Seth’s blog. I guess this is more of a Heisenberg problem than I first thought!

Update #2: If you read French, this post might be interesting too. You can also translate it with Babelfish. It’s a pretty good analysis of the problems with the NCSA test.

Update #3: Also, Gary Price listed several things to consider when trying to measure index sizes.

Links:
Index size debate
See what others are saying


Jeremy Zawodny is the author of the popular Jeremy Zawodny’s blog. Jeremy is part of the Yahoo search team and frequently posts in the Yahoo! Search blog as well.

