So Yahoo has launched a new customized search results feature called MindSet. Basically it allows you to determine the type of results you want. From the Yahoo! MindSet homepage you’ll see this description of the new custom search feature.
A Yahoo! Research Labs demo that applies a new twist on search that uses machine learning technology to give you a choice: View Yahoo! Search results sorted according to whether they are more commercial or more informational (i.e., from academic, non-commercial, or research-oriented sources).
Based on that description one would assume a major search player has finally given you the ability to filter out search results that are of a commercial nature. I predict this feature is not something that will ever make it to the Yahoo! homepage but could possibly make it into the MyYahoo portal page as an option. If that’s the case then it will be quite some time before that happens.
Here’s more from their FAQ page on exactly what MindSet allows you to do that the vanilla Yahoo! search doesn’t.
1. What does this Mindset demo do?
It allows you to sort search results for your query into commercial or non-commercial (informational) results, based on whether you’re shopping or seeking information.
2. What do you mean by commercial and non-commercial (informational)?
Commercial implies that the primary purpose of a given page is to sell you something.
Informational implies that the primary purpose of the page is to provide information related to your search.
3. That sounds vague. Aren’t many web pages a combination of commercial and informational?
Yes, that’s why we assign each page a relatively continuous score ranging from -2 (most commercial) to +2 (most informational). Pages scored 0 are a balance of commercial and informational.
4. How are these scores assigned?
We’re using machine learning technology developed here at Yahoo! Research Labs to score web results.
5. Are you confident that the scoring in this demo is correct?
Remember, this demo is a work in progress, put together by scientists to test new ideas and techniques. To start the scoring process, a small team of humans scored pages manually to develop the “seed set” of pages on which machine learning would be based. For the seed set, we didn’t rigorously require everyone to use the same scoring approach, so the scoring results may need some fine-tuning.
6. Does this suggest that the whole demo is gibberish?
We don’t think so. The scoring may not be perfect, but it’s good enough to get us started. Once we get more rigorous in our approach for manually scoring the seed set (perhaps by inviting smart users like you to do the scoring), automatic scoring should improve rapidly. Meantime, you’re invited to play with the demo and share your feedback.
7. What does that slider thing at the top of the page do?
You control the slider to decide how you want the results sorted. The midpoint of the slider represents the default setting. In this position, the order of results matches Yahoo! Search web results. As you move the slider right, toward “researching” or left toward “shopping” the results are automatically re-sorted for you.
8. How does the slider position determine the re-sorting?
There are two different sorting mechanisms here:
1. default sorting done by Yahoo! Search;
2. secondary sorting based on assigned commercial and non-commercial scores.
With the slider in the middle position, only the default Yahoo! Search sort is used. When the slider is at either end, only the secondary commercial/non-commercial sort is used. But when the slider is anywhere in between, Yahoo! Mindset presents a blend of the two sorting systems. The more the slider is moved toward either extreme, the more weight we give to the second sorting method.
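The FAQ doesn’t say how the blend is actually computed, so here is a minimal Python sketch of one way it could work, assuming a simple linear mix of the two rank positions. The `Result` class, the `resort` function, and the slider range of -1.0 (shopping) to +1.0 (researching) are all my own hypothetical names and choices, not anything Yahoo! has published.

```python
from dataclasses import dataclass


@dataclass
class Result:
    title: str
    default_rank: int  # 1 = top result in the plain Yahoo! Search order
    score: float       # -2.0 (most commercial) .. +2.0 (most informational)


def resort(results, slider):
    """Re-sort results for a slider position in [-1.0, 1.0].

    Assumption: -1.0 is fully "shopping", 0.0 is the default order,
    +1.0 is fully "researching", and the blend is linear.
    """
    if not -1.0 <= slider <= 1.0:
        raise ValueError("slider must be between -1.0 and 1.0")

    weight = abs(slider)  # how much the secondary sort counts

    # Secondary order: most informational first when researching,
    # most commercial first when shopping.
    by_score = sorted(results, key=lambda r: r.score, reverse=slider > 0)
    secondary_rank = {id(r): pos for pos, r in enumerate(by_score, start=1)}

    def blended(r):
        return (1 - weight) * r.default_rank + weight * secondary_rank[id(r)]

    # Ties fall back to the default Yahoo! Search order.
    return sorted(results, key=lambda r: (blended(r), r.default_rank))
```

With the slider at 0.0 the weight is zero, so only the default rank matters; at either end the weight is 1.0 and only the score order matters, which matches the behaviour the FAQ describes.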
9. What are those little blue and orange bars under each result?
These colored bars represent the scores. A longer colored bar represents a higher absolute score for the result, and the more definitively commercial (blue on the left) or informational (orange on the right) result. Scores with neither blue nor orange bars are 0 scores. This means that Mindset has determined the page is equal parts commercial/non-commercial or completely ambiguous.
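To make the bar description concrete, here is a small Python sketch of how a score might map to a bar. The function name `score_bar` and the maximum bar length of 10 units are assumptions of mine; the FAQ only says that longer bars mean higher absolute scores, blue means commercial, and orange means informational.

```python
import math


def score_bar(score, max_len=10):
    """Map a score in [-2.0, 2.0] to a (colour, length) pair.

    Assumption: length scales linearly with the absolute score and is
    rounded up, so only an exact 0 score ends up with no bar at all.
    """
    if not -2.0 <= score <= 2.0:
        raise ValueError("score must be between -2.0 and 2.0")

    length = math.ceil(abs(score) / 2.0 * max_len)
    if length == 0:
        return ("none", 0)  # balanced or ambiguous page

    colour = "orange" if score > 0 else "blue"
    return (colour, length)
```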
10. What are those grey numbers in parentheses alongside the rank of each result?
The number in parentheses represents the default rank for each result. Notice that with the slider set in the middle position, the displayed rank numbers and the grey default rank numbers are the same. But when you move the slider right or left, results are rearranged.
11. Please tell us what you think of this demo
We’d like to keep improving and developing Mindset, and so we really value your feedback.
Our primary goal here isn’t to impress you with the results, it’s to give you a look at the underlying technology. We believe machine learning technology is powerful and has many uses. This demo was just one example (using the commercial/non-commercial classification and the search metaphor) of how this technology could be used. After you’ve read more about machine learning in the next section, perhaps you’ll think of other ways it could be used.
Technology
1. What is the nature of the technology behind this demo?
This Mindset demo is an example of machine learning applied to the problem of text
classification. Machine learning and text classification are two different fields of
technical research that found common cause about ten years ago with the emergence of the Web.
2. What is text classification?
Text classification refers to the problem of classifying documents automatically into
different subject categories. An early challenge in this field involved the effort to
automatically classify technical academic papers. For example, some forty years ago, this technology was used to automatically assign key words to papers in aerospace and medicine.
3. Why is text classification useful for the Web?
The unstructured abundance of web documents presents a new and exciting text classification challenge. Accurate, automated classification can help users find the information they seek on the Web.
4. How does this demo apply the principles of text classification?
This demo classifies Web pages as either commercial or non-commercial. In this demo, result pages classified as commercial are designated as such by the little blue bar that appears under the result. Non-commercial (informational) results have little orange bars.
5. What is machine learning?
The field of machine learning studies and develops computer algorithms that improve
automatically through experience. Machine learning can tackle the problem of automatically learning and replicating a human activity. For example, think of a baby watching what the adults do and mimicking them. Loosely speaking, machine learning technology is like that baby. Machine learning starts with a “seed set” of human-generated data. This seed set is divided into a training set and a test set. Using the training set, the machine “learns” what the human was doing in creating that set, and then it tries to apply that learning to the test set. If it “fails the test”, it goes back to the training set and “relearns”. This iteration continues until the learning is complete. For further reading on machine learning, check out http://www-2.cs.cmu.edu/~tom/mlbook.html and http://jmlr.csail.mit.edu/.
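The train, test, and relearn loop described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: a tiny perceptron-style word-weight classifier stands in for whatever Yahoo! Research Labs actually uses, labels are +1 for informational and -1 for commercial, and the function names (`split_seed_set`, `predict`, `accuracy`, `train`) are hypothetical.

```python
import random
from collections import defaultdict


def split_seed_set(seed, test_fraction=0.25, rng_seed=0):
    """Shuffle the human-labelled seed set and split it into (train, test)."""
    items = list(seed)
    random.Random(rng_seed).shuffle(items)
    n_test = max(1, int(len(items) * test_fraction))
    return items[n_test:], items[:n_test]


def predict(weights, text):
    """Return +1 (informational) or -1 (commercial) from summed word weights."""
    total = sum(weights[word] for word in text.lower().split())
    return 1 if total >= 0 else -1


def accuracy(weights, examples):
    """Fraction of (text, label) examples the current weights get right."""
    return sum(predict(weights, t) == y for t, y in examples) / len(examples)


def train(train_set, test_set, target=1.0, max_rounds=20):
    """Learn from the training set, check against the test set, and
    'relearn' until the target accuracy is reached or rounds run out."""
    weights = defaultdict(float)
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        for text, label in train_set:
            if predict(weights, text) != label:
                # Nudge every word in a misclassified page toward its label.
                for word in text.lower().split():
                    weights[word] += label
        if accuracy(weights, test_set) >= target:
            break  # "passed the test"
    return weights, rounds
```

One caveat: in current practice the set used to decide when to stop is usually called a validation set, with a separate held-out test set for the final check. The sketch follows the FAQ’s simpler two-way split.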
6. Why is Machine Learning useful for the Web?
Machine learning is especially useful for applying human-like behavior to sets of data so large that it would be infeasible for humans to do the work. When the Web took off about ten years ago, machine learning acquired a cherished prize: a huge, and ever-growing corpus of data. With billions of pages and counting, the Web is too big for humans to encompass entirely. This is where machine learning comes in.
7. How does this demo make use of machine learning?
Machine learning technology is used here to score search results as commercial or non-commercial. More accurately, this technology scores each result on a continuous scale ranging from “very commercial” to “very non-commercial”.
8. What will improve the accuracy of the classification in this demo?
The biggest improvement will come when we improve the seed set provided to the machine learning technology. A small team at Yahoo! Research Labs generated the seed set that informs this demo. Since there were very few of us, and none of us are professional editors, our seed set could use some non-trivial improvement. And since the machine learning technology can only be as “smart” as the humans who generated the seed set, improving that set will improve the accuracy of the demo.
I’ll be tinkering around with MindSet and giving a more in-depth review after I’m finished. I’m not sure it’s something I can actually see myself using much, but I think it’s interesting that Yahoo is doing something all the major search players have been toying around with forever. It’s not surprising this is a beta launch; beta’s the new “soft launch” for companies that don’t want to spend money on advertising or support when they’re not sure a product is going to last. Not that it’s a bad way to soft launch new features and products, but it’s funny to me.
Jason Dowdell is a technology entrepreneur and operates the Marketing Shift blog.