Thursday, September 19, 2024

Netflix Dataset Cracked, Subscribers Profiled

Netflix offered a million-dollar reward to anyone who could improve upon its recommendation engine by ten percent. Two researchers accomplished a lot more with the “anonymized” dataset.

The Netflix Prize provided researchers with records comprising 100,480,507 movie ratings submitted by 480,189 subscribers between December 1999 and December 2005. The company challenged people to beat Netflix at its own recommendations.

The physics arXiv blog noted that Netflix claimed to have removed personal details from the dataset before making it available. However, Arvind Narayanan and Vitaly Shmatikov at the University of Texas at Austin figured out how to de-anonymize that data.

The research paper on how they did it demonstrated the inherent risk in publishing such micro-data, or information about specific individuals.

“Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information,” the researchers said in the paper’s abstract.

Through their algorithmic work, the researchers could link records in the Netflix dataset to ratings posted on the Internet Movie Database website:

We expect that for Netflix subscribers who use IMDb, there is a strong correlation between their private Netflix ratings and their public IMDb ratings. Note that our attack does not require that all movies rated by the subscriber in the Netflix system be also rated in IMDb, or vice versa. In many cases, even a handful of movies that are rated by the subscriber in both services would be sufficient to identify his or her record in the Netflix Prize dataset…

Briefly, people who rated movies publicly on IMDb around the same time they rated those movies privately on Netflix gave the researchers enough data to single out their individual records.
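The kind of matching described above can be illustrated with a toy example. This is only a minimal sketch under stated assumptions, not the researchers' algorithm: the data, the function names (match_score, best_candidate) and the tolerance values are all hypothetical. It counts how many publicly known ratings an anonymized record approximately agrees with, on both rating and date, and reports a match only when one record clearly stands out from the rest.

```python
from datetime import date

# Hypothetical anonymized records: record ID -> {movie: (rating, rating date)}.
ANONYMIZED = {
    "r1": {"A": (5, date(2005, 3, 1)), "B": (2, date(2005, 3, 4)), "C": (4, date(2005, 6, 9))},
    "r2": {"A": (3, date(2004, 1, 10)), "D": (5, date(2004, 2, 2))},
    "r3": {"B": (2, date(2005, 8, 20)), "C": (1, date(2005, 9, 1)), "E": (4, date(2005, 9, 3))},
}

# Hypothetical public ratings for one person, e.g. read off a public profile.
PUBLIC = {"A": (5, date(2005, 3, 3)), "B": (1, date(2005, 3, 5)), "C": (4, date(2005, 6, 20))}


def match_score(record, known, rating_tol=1, day_tol=14):
    """Count known ratings that the record matches within the given tolerances."""
    score = 0
    for movie, (rating, when) in known.items():
        if movie not in record:
            continue  # the two services need not cover the same movies
        rec_rating, rec_when = record[movie]
        close_rating = abs(rec_rating - rating) <= rating_tol
        close_date = abs((rec_when - when).days) <= day_tol
        if close_rating and close_date:
            score += 1
    return score


def best_candidate(records, known, min_gap=2):
    """Return the top-scoring record ID, or None if no record stands out.

    Assumes at least two records. min_gap is an arbitrary illustrative
    threshold on the lead of the best record over the runner-up.
    """
    scores = sorted(
        ((match_score(rec, known), rid) for rid, rec in records.items()),
        reverse=True,
    )
    (top, top_id), (second, _) = scores[0], scores[1]
    return top_id if top - second >= min_gap else None


print(best_candidate(ANONYMIZED, PUBLIC))
```

With this toy data, three overlapping movies are enough to separate one record from the others, while a single known rating is not, because the best record no longer leads the runner-up by the required margin.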

“A natural question to ask is why would someone who rates movies on IMDb – often under his or her real name – care about privacy of his movie ratings?” the researchers asked.

“Consider the information that we have been able to deduce by locating one of these users
