November 30, 2007 8:30 AM PST

AOL, Netflix and the end of open access to research data

by Chris Soghoian

Correction: The authors of the Netflix de-anonymization study contacted me to point out that they originally published a draft of their results a mere two weeks after Netflix released its dataset. Netflix has known about their study for over a year.

Over the past year, there have been a number of high-profile incidents in which sensitive user data was accidentally revealed to the Internet at large. As a result, I believe that high-tech companies will never again share anonymized data on their users with academic researchers, at least not without requiring contracts and nondisclosure agreements. For the users and privacy advocates, this is probably a good thing. However, for researchers, the scientific community, and Internet users who want cool new technologies, this is almost certainly a change for the worse.

In 2006, Netflix released over 100 million movie ratings made by 500,000 subscribers to its online DVD rental service. The company then offered $1 million to anyone who could improve its DVD recommendation system. In order to protect its customers' privacy, Netflix anonymized the data set by removing any personal details.

Researchers announced this week that they were able to de-anonymize the data, by comparing the Netflix data against publicly available ratings on the Internet Movie Database (IMDB). Whoops.

For Internet privacy geeks, this Netflix incident is just another version of an all-too-familiar tale: A well-meaning company releases a large data set of user data, which it has scrubbed to remove any identifying information. Armed with this data set, researchers are able to trace backwards, and match names to the profiles and their online behavior.
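To make the "trace backwards" step concrete, here is a toy sketch of how such a linkage can work. It is a simplified illustration on made-up data, not the researchers' actual algorithm: the names `match_score` and `link_profiles`, the `min_score` threshold, and every rating below are assumptions invented for this example.

```python
def match_score(anon_ratings, public_ratings):
    """Count movies that both profiles rated with the same number of stars."""
    return sum(
        1
        for movie, stars in public_ratings.items()
        if anon_ratings.get(movie) == stars
    )


def link_profiles(anon_data, public_data, min_score=3):
    """Map each public name to the anonymous ID that best matches it.

    A link is reported only if the best candidate agrees on at least
    `min_score` ratings and strictly beats the runner-up.
    """
    links = {}
    for name, public_ratings in public_data.items():
        scored = sorted(
            ((match_score(r, public_ratings), anon_id) for anon_id, r in anon_data.items()),
            reverse=True,
        )
        if not scored:
            continue  # nothing to compare against
        best_score, best_id = scored[0]
        runner_up = scored[1][0] if len(scored) > 1 else 0
        if best_score >= min_score and best_score > runner_up:
            links[name] = best_id
    return links


# Made-up "anonymized" release: numeric IDs instead of names.
anon_data = {
    1001: {"Movie A": 5, "Movie B": 1, "Movie C": 4, "Movie D": 2, "Movie E": 5},
    1002: {"Movie A": 3, "Movie B": 4, "Movie F": 5},
    1003: {"Movie C": 4, "Movie G": 1},
}

# Made-up public reviews posted under a recognizable name.
public_data = {
    "jane_doe": {"Movie A": 5, "Movie B": 1, "Movie C": 4},
}

links = link_profiles(anon_data, public_data)
for name, anon_id in links.items():
    # Ratings that appear only in the "anonymized" record.
    private = set(anon_data[anon_id]) - set(public_data[name])
    print(name, "->", anon_id, "also rated:", sorted(private))
```

In this toy data, three matching public ratings are enough to single out one anonymous record. The last line is the part that matters: once the link is made, the movies the person never rated publicly are attached to a name as well.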

The same thing happened back in 2006 when AOL released the search records of 500,000 of its users. Within days of the database's release, journalists from the New York Times had revealed the identity of user number 4417749 to be Thelma Arnold, a 62-year-old widow from Lilburn, Ga. Over 300 of the woman's searches were traced back to her, ranging from "60 single men" to "dog that urinates on everything."

The fallout from the AOL incident was devastating, both for the company and the industry as a whole. The CTO of the company and the researchers responsible for sharing the data were all fired. In addition to pulling the data set, the entire Web presence for AOL's research division was taken offline. More than a year later, the AOL Research group still does not have a working homepage.

The shockwaves spread to the entire search engine industry. Google's CEO Eric Schmidt spoke to journalists shortly after AOL posted the data. After calling the data release "a terrible thing," he assured the public that "this kind of thing could not happen at Google."

The end result was that no search engine would ever again release anonymized log data to the research community.

The announcement by researchers of their Netflix project is so recent that it has yet to be seen how the company will respond. The data has been public for over a year, and with a $1 million prize attached, the release almost certainly required sign-off from executives (so the company cannot blame rogue researchers as AOL did). While search engine logs are obviously extremely sensitive, video rental records are also very private. Enough so that Congress has given video rental records a higher level of protection than almost any other form of personal data (this was prompted by the worry that the politicians' own rental records could be published by journalists).

Companies do not make money by giving researchers access to data. They do it to promote and encourage research in the field. Based on the AOL and Netflix incidents, I suspect that we will see a major chill hit the industry. No high-tech company with large amounts of user data will ever again risk making it available to researchers without first requiring them to sign a lengthy contract. The risk of the data being de-anonymized (and the resulting public relations and legal trouble) is simply not worth it.

So, what if companies require researchers to sign agreements before the firms hand over anonymized user data? Isn't that a good way to protect users, yet still enable researchers to do their thing? Unfortunately, research is rarely respected by the community when the data comes with strings. It is for good reasons that people are dubious when drug companies sponsor research into the safety of one of their drugs. When a company holds the keys to the data, it can stop the publication of anything that will make it look bad.

As a privacy advocate and end user, I think the shift against sharing anonymized data is probably a good thing. After all, I don't want some random student browsing through my search history, anonymized or not. However, if I take the end-user hat off, and put on my PhD student hat, then this is a really bad thing. Researchers depend on accurate data in order to do their work. Without the data, we don't get new exciting research, and thus no new cool technologies. For the research community, this Netflix incident will be the final nail in the coffin of information sharing from the dot-coms.

Christopher Soghoian delves into the areas of security, privacy, technology policy and cyber-law. He is a student fellow at Harvard University's Berkman Center for Internet and Society, and is a PhD candidate at Indiana University's School of Informatics. His academic work and contact information can be found by visiting www.dubfire.net/chris/. He is a member of the CNET Blog Network, and is not an employee of CNET.
(6 Comments)
by jewforjesus November 30, 2007 1:34 PM PST
Researchers announced this week that they were able to de-anonymize the data, by comparing the Netflix data against publicly available ratings on the Internet Movie Database (IMDB). Whoops.

No they didn't. They matched newly released anonymous information to PUBLICLY AVAILABLE movie ratings attached only to a pseudonym. Whoop-de-doo. Nothing at all was learned here.
by jewforjesus November 30, 2007 1:36 PM PST
To clarify, this is what happened:

1. Someone submits public ratings of over 100 movies on a website.
2. Netflix makes available a list of over 100 movies.
3. The list is linked to the public ratings.

Was any privacy lost? NO! We have learned NOTHING. We already KNEW the guy saw these movies because he RATED THEM ON A PUBLIC WEBSITE.
by DBA687 November 30, 2007 3:20 PM PST
Privacy is lost.

1. Someone submits public ratings of over 100 movies on a website.
2. Netflix makes available a list of over 250 movies.
3. The list is linked to the public ratings.

The user has lost privacy with regard to the 150 movies that appear in the Netflix data but not on the public reviews website. Some reviewers' real names are known on public movie review sites, and it's a real loss of privacy for those people.
by dpeelmd December 1, 2007 9:41 AM PST
This story shows that seemingly harmless anonymized commercial information can be easily re-identified to build very damaging political, sexual, and even psychological profiles of Netflix users. Netflix released over 100 million movie ratings made by 500,000 subscribers.

What if your future employer used data from Netflix and other sources to create not just a voting and sexual profile, but a profile of your risk for expensive diseases?

Narayanan and Shmatikov showed us what they learned about one Netflix user: "First, we can immediately find his political orientation based on his strong opinions about 'Power and Terror: Noam Chomsky in Our Times' and 'Fahrenheit 9/11.' Strong guesses about his religious views can be made based on his ratings on 'Jesus of Nazareth' and 'The Gospel of John'. He did not like 'Super Size Me' at all; perhaps this implies something about his physical size? Both items that we found with predominantly gay themes, 'Bent' and 'Queer as folk' were rated one star out of five. He is a cultish follower of 'Mystery Science Theater 3000'. This is far from all we found about this one person, but having made our point, we will spare the reader further lurid details."

What does Narayanan and Shmatikov's re-identification research mean for the nation's treasure trove of health data? Anonymized or de-identified health records are clearly not safe either. Electronic health records contain far more details than Netflix movie ratings, making them even easier to re-identify.

Today Americans have no control over ANY electronic prescription, genetic, or health records. Employers, insurers, banks, and schools can all data mine our health records without consent.

The health data mining industry is huge and extremely lucrative. Two examples:

1) BCBS's Blue Health Initiative sells data on all 79 million enrollees to help large employers lower costs.

2) IMS Health, a prescription data miner, reported revenues of $1.75 billion in 2005 selling supposedly de-identified prescription records.

Tell Congress to restore your right to control your personal health information. A good place to start is to end prescription data mining. Sign our petition now at: www.patientprivacyrights.org/site/PageServer?pagename=Prescription_Privacy_Video

Only Congress can restore our privacy rights in the Digital Age. Americans should have the right to control access to personal health records and the right to control access to electronic financial and commercial information too, including control over access to our Netflix movie ratings.

Why should Netflix be able to reveal anyone's movie ratings for any reason without consent?

Deborah C. Peel, MD
www.patientprivacyrights.org
by randomwalker December 1, 2007 12:04 PM PST
There is much misconception going around about our paper on the Netflix prize dataset, which is why we have released an FAQ about our de-anonymization algorithm and results:

http://www.cs.utexas.edu/~shmat/netflix-faq.html

Also, note that this work is about a year old -- we released the first version of the paper just two weeks after the data was released; it seems to have somehow hit the media recently.

--Arvind Narayanan
by FO-FI_FO_454 December 4, 2007 2:54 PM PST
SUBJECT: 2 Movies - "ALL THE KINGS MEN" and "THE MAN WHO WOULD BE KING" - by the B-d Light FROGS, let's give them names, "Louie and Benny."

Was tuned into the big game last night, with a leading Satellite Radio provider, Patriots 27 Ravens 24, 14 seconds to go - surrounded by AM/FM back-up, CB Radio back-up, Desktop Shortwave back-up (for my grunting and grumbling contacts in the UK), FREE TV on for news alerts, wireless device tuned into my son's number 3,000 miles away, 14 seconds to go for a final, and guess what? The Satellite Radio link simply vanished - NADA - went away - NOTHING.

Had to get down on that game last night, so I contacted "Nathan Detroit" of Guys and Dolls fame - he sent over "Nicely Nicely" of the same gig, and we passionately were awaiting the final outcome - NOTHING - with a clear view of the sky, all areas of the compass - gone - nothing. I thought it was a flock of birds - NOPE - the signal just vanished.

"Nicely Nicely" said "It's a trick - somebody is trying to fix the game and prevent the world from knowing the outcome." I looked at him and laughed, and said "It's probably a Patriots ACT (pun intended) test to see if all communications in a given area can be shut down." I walked 4 feet to my front door, looked up upon a clear sky and then tried to make a few profane gestures at an invisible satellite - couldn't see any, but, it COULD SEE ME. Just to make sure this never happens again, I wrote my name on a sheet of bond paper along with my social security number and nailed it to the front of my door. After all, I've got nothing to hide from Big Brother. Then, I performed a GOOGLE EARTH SEARCH for my address and lo and behold, I was ON TOP OF MY BUILDING ON THE INTERNET - however, I couldn't see the sign I had just nailed to the front door.

"Nicely Nicely" left, I was alone, then watched a rerun of Gene Hackman in the flick "The Conversation" - interesting movie, deals with Big Brother long before the Patriot ACT. If you haven't seen it buy or rent it - well worthwhile.

I looked around my pad, saw all of this electronic and computer equipment, then popped open a new CD in which there was a security monitoring strip, you know, about 2 inches by 1/4 inch, self adhesive, magnetic tape inside which blows the whistle on you in case you want to be nailed as a THIEF.

I took this security strip in my hand and tried to imagine how many different places it could be installed if indeed Big Brother was monitoring me and I laughed, I had thought of something very weird. I wanted to insert it into a body cavity but, I was out of jelly.

This is what I imagined, 2 FROGS, Louie and Benny, stretched out on deck chairs somewhere living LARGE, smoking stogies, not a care in the world, and then a conversation begins:
CHARADE
Louie: Benny, wake up, WAKE UP, see the headlines.
Benny: What time is it - what do you want?
Louie: Did you hear about the Data Discs lost in England?
Benny: Yeah, that's peanuts.
Louie: Did we have anything to do with that?
Benny: NO - that's peanuts...TOLD YOU - PEANUTS!
Louie: Pass over a Bud - thanks.
Benny: Louie, after next week, we're going to be KINGS OF THE WORLD - told you, it's all going to happen as planned.
Louie: What's left to uplink?
Benny: Australia and New Zealand, then, we're done.
Louie: Are our demands still the same Benny - are we asking for too much, after all, it's just the 2 of us.
Benny: Louie - next week, when we pull the plug, nobody is going to be able to speak anything other than FROG - we will shut down all communications, worldwide, all data input, and output, and then, the rest of the world will understand how important it is not to put all of their eggs into one basket.
END OF CHARADE

CROAK, CROAK, CROAK - thank you for this opportunity to CROAK BACK - the FROGS.