News Blog

Read all 'search history' posts in News Blog
November 30, 2007 8:30 AM PST

AOL, Netflix and the end of open access to research data

by Chris Soghoian

Correction: The authors of the Netflix de-anonymization study contacted me to point out that they originally published a draft of their results a mere two weeks after Netflix released its dataset. Netflix has known about their study for over a year.

Over the past year, there have been a number of high-profile incidents in which sensitive user data was accidentally revealed to the Internet at large. As a result, I believe that high-tech companies will never again share anonymized data on their users with academic researchers, at least not without requiring contracts and nondisclosure agreements. For the users and privacy advocates, this is probably a good thing. However, for researchers, the scientific community, and Internet users who want cool new technologies, this is almost certainly a change for the worse.

Netflix

(Credit: Flickr / thebluedino)

In 2006, Netflix released over 100 million movie ratings made by 500,000 subscribers to their online DVD rental service. The company then offered $1 million to anyone who could improve the company's system of DVD recommendation. In order to protect its customers' privacy, Netflix anonymized the data set by removing any personal details.

Researchers announced this week that they were able to de-anonymize the data, by comparing the Netflix data against publicly available ratings on the Internet Movie Database (IMDB). Whoops.
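To make the idea of comparing two ratings data sets concrete, here is a toy sketch of that kind of matching. It is not the researchers' actual method, which the post does not describe. The function names, movie titles, ratings, subscriber numbers, and the `min_score` threshold are all invented for illustration, and the sketch assumes a match means the same title with the same star rating.

```python
# Toy linkage attack: all names, titles, and ratings below are invented.

def overlap_score(public_ratings, anonymous_ratings):
    """Count movies rated identically in both profiles."""
    return sum(
        1
        for movie, stars in public_ratings.items()
        if anonymous_ratings.get(movie) == stars
    )


def link_profile(public_ratings, anonymized, min_score=3):
    """Return the anonymous ID whose ratings best match the public profile.

    Returns None when the best candidate falls below min_score or ties
    with another candidate, since either case is too ambiguous to call.
    """
    scores = sorted(
        ((overlap_score(public_ratings, ratings), anon_id)
         for anon_id, ratings in anonymized.items()),
        reverse=True,
    )
    if not scores or scores[0][0] < min_score:
        return None
    if len(scores) > 1 and scores[1][0] == scores[0][0]:
        return None
    return scores[0][1]


# "Anonymized" release: subscriber names replaced by numbers.
anonymized = {
    101: {"Movie A": 5, "Movie B": 2, "Movie C": 4, "Movie D": 1},
    102: {"Movie A": 3, "Movie B": 2, "Movie E": 5},
    103: {"Movie C": 4, "Movie F": 2},
}

# Ratings one person posted publicly under a real name.
public_profile = {"Movie A": 5, "Movie C": 4, "Movie D": 1}

print(link_profile(public_profile, anonymized))
```

Even in this tiny example, three matching ratings are enough to single out one numbered record. Removing the names does not help when the ratings themselves act as a fingerprint.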

For Internet privacy geeks, this Netflix incident is just another version of an all-too-familiar tale: A well-meaning company releases a large data set of user data, which it has scrubbed to remove any identifying information. Armed with this data set, researchers are able to trace backwards, and match names to the profiles and their online behavior.

The same thing happened back in 2006 when AOL released the search records of 500,000 of its users. Within days of the database's release, journalists from the New York Times had revealed the identity of user number 4417749 to be Thelma Arnold, a 62-year-old widow from Lilburn, Ga. Over 300 of the woman's searches were traced back to her, ranging from "60 single men" to "dog that urinates on everything."

The fallout from the AOL incident was devastating, both for the company and the industry as a whole. The CTO of the company and the researchers responsible for sharing the data were all fired. In addition to pulling the data set, AOL took the entire Web presence for its research division offline. More than a year later, the AOL Research group still does not have a working homepage.

The shockwaves spread to the entire search engine industry. Google's CEO Eric Schmidt spoke to journalists shortly after AOL posted the data. After calling the data release "a terrible thing," he assured the public that "this kind of thing could not happen at Google."

The end result was that no search engine would ever again release anonymized log data to the research community.

Big Brother

(Credit: Flickr / surfstyle)

The announcement by researchers of their Netflix project is so recent that it remains to be seen how the company will respond. The data has been public for over a year, and with a $1 million prize attached, the release almost certainly required sign-off from executives (so the company cannot blame rogue researchers, as AOL did). While search engine logs are obviously extremely sensitive, video rental records are also very private, so much so that Congress has given video rental records a higher level of protection than almost any other form of personal data (this was prompted by the worry that the politicians' own rental records could be published by journalists).

Companies do not make money by giving researchers access to data. They do it to promote and encourage research in the field. Based on the AOL and Netflix incidents, I suspect that we will see a major chill hit the industry. No high-tech company with large amounts of user data will ever again risk making it available to researchers without first requiring them to sign a lengthy contract. The risk of the data being de-anonymized (and the resulting public relations and legal trouble) is simply not worth it.

So, what if companies require researchers to sign agreements before the firms hand over anonymized user data? Isn't that a good way to protect users, yet still enable researchers to do their thing? Unfortunately, research is rarely respected by the community when the data comes with strings attached. It is for good reason that people are dubious when drug companies sponsor research into the safety of one of their own drugs. When a company holds the keys to the data, it can stop the publication of anything that will make it look bad.

As a privacy advocate and end user, I think the shift against sharing anonymized data is probably a good thing. After all, I don't want some random student browsing through my search history, anonymized or not. However, if I take the end-user hat off, and put on my PhD student hat, then this is a really bad thing. Researchers depend on accurate data in order to do their work. Without the data, we don't get new exciting research, and thus no new cool technologies. For the research community, this Netflix incident will be the final nail in the coffin of information sharing from the dot-coms.

Originally posted at Surveillance State
April 20, 2007 3:38 PM PDT

Google broadens, renames Search History

by Elinor Mills

Google has renamed its "Search History" service "Web History" and broadened its coverage. Previously, the service would record your Google searches. Now, Web History can associate the web pages you visit with your Google Account. Web History keeps a list of the times and links to the web pages viewed and searches conducted. Users have to be signed in to their Google account and need to have the Google Toolbar installed with PageRank enabled.

April 20, 2007 6:39 AM PDT

Your Web history, courtesy of Google

by Margaret Kane

Google's announced acquisition of DoubleClick has raised considerable concern among privacy advocates, who argue that combining the search engine giant with a major online advertising firm puts too much information in the hands of one company.

The launch of Google's new Web History product should send those fears into overdrive.

The new service allows you to search and view your entire online life, including which pages you visited and when. Google will also analyze your online travels, revealing which sites you visit most frequently and what your top searches are.

The data is available only when you log on with your Google account and password, and Google does have a feature that lets you remove items or turn off the service. The tool itself can be extremely useful, both to users and to developers. But many bloggers looked askance at a tool that lays right out in the open the fact that Google knows just about everything you see and everything you do online.

Blog community response:

"Yes, that is truly amazing, if it works, and is a feature that could make one overlook all of the creepiness of being shown the reality of everything Google knows about you when you use one service for searching, mapping, comparing products, sending email, and then, embed a tool of theirs in your web browser."
--Rex Hammock's Weblog

"Outside of the world of users who gawk at every shiny new thing on the web, though, this is going to give people the heebie-jeebies in a way that we're probably only used to getting from Microsoft. In fact, it's probably safe to say that no other major web company could release this product today; The backlash from the user community of players like Microsoft, Yahoo, or AOL would simply be too strong."
--Anil Dash

"Should you be concerned? Of course. Everyone should be concerned about their private data. Everyone should really think about what is being logged and how it is being used. But we also make tradeoffs. We want certain things from companies, and to get them, we have to give up some of our privacy often trusting it will be protected."
--Search Engine Land
