Google announced on Monday that the company will be reducing the amount of time that it will keep sensitive, identifying log data on its search engine customers. To the naive reader, the announcement seems like a clear win for privacy. However, with a bit of careful analysis, it's possible to see that this is little more than snake oil, designed to look good for the newspapers, without delivering real benefits to end users.
In a post to the company blog on Monday, the company announced that it will be significantly reducing the amount of time that it hangs onto identifying user data in its Web server logs:
Today, we're announcing a new logs retention policy: we'll anonymize IP addresses on our server logs after 9 months. We're significantly shortening our previous 18-month retention policy to address regulatory concerns and to take another step to improve privacy for our users.
Hidden further down in the blog post were a few more details:
We haven't sorted out all of the implementation details, and we may not be able to use precisely the same methods for anonymizing as we do after 18 months, but we are committed to making it work.
Google's announcement was extremely light on details, particularly on how the company planned to anonymize the records after 9 months. I contacted Google to find out more, and received an extremely interesting reply:
After nine months, we will change some of the bits in the IP address in the logs; after 18 months we remove the last eight bits in the IP address and change the cookie information. We're still developing the precise technical methods and approach to this, but we believe these changes will be a significant addition to protecting user privacy.... It is difficult to guarantee complete anonymization, but we believe these changes will make it very unlikely users could be identified.... We hope to be able to add the 9-month anonymization process to our existing 18-month process by early 2009, or even earlier.
To understand what this means (and how useless the new privacy "enhancements" are), consider the following:
When a user conducts a search using Google's search engine, the company stores three main types of information in a log file: the user's IP address (which is a unique network address given to her computer by her Internet service provider), the words that she searched for, and her cookie identifier (a unique value given to every Web-browser that visits a Google Web-property).
As per Google's existing policy, after 18 months Google "anonymizes" the IP address and cookie information from its logfiles. While the company hasn't said how it de-identifies the cookies, it has revealed in public statements that its IP anonymization technique consists of chopping off the last 8 bits of a user's IP address.
As an example, an IP address of a home user could be 173.192.103.121. After 18 months, Google chops this down to 173.192.103.XXX.
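For the technically inclined, here is a minimal sketch of what "chopping off the last 8 bits" amounts to. Google has not published its code, so the function name `mask_last_bits` and the choice to zero the masked bits (rather than replace them with something else) are my own assumptions, for illustration only.

```python
import ipaddress

def mask_last_bits(ip: str, bits: int = 8) -> str:
    """Zero the lowest `bits` bits of an IPv4 address (illustrative only)."""
    addr = int(ipaddress.IPv4Address(ip))
    # Build a 32-bit mask whose lowest `bits` bits are zero.
    mask = (0xFFFFFFFF << bits) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(addr & mask))

# The example address from above, with its final octet removed.
print(mask_last_bits("173.192.103.121"))
```

Everything except the final octet survives, which is why the masked address still narrows a user down to one small block of the network.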
Since each octet (the number between each pair of periods in an IP address) can take any value from 0 to 255, Google's anonymization technique allows a user, at most, to hide among 255 other addresses. In comparison, Microsoft deletes the cookies, the full IP address and any other identifiable user information from its search logs after 18 months.
Google has now revealed that it will change "some" of the bits of the IP address after 9 months, but fewer than the eight bits that it masks after the full 18 months. Thus, instead of Google's customers being able to hide among 255 other Internet users, perhaps they'll be able to hide among 63 or 127 other possible IP addresses.
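The arithmetic behind the size of the crowd is simple enough to sketch. Masking n bits leaves 2^n candidate addresses, counting every value the masked bits can take, so a user shares the masked address with at most 2^n - 1 others. The helper name `others_in_crowd` is mine, and the 6- and 7-bit cases are guesses, since Google has not said how many bits it will change at nine months.

```python
def others_in_crowd(masked_bits: int) -> int:
    """Upper bound on how many other addresses share the same masked value."""
    return 2 ** masked_bits - 1

# 8 bits is the existing 18-month policy; 6 and 7 are hypothetical 9-month values.
for bits in (6, 7, 8):
    print(bits, "bits masked:", others_in_crowd(bits), "other addresses at most")
```

This is an upper bound. Not every address in a block is actually assigned to an active user, so the real crowd is likely to be smaller.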
By itself, this is a laughable level of anonymity. However, it gets worse.
First, remember that Google will not delete or anonymize user cookies from the logs when it slightly smudges IP addresses after nine months. Second, remember that as long as you use a Google Web property at least once every two years, the company will maintain a unique identifiable cookie value within your Web browser.
Thus, consider the following scenario:
In June 2008, a user from 173.192.103.121 with cookie value 12345 conducts a search for "breast cancer risks." Nine months later, in March 2009, the company scrubs some portion of the IP address, perhaps to 173.192.103.1XX. However, the cookie remains in the log.
In April 2009, that same user returns to Google, and conducts a search for "stephen colbert youtube videos," again from the same IP and the same cookie value 12345.
Even though the 9-month-old search logs have been "anonymized", because the cookie values remain, it is trivial to match the newer search results to the older searches, and thus completely reverse the anonymization process.
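To show just how trivial the matching is, here is a rough sketch of the scenario above. The record layout, the field names and the `scrubbed` flag are all hypothetical (Google's real log format is not public); the only point is that a shared cookie value is enough to tie a scrubbed record back to a full IP address.

```python
# Hypothetical log records modeled on the scenario above; field names are made up.
logs = [
    {"date": "2008-06", "ip": "173.192.103.1XX", "cookie": "12345",
     "query": "breast cancer risks", "scrubbed": True},
    {"date": "2009-04", "ip": "173.192.103.121", "cookie": "12345",
     "query": "stephen colbert youtube videos", "scrubbed": False},
]

def relink_by_cookie(records):
    """Pair each scrubbed query with a full IP seen later under the same cookie."""
    # Map each cookie to a full IP taken from any record that is not yet scrubbed.
    full_ip = {r["cookie"]: r["ip"] for r in records if not r["scrubbed"]}
    return [(r["query"], full_ip[r["cookie"]])
            for r in records
            if r["scrubbed"] and r["cookie"] in full_ip]

print(relink_by_cookie(logs))
```

One dictionary lookup per record is all it takes, no matter how many bits of the old IP address were changed.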
The simple truth is that any IP anonymization technique, no matter how strong or weak, is simply a waste of time, if cookie values are not also anonymized.
Unfortunately, Google is relying on the fact that the mainstream media (I'm looking at you, New York Times and Washington Post), and seemingly most of the technology press, are clueless on these issues. Google's new anonymization policy is totally worthless, and the company deserves to be called out for its deception.
Disclaimer: I interned at Google during the summer of 2006 and received a $5,000 Google fellowship in both 2006 and 2007. I have also interned or worked for both the Electronic Privacy Information Center (EPIC) and the American Civil Liberties Union (ACLU) of Northern California, public interest groups that have been extremely critical of Google's privacy policies.
Correction: The authors of the Netflix de-anonymization study contacted me to point out that they originally published a draft of their results a mere two weeks after Netflix released its dataset. Netflix has known about their study for over a year.
Over the past year, there have been a number of high-profile incidents in which sensitive user data was accidentally revealed to the Internet at large. As a result, I believe that high-tech companies will never again share anonymized data on their users with academic researchers, at least not without requiring contracts and nondisclosure agreements. For the users and privacy advocates, this is probably a good thing. However, for researchers, the scientific community, and Internet users who want cool new technologies, this is almost certainly a change for the worse.
In 2006, Netflix released over 100 million movie ratings made by 500,000 subscribers to their online DVD rental service. The company then offered $1 million to anyone who could improve the company's system of DVD recommendation. In order to protect its customers' privacy, Netflix anonymized the data set by removing any personal details.
Researchers announced this week that they were able to de-anonymize the data, by comparing the Netflix data against publicly available ratings on the Internet Movie Database (IMDB). Whoops.
For Internet privacy geeks, this Netflix incident is just another version of an all-too-familiar tale: A well-meaning company releases a large data set of user data, which it has scrubbed to remove any identifying information. Armed with this data set, researchers are able to trace backwards, and match names to the profiles and their online behavior.
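The general idea of matching an anonymized profile against public data can be sketched in a few lines. To be clear, this is a toy illustration and not the researchers' actual method: the movie titles, ratings, subscriber IDs and the simple "count identical ratings" score are all invented for the example.

```python
# Invented data: anonymized subscriber IDs mapped to {movie: stars}.
anonymized = {
    1001: {"Movie A": 5, "Movie B": 2, "Movie C": 4},
    1002: {"Movie A": 3, "Movie D": 1},
}

# Invented ratings that a named person posted publicly elsewhere.
public_ratings = {"Movie A": 5, "Movie C": 4}

def best_match(public, dataset, min_overlap=2):
    """Return the subscriber ID whose ratings agree most with the public ones."""
    best_id, best_score = None, 0
    for subscriber_id, ratings in dataset.items():
        # Count the movies rated identically in both sources.
        score = sum(1 for movie, stars in public.items()
                    if ratings.get(movie) == stars)
        if score > best_score:
            best_id, best_score = subscriber_id, score
    # Refuse to guess when the overlap is too small.
    return best_id if best_score >= min_overlap else None

print(best_match(public_ratings, anonymized))
```

Removing names from the data set does nothing to stop this kind of matching, because the ratings themselves act as the fingerprint.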
The same thing happened back in 2006 when AOL released the search records of 500,000 of its users. Within days of the database's release, journalists from the New York Times had revealed the identity of user number 4417749 to be Thelma Arnold, a 62-year-old widow from Lilburn, Ga. Over 300 of the woman's searches were traced back to her, ranging from "60 single men" to "dog that urinates on everything."
The fallout from the AOL incident was devastating, both for the company and the industry as a whole. The CTO of the company and the researchers responsible for sharing the data were all fired. In addition to pulling the data set, AOL took the entire Web presence for its research division offline. More than a year later, the AOL Research group still does not have a working homepage.
The shockwaves spread to the entire search engine industry. Google's CEO Eric Schmidt spoke to journalists shortly after AOL posted the data. After calling the data release "a terrible thing," he assured the public that "this kind of thing could not happen at Google."
The end result was that no search engine would ever again release anonymized log data to the research community.
The announcement by researchers of their Netflix project is so recent that it has yet to be seen how the company will respond. The data has been public for over a year, and with a $1 million prize attached, the release almost certainly required sign-off from executives (so the company cannot blame rogue researchers, as AOL did). While search engine logs are obviously extremely sensitive, video rental records are also very private, enough so that Congress has given video rental records a higher level of protection than almost any other form of personal data (this was prompted by the worry that the politicians' own rental records could be published by journalists).
Companies do not make money by giving researchers access to data. They do it to promote and encourage research in the field. Based on the AOL and Netflix incidents, I suspect that we will see a major chill hit the industry. No high-tech company with large amounts of user data will ever again risk making it available to researchers without first requiring them to sign a lengthy contract. The risk of the data being de-anonymized (and the resulting public relations and legal trouble) is simply not worth it.
So, what if companies require researchers to sign agreements before the firms hand over anonymized user data? Isn't that a good way to protect users, yet still enable researchers to do their thing? Unfortunately, research is rarely respected by the community when the data comes with strings. It is for good reasons that people are dubious when drug companies sponsor research into the safety of one of their drugs. When a company holds the keys to the data, it can stop the publication of anything that would make it look bad.
As a privacy advocate and end user, I think the shift against sharing anonymized data is probably a good thing. After all, I don't want some random student browsing through my search history, anonymized or not. However, if I take the end-user hat off, and put on my PhD student hat, then this is a really bad thing. Researchers depend on accurate data in order to do their work. Without the data, we don't get new exciting research, and thus no new cool technologies. For the research community, this Netflix incident will be the final nail in the coffin of information sharing from the dot-coms.
In a recent blog posting, a German operator of a Tor anonymous proxy server revealed that he was arrested by German police officers at the end of July. Although he was released shortly afterwards, information about the arrest had been kept quiet until his lawyers were able to get the charges dropped.
Tor is a privacy tool designed to allow users to communicate and browse anonymously on the Internet. It's endorsed by the Electronic Frontier Foundation and other civil liberties groups as a method for whistleblowers and human rights workers to communicate with journalists. Tor provides anonymous Web-browsing software to hundreds of thousands of users around the world, according to its developers. The largest numbers of users are in the United States, the European Union and China.
The police were investigating a bomb threat posted to an online forum for German police officers. They traced one of the objectionable posts on the forum to the IP address of a Tor server run by Alex Janssen. Up until his arrest, Janssen's Tor server carried more than 40GB of random strangers' Internet traffic each day.
Showing up at his house at midnight on a Sunday night, police cuffed and arrested him in front of his wife and seized his equipment. In a display of both bitter irony and incompetence, the police did not take or shut down the Tor server responsible for the traffic they were interested in, which was located in a different city, more than 500km away.
Janssen's attempts to explain what Tor is to the police officers initially fell on deaf ears. After he had been interrogated for hours, someone from the city of Düsseldorf's equivalent of the Department of Homeland Security showed up and admitted to Janssen that they'd made a mistake. He was released shortly after.
Germany is clearly not going out of its way to make computer security researchers and activists feel too welcome. Germany recently passed a law that "renders the creation and distribution of software illegal that could be used by someone to break into a computer system or could be used to prepare a break in. This includes port scanners like nmap, security scanners like nessus [as well as] proof of concept exploits."
Back in summer 2006, German authorities conducted a simultaneous raid of seven different data centers, seizing 10 Tor servers in the process. Agents took the servers believing them to be related to a child porn investigation. Furthermore, in 2003 a German court ordered the developers of the JAP anonymity system, a completely different project from Tor, to create a back door in their system to be used in national security investigations.
This event does raise some interesting legal questions. If 40GB of other people's Internet traffic flows through your own home network, can authorities, be they the RIAA or FBI, reasonably link anything that has been tracked to your computer's IP address to you?
Does setting up a Tor server give you the ultimate plausible deniability card? "No officer, that BitTorrent download wasn't mine. It was from one of the thousands of people who route their Internet traffic through the anonymizing server on my home network."
A believable claim to plausible deniability is something that some of us have been attempting to get for a while by having an open wireless access point at home. And 40GB of Internet traffic from perfect strangers may be more significant in the eyes of a court than the possibility of one or two of your neighbors connecting to your wireless network. All of this, for now, remains theoretical. No Tor-related case has made it to the courts, but it's just a matter of time until one does.





