Google announced on Monday that it will reduce the length of time it keeps sensitive, identifying log data on its search engine customers. To the naive reader, the announcement seems like a clear win for privacy. With a bit of careful analysis, however, it becomes clear that this is little more than snake oil, designed to look good in the newspapers without delivering real benefits to end users.
In a post to the company blog, Google said that it will significantly reduce the amount of time that it hangs onto identifying user data in its Web server logs:
Today, we're announcing a new logs retention policy: we'll anonymize IP addresses on our server logs after 9 months. We're significantly shortening our previous 18-month retention policy to address regulatory concerns and to take another step to improve privacy for our users.
Hidden further down in the blog post were a few more details:
We haven't sorted out all of the implementation details, and we may not be able to use precisely the same methods for anonymizing as we do after 18 months, but we are committed to making it work.
Google's announcement was extremely light on details, specifically on how the company planned to anonymize the records after 9 months. I contacted Google to find out more and received an extremely interesting reply:
After nine months, we will change some of the bits in the IP address in the logs; after 18 months we remove the last eight bits in the IP address and change the cookie information. We're still developing the precise technical methods and approach to this, but we believe these changes will be a significant addition to protecting user privacy.... It is difficult to guarantee complete anonymization, but we believe these changes will make it very unlikely users could be identified.... We hope to be able to add the 9-month anonymization process to our existing 18-month process by early 2009, or even earlier.
To understand what this means (and how useless the new privacy "enhancements" are), consider the following:
When a user conducts a search using Google's search engine, the company stores three main types of information in a log file: the user's IP address (a unique network address given to her computer by her Internet service provider), the words that she searched for, and her cookie identifier (a unique value given to every Web browser that visits a Google Web property).
Under Google's existing policy, the company "anonymizes" the IP address and cookie information in its log files after 18 months. While the company hasn't said how it de-identifies the cookies, it has revealed in public statements that its IP anonymization technique consists of chopping off the last 8 bits of a user's IP address.
As an example, an IP address of a home user could be 173.192.103.121. After 18 months, Google chops this down to 173.192.103.XXX.
Since each octet (the number between the periods in an IP address) can contain values from 0 to 255, Google's anonymization technique allows a user to hide among, at most, 255 other computers. In comparison, Microsoft deletes the cookies, the full IP address and any other identifiable user information from its search logs after 18 months.
Google has now revealed that it will change "some" of the bits of the IP address after 9 months, but fewer than the eight bits that it masks after the full 18 months. Thus, instead of being able to hide among 255 other Internet users, Google's customers will perhaps be able to hide among 63 or 127 other possible IP addresses.
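To make the arithmetic concrete, here is a minimal sketch of bit-masking an IPv4 address. It assumes that "changing" bits means zeroing the lowest bits, which Google has not confirmed for the nine-month step; the function names are my own and purely illustrative.

```python
import ipaddress

def mask_low_bits(ip: str, bits: int) -> str:
    """Zero the lowest `bits` bits of an IPv4 address (assumed scrubbing method)."""
    value = int(ipaddress.IPv4Address(ip))
    mask = (0xFFFFFFFF >> bits) << bits
    return str(ipaddress.IPv4Address(value & mask))

def anonymity_set_size(bits: int) -> int:
    """Number of distinct addresses that collapse to the same masked value."""
    return 2 ** bits

# Compare masking 6, 7 and 8 bits of the example address.
for bits in (6, 7, 8):
    masked = mask_low_bits("173.192.103.121", bits)
    others = anonymity_set_size(bits) - 1
    print(f"{bits} bits masked: {masked}, hidden among at most {others} others")
```

Even the full eight-bit mask leaves only 256 candidate addresses, and that count includes addresses that may not be assigned to anyone.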
By itself, this is a laughable level of anonymity. However, it gets worse.
First, remember that Google will not delete or anonymize user cookies from the logs when it slightly smudges IP addresses after nine months. Second, remember that as long as you use a Google Web property at least once every two years, the company will maintain a unique identifiable cookie value within your Web browser.
Thus, consider the following scenario:
In June 2008, a user from 173.192.103.121 with cookie value 12345 conducts a search for "breast cancer risks." Nine months later, in March 2009, the company scrubs some portion of the IP address, perhaps to 173.192.103.1XX. However, the cookie remains in the log.
In April 2009, that same user returns to Google and conducts a search for "stephen colbert youtube videos," again from the same IP address and with the same cookie value 12345.
Even though the 9-month-old search logs have been "anonymized," the cookie values remain, so it is trivial to match the newer searches to the older ones and thus completely reverse the anonymization process.
The simple truth is that any IP anonymization technique, no matter how strong or weak, is a waste of time if cookie values are not also anonymized.
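The scenario above can be sketched in a few lines. The record layout and the "X" placeholder for scrubbed digits are assumptions made for illustration; nothing here reflects Google's actual log format.

```python
from collections import defaultdict

# Hypothetical log records: (month, ip_as_stored, cookie, query).
logs = [
    # June 2008 search, as stored after the nine-month scrub.
    ("2008-06", "173.192.103.1XX", "12345", "breast cancer risks"),
    # April 2009 search, still carrying the full address.
    ("2009-04", "173.192.103.121", "12345", "stephen colbert youtube videos"),
]

def link_by_cookie(records):
    """Group log entries by cookie value, ignoring the IP address entirely."""
    by_cookie = defaultdict(list)
    for month, ip, cookie, query in records:
        by_cookie[cookie].append((month, ip, query))
    return by_cookie

def recover_full_ips(entries):
    """Return every unscrubbed IP address seen under one cookie."""
    return {ip for _, ip, _ in entries if "X" not in ip}

linked = link_by_cookie(logs)
print(recover_full_ips(linked["12345"]))
print([query for _, _, query in linked["12345"]])
```

A single dictionary lookup on the cookie ties the scrubbed June record back to the full address logged in April.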
Unfortunately, Google is relying on the fact that the mainstream media (I'm looking at you, New York Times and Washington Post), and seemingly most of the technology press, are clueless on these issues. Google's new anonymization policy is totally worthless, and the company deserves to be called out for its deception.
Disclaimer: I interned at Google during the summer of 2006 and received a $5,000 Google fellowship in both 2006 and 2007. I have also interned or worked for both the Electronic Privacy Information Center (EPIC) and the American Civil Liberties Union (ACLU) of Northern California, public interest groups that have been extremely critical of Google's privacy policies.
European regulators sent shock waves through the search engine industry earlier this week, when they proposed significantly tighter rules for logging data. If the EU adopts the proposed rules, Google, Yahoo and Microsoft will have to significantly reduce the amount of time they keep identifying search logs, and will have to start treating IP addresses as personally identifiable data -- something that Google has been particularly vocal in opposing.
Google has recently engaged in a major public relations effort to make a credible argument for keeping log data. The company has trotted out respected employee researchers to make the case that deleting such data will hurt search results. When all of their claims are analyzed, however, one thing becomes clear: It's all about the money (and the clicks).
Google has a genuine need to retain detailed log information on one kind of user: Those who click on ads. However, in order to avoid creating a situation where only clickers lose their privacy, the company logs data on all searchers instead. That is, the privacy of millions is threatened, to protect the incentive for users to click on ads.
The excuses
Over the last few months, a number of Google's engineers have issued public statements on the company's public policy blog to defend its much-criticized log data retention policies. The company claims that the data can be used to hunt down malware, to catch people defrauding its advertising system, and to improve search results, especially localized results.
Google claims that accurate logging data can improve localized searches. This data is used to respond intelligently to searches, such that a search for "GM" will return General Motors-related information for an American user, while someone in France is presented with information on "Guerre Mondiale" (World War).
What Google has done here is attempt to muddy the waters of the debate. Yes, accurate logging data improves localized searches. However, the company does not need to retain the exact network address (known as an IP address) of each and every search. Instead of tracking my searches by my network address, 129.53.136.23, the company could log that I came from San Francisco, California. That, in itself, would be more than enough information to help it localize and improve search results.
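As a rough sketch of what region-only logging might look like: the lookup table below is entirely hypothetical (a real system would consult a geolocation database), and the function names are mine.

```python
import ipaddress

# Hypothetical network-to-region table, for illustration only.
REGIONS = {
    ipaddress.ip_network("129.53.0.0/16"): "San Francisco, California",
}

def coarse_location(ip: str) -> str:
    """Map an IP address to a coarse region, or 'unknown' if unmapped."""
    addr = ipaddress.ip_address(ip)
    for network, region in REGIONS.items():
        if addr in network:
            return region
    return "unknown"

def log_entry(ip: str, query: str) -> dict:
    """Build a log record that keeps the region but never stores the address."""
    return {"region": coarse_location(ip), "query": query}

print(log_entry("129.53.136.23", "GM"))
```

The stored record still supports localization, yet it no longer contains anything that points to a single computer.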
Avoiding disincentives
Of all the excuses that Google's puppets have presented for retaining search logs, there is only one case where Google actually has a legitimate need to store information that identifies the individual user and network address: advertising clicks.
Google is an advertising company first, and a search engine second. Sometimes, we forget this, but Google has a lot of bills to pay. After all, those free meals and massages for employees have to be paid for somehow.
Google displays text advertisements on all of its web search results pages. Advertisers, for the most part, pay per click. That is, every time a user clicks on one of the ads, Google charges an advertiser a few cents (or dollars, depending on the search term). Because of the amounts of money at play, this tends to attract criminals wishing to defraud the system. Thus, it is not terribly surprising that Google wishes to retain information on the user who clicked.
What is most interesting to note, though, is that if a user does not click on one of Google's web advertisements, the only credible reason for retaining detailed search information becomes moot. If a user doesn't click, they can't possibly be engaged in click fraud, and thus there is no reason to retain identifying information on the user's search.
Were Google to institute a logging policy based on actual information needs, it would find itself in a curious position: users who clicked on advertisements would have detailed logs retained for months, if not years, while users who didn't click on ads would quickly have any identifying information scrubbed from logs and replaced with more generalized info.
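Such a policy is simple to express. The following is only a sketch of the idea, with field names I have made up; it says nothing about how Google's logging actually works.

```python
def build_log_record(ip, cookie, query, region, clicked_ad):
    """Keep identifying fields only when an ad click creates a fraud-audit need."""
    if clicked_ad:
        # Clickers: retain full detail for click-fraud investigation.
        return {"ip": ip, "cookie": cookie, "query": query, "clicked_ad": True}
    # Non-clickers: store only generalized information.
    return {"region": region, "query": query, "clicked_ad": False}

clicker = build_log_record("129.53.136.23", "12345", "GM",
                           "San Francisco, California", True)
browser = build_log_record("129.53.136.23", "12345", "GM",
                           "San Francisco, California", False)
print(clicker)
print(browser)
```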
The obvious problem with such a scenario would be that of incentives, especially once the policy was made public. Users would lose their privacy each time they clicked on an advertisement. Unfortunately for the company, this is exactly the wrong kind of message to send. It wants to encourage users to click on its text ads, not to provide incentives for customers to skip them.
Thus, to avoid creating that situation and the disincentive to click on ads, Google logs data on every search, by every user. And because of this, we all suffer -- even those users who never even see ads, because they use technologies like AdBlockPlus and CustomizeGoogle.
Disclaimer: In 2006, I worked as a summer intern on Google's click fraud team. Shuman Ghosemajumder, Google's "Business Product Manager for Trust & Safety" and the person claiming that search logs prevent fraud, worked on the same team.
None of the information in this blog post involves confidential company information.
I was awarded a Google fellowship in both 2006 and 2007, for $5,000 each time. Finally, I just returned from a Scholar Retreat in San Francisco, which the company paid for.