Big Blue is experimenting with an idea for customer databases called data randomization. The technique will, conceivably, preserve consumer privacy by masking data such as income, age, past purchases or medical information through mathematical calculations that can't be unwound.
For instance, if a customer submits their age as 38 when registering at an online shopping site, a randomizing plug-in in their browser will add a random number between minus 25 and 112 to that age and send only the sum to the server.
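To make the browser-side step concrete, here is a minimal Python sketch. It is not IBM's code. The function name `randomize_age` is made up for illustration, and it assumes the random number is an integer drawn uniformly from the range quoted above.

```python
import random

# Noise bounds taken from the example in the article; a real system
# would pick its own range (assumption: integers, uniformly distributed).
NOISE_LOW, NOISE_HIGH = -25, 112

def randomize_age(true_age, rng=random):
    """Return the true age plus a uniform random integer offset."""
    return true_age + rng.randint(NOISE_LOW, NOISE_HIGH)

rng = random.Random(0)  # seeded only so the sketch is repeatable
submitted = randomize_age(38, rng)
print(submitted)  # somewhere between 13 and 150; the server never sees 38
```

Only `submitted` leaves the browser, so a single record says very little about the person behind it.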
The wrinkle is that, at the back end, computers then apply a barrage of calculations to the scrambled data to discern patterns among all customers. The 38-year-old individual's true age can never be recovered, but an online business can somewhat accurately figure out how popular it is with 38-year-olds. Unscrambled data collected by the company--such as how much a person paid for a car and on what date--could subsequently be randomized too, for additional privacy.
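A rough Python sketch shows why patterns survive the scrambling, using the simplest possible pattern: the average age. It assumes the server knows the noise range (minus 25 to 112 here) and that the noise is uniform. The customer ages are invented for the demonstration, and `estimate_mean` is a hypothetical helper, not part of any IBM tool.

```python
import random

def estimate_mean(perturbed, noise_low, noise_high):
    """Estimate the mean of the hidden true values.

    Assumes each perturbed value is true value + uniform integer noise
    in [noise_low, noise_high], so the noise has a known average.
    """
    noise_mean = (noise_low + noise_high) / 2
    return sum(perturbed) / len(perturbed) - noise_mean

rng = random.Random(42)
# Invented customer base: 50,000 ages between 18 and 80.
true_ages = [rng.randint(18, 80) for _ in range(50_000)]
perturbed = [age + rng.randint(-25, 112) for age in true_ages]

true_mean = sum(true_ages) / len(true_ages)
est = estimate_mean(perturbed, -25, 112)
print(round(true_mean, 2), round(est, 2))  # the two should be close
```

The individual offsets cancel out across many customers, which is why the estimate improves as the customer list grows while any one record stays masked.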
"The basic notion, in some sense, is kind of heresy in computer science. The normal notion is, in order to do a good job, you need to have accurate information," said Rakesh Agrawal, a senior fellow at IBM who is leading the research. "And here we are saying, 'You have good information, and we are going to perturb it or put errors into it to protect people's privacy.'"
A boon to privacy?
I find data randomization appealing on two levels. First, it's a healthy reminder of why we have big companies in the first place. They exist to hire the math geniuses and chemistry whizzes of the world, who in turn build the society of tomorrow. Without them, the Wheelo would stand as the apex of scientific achievement.
Second, it represents an opportunity to defuse the ugly conflict over privacy. A large--and seemingly growing--number of consumers are furious about how companies and institutions collect, trade and transmit their data.
In reality, most of the harvested data is never exploited for nefarious purposes. Using an ATM card does create an electronic trail of your life, but it's not as if FBI agents are sitting around right now looking at your file and thinking, "He's eaten at Carl's Jr. three times in the last month. Wanna bet he goes there again in five days?"
Still, consumers resent the practice, and the Federal Trade Commission has made protecting consumer privacy a high priority.
To spoof data harvesting, people often lie, but that actually doesn't work. Companies can reconstruct basic data patterns. "It turns out that people are not very good at lying," Agrawal said. "Essentially people leave tell-tale signs."
The randomization system relies on determining the relationship between different values through Bayesian probability. Consumers fill in their true data, which then gets randomized before being sent over.
At the corporate end, servers then try to determine what type of randomizing calculations were applied to scramble the original values.
"We basically ask the following question: 'What could have generated this distribution?'" Agrawal said.
If the computer can come up with the likely randomizing technique that was employed--adding a random number between 15 and 87, or subtracting one between 8 and 32, for example--it can then draw a chart that accurately simulates what the customer base looks like. In several contained trials, the reconstructed curve differed from the curve plotted by the original data by two to three percent.
"It comes back to the true distribution, always. This is the beauty of math, fortunately or unfortunately," Agrawal said. "I think the key insight was that you don't have to have access to precise information to build good models."
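The reconstruction step can be sketched in Python as an iterative application of Bayes' rule. This is an illustration under stated assumptions, not IBM's implementation. It assumes the server already knows the noise was a uniform integer in a fixed range (minus 20 to 20 here, chosen arbitrarily) and that every true age lies between 0 and 99. The function `reconstruct` and the two-cluster customer base are hypothetical.

```python
import random
from collections import Counter

def reconstruct(perturbed, noise_low, noise_high, domain, iterations=50):
    """Estimate P(true value = a) for each a in `domain`.

    Assumes uniform integer noise in [noise_low, noise_high] and that
    every true value lies in `domain`.
    """
    n = len(perturbed)
    counts = Counter(perturbed)
    noise_prob = 1.0 / (noise_high - noise_low + 1)

    def f_noise(y):
        return noise_prob if noise_low <= y <= noise_high else 0.0

    # Start from a uniform guess, then refine it repeatedly.
    est = {a: 1.0 / len(domain) for a in domain}
    for _ in range(iterations):
        new_est = dict.fromkeys(domain, 0.0)
        for w, c in counts.items():
            # Bayes' rule: posterior over the true value given observed w.
            weights = {a: f_noise(w - a) * est[a] for a in domain}
            total = sum(weights.values())
            if total == 0.0:
                continue  # w is impossible under the assumptions; skip it
            for a, wt in weights.items():
                new_est[a] += c * wt / (total * n)
        est = new_est
    return est

rng = random.Random(7)

def draw_age():
    # Invented customer base: two age clusters, clamped to 0..99.
    centre = 30 if rng.random() < 0.7 else 60
    return min(99, max(0, round(rng.gauss(centre, 5))))

true_ages = [draw_age() for _ in range(20_000)]
perturbed = [a + rng.randint(-20, 20) for a in true_ages]

estimate = reconstruct(perturbed, -20, 20, range(100))
est_mean = sum(a * p for a, p in estimate.items())
true_mean = sum(true_ages) / len(true_ages)
print(round(true_mean, 1), round(est_mean, 1))
```

How closely the estimated curve tracks the real one depends on the number of customers, the width of the noise and the number of iterations, so the sketch makes no claim about matching the two to three percent figure reported from IBM's trials.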
IBM continues to conduct trials with the technology, but Agrawal already sees some areas where it could bring benefits. Large businesses such as rental car companies could pool their data without the risk of disclosing customer lists. Hospitals could give access to records about a hepatitis outbreak without being sued. Network break-ins would become potentially less dangerous.
And when filling out a customer questionnaire at Home Depot, you won't feel compelled to claim you have 16 kitchens.
Biography
Michael Kanellos is editor at large at CNET News.com, where he covers hardware, research and development, start-ups and the tech industry overseas. He has worked as an attorney, travel writer and sidewalk hawker for a time share resort, among other occupations.

"He's eaten at Carl's Jr. three times in the last month, let's sell him these large size pants. Oh yeah, we'll raise his health insurance premium too."
But I don't understand what the point of having statistics at all is if they're random to an extent. Doesn't that defeat the purpose?
IBM are old-timers at this computing game. I learned programming on the mammoth old IBM 360 at Columbia University, because it was a prerequisite for courses in Operations Research (applied statistics and probability theory, wargames, et al.).
Over the past twenty-one years, life has taught me how critical those lessons were. IBM wrote the book on how to sidestep invasive info tech. The firm understood the potential threats to personal, corporate and government security, and the potential abuse of basic civil liberties, all before 1975.
Well done, Big Blue, as always...
In the years I've enjoyed their fine publication, I have found that they couldn't care less about your websurfing habits.
All they care about is staying on top of their game as journalists.
And as journalists, the firm has no competition, and sides with no particular organizations, advertiser or not.
Their only critical Terms of Service are reporting well-written tech journalism in real time, and making sure you don't post offensive nonsense.
- Not enough
- by November 24, 2004 6:03 PM PST
- If you can apply a mathematical formula, even with random numbers and even random functions, there will be a way to unravel it. It might take a ton of work, but if you can get all the encoded data, it will eventually be cracked. Advertising it as a perfect privacy solution preys on the majority of people who have woeful mathematical skills.
- Reply
- by unknown unknown November 24, 2004 8:34 PM PST
- I think you may have misunderstood. The system IBM developed still allows them to get statistics about their general user population, but the randomizing prevents them from getting information about a specific user. If I give you my age plus a random number, it would be very hard to get my age out of it and know that you're correct without asking me. When you consider the number of entries in these databases, it becomes quite impractical to even try.