November 24, 2004 10:26 AM PST

Perspective: Privacy's random answer

If IBM is right, corporate databases in the future might record your age as 157 and your income as the square root of two.

Big Blue is experimenting with an idea for customer databases called data randomization. The technique will, conceivably, preserve consumer privacy by masking data such as income, age, past purchases or medical information through mathematical calculations that can't be unwound.

For instance, if a customer submits their age as 38 when registering at an online shopping site, a randomizing plug-in in their browser software will add a random number between minus 25 and 112 to their age and send the resulting sum, rather than the true age, to the server.
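
A minimal sketch of that client-side step, in Python. The function name and constants are hypothetical and only mirror the numbers in the example above; this is not IBM's plug-in, and the choice of a uniform integer offset is an assumption.

```python
import random

# Assumed noise range, taken from the article's example (minus 25 to 112).
NOISE_MIN, NOISE_MAX = -25, 112

def randomize_age(true_age, rng=random):
    """Return the true age plus a uniform random integer offset.

    Only this perturbed value would leave the customer's machine.
    """
    return true_age + rng.randint(NOISE_MIN, NOISE_MAX)

rng = random.Random(0)            # seeded only to make the demo repeatable
reported = randomize_age(38, rng)
print("value sent to the server:", reported)
```

With this range, a 38-year-old could be recorded as anything from 13 to 150, which is how a database might end up holding an age such as 157 for an older customer.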

The wrinkle is that, at the back end, computers then apply a barrage of calculations to the scrambled data to discern patterns among all customers. The 38-year-old individual's true age can never be recovered, but an online business can somewhat accurately figure out how popular it is with 38-year-olds. Unscrambled data collected by the company--such as how much a person paid for a car and on what date--could subsequently be randomized too, for additional privacy.

"The basic notion, in some sense, is kind of heresy in computer science. The normal notion is, in order to do a good job, you need to have accurate information," said Rakesh Agrawal, a senior fellow at IBM who is leading the research. "And here we are saying, 'You have good information, and we are going to perturb it or put errors into it to protect people's privacy.'"

A boon to privacy?
I find data randomization appealing on two levels. First, it's a healthy reminder of why we have big companies in the first place. They exist to hire the math geniuses and chemistry whizzes of the world, who in turn build the society of tomorrow. Without them, the Wheelo would stand as the apex of scientific achievement.

Second, it represents an opportunity to defuse the ugly conflict over privacy. A large--and seemingly growing--number of consumers are furious about how companies and institutions collect, trade and transmit their data.

In all reality, most of the harvested data is never exploited for nefarious purposes. Using an ATM card does create an electronic trail of your life, but it's not like the FBI agents are sitting around right now looking at your file and thinking, "He's eaten at Carl's Jr. three times in the last month. Wanna bet he goes there again in five days?"

Still, consumers resent the practice, and the Federal Trade Commission has made protecting consumer privacy a high priority.

To spoof data harvesting, people often lie, but that actually doesn't work. Companies can reconstruct basic data patterns. "It turns out that people are not very good at lying," Agrawal said. "Essentially people leave tell-tale signs."

The randomization system relies on determining the relationship between different values through Bayesian probability. Consumers fill in their true data, which then gets randomized before being sent over.

At the corporate end, servers then try to determine what type of randomizing calculations were applied to scramble the original values.

"We basically ask the following question: 'What could have generated this distribution?'" Agrawal said.

If the computer can come up with the likely randomizing technique that was employed--adding a random number between 15 and 87, or subtracting one between 8 and 32, for example--it can then draw a chart that accurately simulates what the customer base looks like. In several contained trials, the reconstructed curve differed from the curve plotted by the original data by two to three percent.

"It comes back to the true distribution, always. This is the beauty of math, fortunately or unfortunately," Agrawal said. "I think the key insight was that you don't have to have access to precise information to build good models."
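
The reconstruction idea can be illustrated with a small Python sketch. It is a simplified illustration, not IBM's code: it assumes the noise is a uniform integer offset whose range the server already knows (the article describes the server inferring it), and it repeatedly applies Bayes' rule to refine a guess at the distribution of true values. The function and parameter names are hypothetical.

```python
import random

def reconstruct(observed, values, noise_lo, noise_hi, iterations=50):
    """Estimate the distribution of true values from randomized ones.

    Assumes observation = true value + integer noise drawn uniformly
    from [noise_lo, noise_hi], with that range known to the server.
    """
    noise_p = 1.0 / (noise_hi - noise_lo + 1)

    def noise_prob(d):
        return noise_p if noise_lo <= d <= noise_hi else 0.0

    # Group identical observations so each is processed once per pass.
    counts = {}
    for w in observed:
        counts[w] = counts.get(w, 0) + 1
    n = len(observed)

    # Start from a uniform guess over the candidate true values.
    est = {v: 1.0 / len(values) for v in values}
    for _ in range(iterations):
        new = {v: 0.0 for v in values}
        for w, c in counts.items():
            # Bayes' rule: weight each candidate by prior * noise likelihood.
            post = {v: noise_prob(w - v) * est[v] for v in values}
            total = sum(post.values())
            if total == 0:
                continue  # observation not explainable by any candidate
            for v in values:
                new[v] += c * post[v] / total
        est = {v: new[v] / n for v in values}
    return est

# Demo with made-up data: 60% of customers aged 25-35, 40% aged 50-60.
rng = random.Random(1)
true_ages = [rng.randint(25, 35) if rng.random() < 0.6 else rng.randint(50, 60)
             for _ in range(5000)]
scrambled = [a + rng.randint(-20, 20) for a in true_ages]

est = reconstruct(scrambled, list(range(18, 81)), -20, 20)
true_share = sum(a < 40 for a in true_ages) / len(true_ages)
est_share = sum(p for v, p in est.items() if v < 40)
print("true share under 40:", round(true_share, 3))
print("estimated share under 40:", round(est_share, 3))
```

The server in this sketch never sees an individual's true age; it only recovers an estimate of how ages are spread across the whole customer base.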

IBM continues to conduct trials with the technology, but Agrawal already sees some areas where it could bring benefits. Large businesses such as rental car companies could pool their data without the risk of disclosing customer lists. Hospitals could give access to records about a hepatitis outbreak without being sued. Network break-ins would become potentially less dangerous.

And when filling out a customer questionnaire at Home Depot, you won't feel compelled to claim you have 16 kitchens.

Biography
Michael Kanellos is editor at large at CNET News.com, where he covers hardware, research and development, start-ups and the tech industry overseas. He has worked as an attorney, travel writer and sidewalk hawker for a time share resort, among other occupations.

Nefarious use?
by November 24, 2004 1:04 PM PST
If the data is available to anybody, the discussion will be more like
"He's eaten at Carl's Jr. three times in the last month, let's sell him these large size pants. Oh yeah, we'll raise his health insurance premium too."
Randomization
by November 24, 2004 2:35 PM PST
First of all, was my information randomized when I sent it to register with news.com? :-P
But I don't understand what the point of having statistics at all is if they're random to an extent. Doesn't that defeat the purpose?
Reply
by unknown unknown November 24, 2004 8:40 PM PST
The random number added to the data is within a finite range. If you take enough samples, despite having introduced a bit of randomness, it will still come out as approximately normally distributed. In other words they can make inferences about population means using the normal distribution no matter what the distribution of the population being sampled from (the central limit theorem if I remember my statistics).
Thank the Lord IBM Understands
by malabrm1 November 24, 2004 4:05 PM PST
Over the years, I've told Amazon that I'm 952 y/o, Yahoo that I'm 6 y/o, and other sites such nonsense just to avoid receiving their spam and tracking codes. So, sue me.

IBM are old-timers at this computing game. I learned programming on the mammoth old IBM 360 at Columbia University, because it was a prerequisite for courses in Operations Research (applied stat and probability theory, wargames, et al).

Over the past twenty-one years, life has taught me how critical those lessons were. IBM wrote the book on how to sidestep invasive info tech. The firm understood the potential threats to personal, corporate and government security, and potential abuse of basic civil liberties; all of this, before 1975.

Well done, Big Blue, as always...
BTW...Re: CNET...
by malabrm1 November 24, 2004 4:21 PM PST
You can rest assured this site is as cool as it comes about respecting your privacy.

I have found that they could care less about your websurfing habits in the years I've enjoyed their fine publication.

All they care about is staying on top of their game as journalists.

And as journalists, the firm has no competition, and sides with no particular organizations; advertiser or not.

Their only critical Terms of Service are reporting well-written tech journalism in real time, and making sure you don't post offensive nonsense.
Not enough
by November 24, 2004 6:03 PM PST
If you can apply a mathematical formula, even with random numbers and even random functions, there will be a way to unravel it. It might take a ton of work, but if you can get all the encoded data, it will eventually be cracked. To advertise it as a perfect privacy solution preys on the majority of people who have woeful mathematical skills.
Reply
by unknown unknown November 24, 2004 8:34 PM PST
I think you may have misunderstood. The system IBM developed still allows companies to get statistics about their general user population, but the randomizing prevents them from getting information about a specific user. If I give you my age plus a random number, it would be very hard to get my age out of it and know that you're correct without asking me. When you consider the number of entries in these databases, it becomes quite impractical to even try.