Outside the Lines

September 22, 2008 7:12 PM PDT

Here come the numerati

by Dan Farber

The pile of digital data is growing, doubling every 18 months or less. That pile is the new gold, drawing data miners hoping to strike it rich by finding patterns and uncovering insights that can lead to more efficient markets, higher productivity, safer streets, and the much-loved increase in profits.

Stephen Baker's new book, The Numerati (Houghton Mifflin), introduces some of the data miners, or numerati, who are leading efforts to probe the depths of the global data dump.

He profiles several numerati, focusing more on the personalities and potential use cases than the arcane details of the computer science and mathematics. Baker, who has written for BusinessWeek for more than 20 years, paints a rich portrait of how the flood of data and the efforts of the numerati will transform shopping, marketing, politics, health care, matchmaking, work, medicine, and other disciplines.

"Just as they've helped medical researchers find genetic markers pointing to certain types of breast cancer and Huntington's disease, they might tell grocers what type of fruit to promote to buyers of canned food or what kinds of magazines dog-food buyers tend to read," he wrote.

IBM researcher and featured numerati Samer Takriti is building detailed mathematical models of 50,000 of his colleagues. Baker describes Takriti's ultimate goal as follows:

"The goal here is to build entire models, complete with each person's quirks, daily commute, and allies and enemies. These models might one day include whether they eat beef or pork, how seriously they take the Sabbath, whether a bee sting or a peanut sauce could lay them low. No doubt, some of them thrive even in the filthy air in Beijing or Mexico City, while others wheeze. If so, the models would eventually include this detail, among countless others. Takriti's job is to depict flesh-and-blood humans as math."

In practical application, data processed by numerati from calendars, instant messaging, e-mail, cell phones, social networks, project records, resumes, and other sources could render a digital portrait of each worker. Machines could handily determine the optimal group for a specific project, taking into account budgetary, geographical, and other constraints.

The data could also be used to ferret out employees who aren't fulfilling their productivity quotient or are bypassing the chain of command. Companies already have technology installed to monitor e-mail for spam, porn, and other abuses, so they might as well use it to see what people in the company are thinking, Baker told me in a conversation last week. He acknowledged the significant privacy issues that come with unleashing numerati on the world of data, and he addresses them in his book:

"At work, perhaps more than anywhere else, we are in danger of becoming data serfs--slaves to the information we produce," he wrote.

"Part of what needs to be calculated is how much this freaks out workers. It impacts productivity and the morale of employees. If a big technology company gets a reputation for monitoring every keystroke, the smart people will choose to work elsewhere. Companies have to figure out what works and what is overkill or freaks people out," Baker told me.

He states in his book that the "mathematical modeling of humanity ... promises to be one of the great undertakings of the twenty-first century." This concept could be applied to Google and other companies that are extracting and analyzing billions of digital signals generated by individuals and groups.

Computer science and applied math make data divination possible, but having the means doesn't necessarily justify the ends. The same technology used to determine the mathematical model of a terrorist or poor performer in the workplace can be used to violate the privacy and rights of unsuspecting, innocent people.

Baker told me in our conversation that we need tools to decide what information to share and with whom. Some of the social networks and major Web sites are working on that problem, but the solutions so far are inadequate. We'll need a generally accepted Bill of Rights for personal data to give the numerati and their overseers guidance on how to avoid "evil" in the evolving digital world. Of course, that is wishful, optimistic, and, perhaps, naive thinking.

May 20, 2008 8:20 AM PDT

Aster Data Systems offers cluster for deep insights

by Dan Farber

Taking a cue from Google, Aster Data Systems has come up with a massively parallel processing analytical engine and cluster of commodity hardware for extracting insight from hundreds of terabytes of data. MySpace has deployed 100 nodes of the Aster "nCluster" to load millions of rows per second to surface trends that can help the company fine-tune its services.

Each Aster nCluster node is a dual-processor quad-core Intel Xeon system with 16GB of RAM and four 250GB SATA disks, and the nodes are interconnected via 24-port 1Gb Ethernet switches. nCluster works with the popular business intelligence and ETL tools, and it can talk to the standard relational databases.

The secret sauce, according to the company, is patent-pending algorithms and processes for partitioning, balancing, replication, and querying across nCluster. Pricing is based on the amount of customer data processed.

Aster's architecture is structured in independently scalable tiers, each of which adds a degree of freedom to the customer. The Aster Worker tier, where data is stored on locally attached disks, can be scaled to increase query performance and volume. The Aster Loader tier can scale independently to increase load throughput. This enables massively parallel processing for extraction and loading. Once the data is loaded, user queries are intelligently routed to each node to process only relevant data. This enables query load balancing to eliminate hot spots and increase performance, returning results in seconds or minutes versus hours (or incomplete results). Source: Aster Data Systems
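Aster has not published how its patent-pending partitioning and routing work, so the following is only a rough sketch of the general idea behind sending a query to the nodes that hold the relevant data. It assumes simple hash partitioning on a key. The names `NUM_WORKERS`, `worker_for`, `load`, and `query_user`, and the `user_id` field, are all hypothetical and not part of any Aster API.

```python
import hashlib
from collections import defaultdict

NUM_WORKERS = 4  # hypothetical cluster size, not an Aster figure


def worker_for(key, num_workers=NUM_WORKERS):
    """Map a partition key to a worker index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers


def load(rows, num_workers=NUM_WORKERS):
    """Loader step: spread rows across workers by their user_id."""
    workers = defaultdict(list)
    for row in rows:
        workers[worker_for(row["user_id"], num_workers)].append(row)
    return workers


def query_user(workers, user_id, num_workers=NUM_WORKERS):
    """Query step: visit only the worker that can hold this user's rows."""
    target = worker_for(user_id, num_workers)
    return [row for row in workers.get(target, []) if row["user_id"] == user_id]


rows = [
    {"user_id": "u1", "page": "home"},
    {"user_id": "u2", "page": "music"},
    {"user_id": "u1", "page": "photos"},
]
workers = load(rows)
print(query_user(workers, "u1"))
```

Because the same hash decides both where a row is stored and where a query is sent, a lookup for one user touches a single worker rather than the whole cluster. A production system would also have to handle replication and rebalancing, which this sketch ignores.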

In the tradition of Google, Yahoo, and other Silicon Valley start-ups, Aster was co-founded by three Stanford computer science Ph.D. students and funded by Silicon Valley VCs and angel investors.

Aster is entering a crowded field (see below), but its Google-like approach to data warehousing could reset expectations.

(Credit: Gartner)

See also: HP takes aim at Teradata with Neoview mousetrap

March 26, 2008 5:10 AM PDT

Mail Trends looks deep into your in-box

by Dan Farber

Sorting out the overload of e-mail is one of the largely unsolved problems of computing. The first step is analyzing your in-box, which is what Google developer Mihai Parparita has done with Mail Trends, a program that lets users analyze and visualize their in-box.

Mail Trends, which is similar to Google Reader Trends, extracts data from IMAP servers and displays statistics such as distribution of messages by year, month, day, day of week, and time of day; distribution by message size; a breakdown of top senders, recipients, and mailing lists; distribution of senders, recipients, and mailing lists over time; and distribution of thread lengths and the lists and people that result in the longest threads.
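To give a feel for the kind of counting involved, here is a minimal sketch that tallies messages by day of week, hour, and sender from their headers. It is not Mail Trends code. The `summarize` function and the sample headers are hypothetical, and fetching the headers from an IMAP server is left out.

```python
from collections import Counter
from email.utils import parseaddr, parsedate_to_datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def summarize(headers):
    """Count messages by weekday, hour, and sender address.

    headers is an iterable of (from_header, date_header) string pairs.
    """
    by_weekday = Counter()
    by_hour = Counter()
    by_sender = Counter()
    for from_header, date_header in headers:
        when = parsedate_to_datetime(date_header)
        by_weekday[DAYS[when.weekday()]] += 1
        by_hour[when.hour] += 1
        _, address = parseaddr(from_header)
        by_sender[address.lower()] += 1
    return by_weekday, by_hour, by_sender


# Made-up sample headers, for illustration only.
sample = [
    ("Alice <alice@example.com>", "Mon, 24 Mar 2008 09:15:00 -0700"),
    ("Bob <bob@example.com>", "Mon, 24 Mar 2008 17:40:00 -0700"),
    ("Alice <alice@example.com>", "Wed, 26 Mar 2008 05:10:00 -0700"),
]
by_weekday, by_hour, by_sender = summarize(sample)
print(by_weekday.most_common())
print(by_sender.most_common(1))
```

The hour is taken in the time zone recorded in each message's Date header, which is one of several reasonable choices for a time-of-day chart.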

Via Googlified

An example of Mail Trends output running on a small portion of the Enron Email Dataset, a corpus of about 500,000 messages that was made available by the Federal Energy Regulatory Commission during its investigation of Enron.

(Credit: Google)

Parparita notes that Mail Trends is at an early stage of development. It currently lacks support for non-Gmail servers and the capability to split out sent and starred e-mail. You can follow progress on the project on this Mail Trends page.

What's still missing is a way to turn the analysis into proactive in-box management: a software agent that automatically sorts your in-box, makes calendar appointments, and routes messages.

Startup Xobni ("inbox" spelled backwards) is attempting to manage e-mail overload for Microsoft Outlook users. It includes some data analysis, such as how users and their contacts use e-mail, as well as some more proactive features. For example, Xobni shows recent e-mail conversations and files exchanged with a contact, and a list of related contacts. It also predicts when you would be most likely to get a response from a contact.

Microsoft Research has been working for years to come up with what it calls "e-mail triage." Apparently, Microsoft hasn't been able to turn the research into a product. TechCrunch has suggested that Microsoft is in negotiations to acquire Xobni.

While Mail Trends is interesting to look at, Mail Triage would be much more useful. With all those engineers at Google devoting 20 percent of their time to personal projects, solving the Mail Triage problem would be a good way to get promoted and improve Gmail.

March 9, 2008 8:05 AM PDT

Presidential election insight via data visualization

by Dan Farber

Dow Jones Insight is applying text analysis to thousands of documents to measure trends, such as favorability and issue coverage of the presidential candidates over time. In the first example below, Dow Jones Insight parsed 26,435 documents, including English-language newspapers, magazines, broadcast transcripts, and news wire services.

Favorable and unfavorable ratings are assigned based on the words in proximity to a candidate's name. Neutral documents are excluded.
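Dow Jones has not disclosed its scoring method, so the following is only an illustrative sketch of proximity-based rating: count favorable and unfavorable words within a few tokens of a candidate's name. The word lists, the `rate` function, and the five-word window are all assumptions made for the example.

```python
import re

# Hypothetical lexicons; a real service would use far larger ones.
FAVORABLE = {"strong", "popular", "wins", "praised"}
UNFAVORABLE = {"weak", "criticized", "loses", "scandal"}


def rate(document, candidate, window=5):
    """Rate a document by the words that appear near a candidate's name."""
    tokens = re.findall(r"[a-z']+", document.lower())
    name = candidate.lower()
    score = 0
    for i, token in enumerate(tokens):
        if token != name:
            continue
        # Look at up to `window` tokens on each side of the name.
        nearby = tokens[max(0, i - window): i + window + 1]
        score += sum(word in FAVORABLE for word in nearby)
        score -= sum(word in UNFAVORABLE for word in nearby)
    if score > 0:
        return "favorable"
    if score < 0:
        return "unfavorable"
    return "neutral"


print(rate("Voters praised Obama after the debate", "Obama"))
print(rate("Clinton was criticized over the scandal", "Clinton"))
print(rate("Obama spoke in Ohio", "Obama"))
```

Documents that come back as "neutral" would simply be dropped from the tallies, matching the exclusion described above. A simple word count like this cannot handle negation or sarcasm, which is one reason such ratings are best read as trends rather than verdicts.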

(Credit: Dow Jones Insight)

Differences in domestic issue coverage between Obama and Clinton are mostly negligible, but "terrorism" and "health care" show a noticeable gap. Obama has more mentions in close proximity to terrorism, while Clinton has the edge in health care.

(Credit: Dow Jones Insight)

Dow Jones is pitching the service as a generalized tool for analyzing the impact of media coverage, and it is targeting corporations, as well as political candidates, that need competitive insight. Other services, such as Nielsen BuzzMetrics and BuzzLogic, provide similar kinds of media analysis based on parsing a variety of document types.

via Paul Kedrosky


About Outside the Lines

Dan Farber is the editor in chief of CNET News. He has covered technology for more than two decades, and he previously served as editor in chief of ZDNet, PC Week and MacWeek. Outside the Lines explores the intersection of business and technology.
