August 19, 2003 4:00 AM PDT

Newsmaker: Microsoft's in-house sociologist

Ever get the feeling your Usenet newsgroup list is being watched? By Microsoft?

If so, consider yourself right. Thanks to the expertise of sociologist Marc Smith, Microsoft is keeping a close eye on newsgroups and other public e-mail lists, which it has identified as the Internet's undervalued "knowledge management application."

In Microsoft's research and development labs, Smith has spent the past several years slicing and dicing data about messages and message authors in an ambitious effort to help people make sense of the newsgroup manifold--the hordes of know-it-alls, flame warriors, spammers and neophytes who, by Smith's estimate, last year numbered more than 100 million in the Usenet network of e-mail threads, or newsgroups.

Smith's idea is that you can tell a lot about the quality of data by tracking its newsgroup contributors' social habits--a notion that holds promise for sorting through millions of messages, and peril for an online world increasingly skittish about invasions of privacy.

Following the launch of Microsoft's NetScan application for analyzing newsgroups and the people who post to them, Smith spoke to CNET News.com about NetScan, about Microsoft's interest in e-mail lists and about an application under development that would link objects in the real world to an array of online information.

How did a guy like you get to work for a company like Microsoft?
I'm a sociologist. I've now been at Microsoft Research about four-and-a-half years. Microsoft has a few social and cognitive psychologists, but I'm the only sociologist.

Which means what, exactly, in the context of technology employment?
A sociologist studies the attributes of relationships and the group of relationships that add up to a collective or a community. As a technology group, our mandate is to both explore and to build tools to study the phenomenon that we could call online community. We sociologists don't like to use the term "community," particularly--we like to refer to them as social cyberspaces.

What's wrong with "community"? The word seems to come up all the time when we talk about the Internet.
When we say "community," perhaps what we really are looking at is a special case of a broader phenomenon that sociologists call collective action, when a group of people do something together. And this turns out to be the No. 1 thing people do with their computers: It's to send each other e-mail. The No. 2 thing is to send groups of people e-mail--to join the list of people who like to knit, or who like Microsoft products.

So why exactly does Microsoft need a resident sociologist?
Microsoft has a big investment in online communities, and has not had until recently many tools to enhance that investment. What Microsoft wants around communities is what every enterprise does, which is a peer-support, knowledge-management application. And that means that if you go into Usenet, you'll find 3,000 Microsoft public newsgroups, with 1.5 million people posting 10 million messages. And that's 2002--and it's going to more than double this year, because it more than doubled in '01. We don't see traffic flagging at all.

My impression was that the use of e-mail lists was on the decline.
To the contrary! It's on the rise. Usenet alone--which is a backwater in that most people don't know where it is and how to find it--on Usenet alone there were 13.1 million unique identities who used Usenet in 2002, and by that we mean that they were a contributor and wrote at least one message. How many people read the message? We have no idea. That number is invisible and is fragmented over a half-million servers that are not sharing their data. But conservatively you could estimate that there are 10 readers for every writer, so that makes it 130 million Usenet users per year. And that's a small number compared to majordomo lists, or things like Yahoo Groups, and the number of people who have a bulletin board on things like UltimateBBS.

What are you doing with these lists, from a sociological standpoint?
What we are about is the thread. It turns out that the core sociological data type of the Internet is not IP (Internet Protocol) numbers, or any of that stuff, it's threaded conversations. And it's amazing how little investment has been put into adding value to the core data structure of the Internet, which is the conversational thread. I can illustrate that by suggesting that when you sit in front of your e-mail client, simply try to sort your messages by thread size.

And by size of the thread you mean...?
I mean the number of messages, the number of generations of messages, the breadth of the conversation. If eight people reply to a message, it has a breadth of eight. If 12 reply, it's 12. And it turns out that the frequency distribution of thread properties is very illuminating.
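As a rough illustration of the thread properties Smith lists, here is a minimal Python sketch. It is not NetScan code: the function name, the (message_id, parent_id) input format and the choice to report breadth as the largest number of direct replies to any single message are all assumptions made for the example.

```python
from collections import defaultdict


def thread_properties(messages):
    """Summarize one thread.

    messages: list of (message_id, parent_id) pairs (a hypothetical format);
    parent_id is None for the message that started the thread.
    """
    children = defaultdict(list)
    root = None
    for msg_id, parent_id in messages:
        if parent_id is None:
            root = msg_id
        else:
            children[parent_id].append(msg_id)

    if root is None:
        return {"size": len(messages), "generations": 0, "breadth": 0}

    # Walk the reply tree level by level; each level is one generation.
    generations = 0
    level = [root]
    while level:
        generations += 1
        level = [child for msg in level for child in children.get(msg, [])]

    # Assumption: breadth is the most direct replies any one message received.
    breadth = max((len(replies) for replies in children.values()), default=0)
    return {"size": len(messages), "generations": generations, "breadth": breadth}


# "a" starts the thread, "b" and "c" reply to it, "d" replies to "b".
example = [("a", None), ("b", "a"), ("c", "a"), ("d", "b")]
print(thread_properties(example))
```

Running this over every thread in a newsgroup would yield the kind of frequency distribution Smith describes, such as the share of threads that contain only two messages.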

It turns out that two-thirds of all threads in Usenet, in 2002, had a whopping two messages. And two-thirds of all authors are the people who write a message, post once one day, and never again.

Is that indicative of a spam problem?
No, those aren't spammers, they are the people who post once, get their answer and go away happy. They post a message that says they can't print, then they get their answer. What newsgroups are is a form of knowledge management application. What they are about is leveraging the collective knowledge of large numbers of people.

So how is it useful to know that people are getting their printing questions answered? What can you do with that information?
What you can do is say, "Let's look at how many times each of those unique IDs posted. Twenty-four million times? That's your spammer." Humans have a limited capacity to type and send and think up messages, while software is virtually free from those constraints. What we do is say, "By looking at these properties, the structure of authors, threads and newsgroups, we can determine a lot of things that are good predictors of value."

Here's an example: Let's say you have a newsgroup with 22,000 messages posted there per month. You have a problem! What should you read? We have some suggestions. In an existing browser, you can see the messages sorted by date, sorted by size or sorted alphabetically, and this is not very useful. What we want to say is, "There are different vectors through this content space, different ways of slicing into the data, the conversation, that are more likely to bring valuable information."

For instance, what are people talking about? What we've done is highlight the 40 threads that got the most number of messages in this period--day, week, month, year. And we'll say, "Here are 40 really big threads." How do you know those are good? We're not sure they were good, but these were the things that got people really excited and engaged in this newsgroup. That's one vector.
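The "biggest threads" view can be sketched in a few lines of Python. This is an illustration only, assuming each message in the chosen period is available as a hypothetical (thread_id, author) pair; the function name and input format are not taken from NetScan.

```python
from collections import Counter


def biggest_threads(posts, top_n=40):
    """Return the top_n (thread_id, message_count) pairs, largest first.

    posts: iterable of (thread_id, author) pairs, one per message posted
    during the period of interest (a hypothetical input format).
    """
    counts = Counter(thread_id for thread_id, _author in posts)
    return counts.most_common(top_n)


posts = [("t1", "ann"), ("t1", "bob"), ("t2", "ann"),
         ("t1", "cy"), ("t3", "bob"), ("t3", "bob")]
print(biggest_threads(posts, top_n=2))
```

As Smith notes, a high message count shows only that a thread engaged people, not that its content is good.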

But what about the guy who gets his printer fixed in two messages?
And you can legitimately argue that. "What about small threads of high value? How can you help me find them?" The answer is that we are, by leveraging latent structural data that is itself a product of collective behavior. You have lots of individuals working on their own. If there were only one person writing Web pages, Google wouldn't work. But Google Groups doesn't do what we do to Usenet. We're doing something useful to Usenet. We're not yet a search engine, we're a research project. And we will eventually be doing things related to the full text of the message.

Let's look at the individual who posts to a list. Does he show the pattern of participation over time that is an indicator of a valuable contributor? The question you should raise is, "What do you mean by value?" One man's flame warrior is another man's poet. It's not for us to tell you. But we do give you tools to sort patterns of difference.

Let me tell you how to find someone who gives really good technical support answers using our author tracker. It's a way to slice a vector into the content space that measures how dedicated are the people to this newsgroup. Basically, it asks, "Are you a regular?"

And what will that indicate?
Regulars are value contributors. But you could say, "You are sorting people by--and we do--how many days they come back." For example, you go into some of our tech support newsgroups, and you'll find that there are people who have contributed every day in the month. OK, those are regulars. But how do you know they have value? It's not just the number of days you come back. There are three other metrics, which tend to be ratios. One is the ratio of replies: How many times did you reply to someone else, or start a thread? Spammers may show up every day, but they don't reply. With a very low reply-to-post ratio, I would say that that is a person who starts a lot of conversations but never replies to anyone else, and it's probably a spammer. Showing up every day is not enough--you have to respond to other people. It's also thread-to-post. How many threads did you touch, how many messages did you write? If you wrote 10 times, all into one thread, that's a low ratio. You have a high conversational concentration.
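The author metrics Smith describes here (days active, reply-to-post and thread-to-post) can be approximated as follows. This is a sketch under stated assumptions, not NetScan's implementation: the Post record, its field names and the exact ratio definitions (replies divided by posts, distinct threads divided by posts) are hypothetical readings of the interview.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    """One message; all field names are assumptions made for this example."""
    author: str
    day: int          # day of the month the message was posted
    thread_id: str
    is_reply: bool    # False when the message starts a new thread


def author_metrics(posts, author):
    """Compute per-author activity metrics, or None if the author never posted."""
    mine = [p for p in posts if p.author == author]
    if not mine:
        return None
    total = len(mine)
    return {
        "days_active": len({p.day for p in mine}),
        # Share of the author's messages that answer someone else.
        "reply_to_post": sum(p.is_reply for p in mine) / total,
        # Distinct threads touched per message written.
        "thread_to_post": len({p.thread_id for p in mine}) / total,
    }


# Ten replies, all in one thread, spread over two days.
posts = [Post("sam", day=1 + i % 2, thread_id="t1", is_reply=True) for i in range(10)]
print(author_metrics(posts, "sam"))
```

The ten-messages-in-one-thread case from the interview comes out with a thread-to-post ratio of 0.1, the "high conversational concentration" Smith mentions.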

Is that good or bad?
I'm a social scientist--I don't know the difference between good and bad, only the difference between difference. Do I like flame warriors? Or don't I? A high reply-to-post indicates a flame warrior, because they tell you you're an idiot and they put all their messages into a few threads--so they also have a low thread-to-post ratio.

If you want to find the answer person, flip that ratio around. They differ from the flame warrior in the following way: Both show up every day, and both reply. The answer person answers a post once or twice, then moves on. We've seen people post 500 messages in one week in one thread. If you have that much time on your hands--it's not to say that it's a good thing or a bad thing, but a different thing. We give you the opportunity to say, "I just came here because I can't print." We will guide you to the very real group of people who are dedicated, for whatever reason, to not just computer technology, but answering questions about knitting, horseback riding, dogs--you name it. And the way to do that is to start looking at the social accounting metadata about authors.
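To make the contrast between spammer, flame warrior and answer person concrete, here is a toy Python classifier over the author metrics discussed in the interview. Every threshold and label below is a hypothetical value chosen for illustration; the interview gives no cutoffs, and Smith stresses that the tools sort patterns of difference rather than judge good or bad.

```python
def classify(days_active, reply_to_post, thread_to_post,
             min_days=20, reply_low=0.2, reply_high=0.8,
             thread_low=0.2, thread_high=0.5):
    """Label an author from activity metrics; all thresholds are made-up defaults."""
    if days_active < min_days:
        return "occasional poster"       # not a regular in this period
    if reply_to_post <= reply_low:
        return "possible spammer"        # shows up often, rarely replies
    if reply_to_post >= reply_high and thread_to_post <= thread_low:
        return "possible flame warrior"  # many replies packed into a few threads
    if reply_to_post >= reply_high and thread_to_post >= thread_high:
        return "possible answer person"  # replies once or twice, then moves on
    return "regular"


print(classify(days_active=30, reply_to_post=0.05, thread_to_post=1.0))
print(classify(days_active=30, reply_to_post=0.95, thread_to_post=0.05))
print(classify(days_active=30, reply_to_post=0.90, thread_to_post=0.70))
```

Fixed cutoffs like these would need tuning per newsgroup; the sketch only shows how the ratios separate the three patterns Smith describes.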

So could all of this ultimately add up to a better search engine?
If things go well, we'll have a better search engine. This remains early, initial research, but our results look promising. Reranking results based on social histories does do a better job, and I do believe we will deliver interfaces that will find people who are debaters, fine, but also those who are answer people... It turns out that people have a lot to give each other. There's a lot of knowledge to share, and 2 percent of every population is motivated to be a knowledge sharer.

Most of us have to rely on signs or symbols that suggest a person is reliable. With doctors you have their diplomas, the way the office looks, and most important, who referred you--these are all indicators that we rely on. We are trying to create analogous tools for online environments where that data is latent, is not manifest in the interfaces visibly.

When you talk about a reputation system, I'm reminded of the eBay system.
We're similar but different--eBay is an explicit feedback system, and we are an implicit feedback system. With eBay, buyers rate sellers, and sellers rate buyers, after they conduct a transaction. It's what people say about you. But there are real problems with this--most of all inflation, the "Beverly Hills-adjacent" problem. If you read the L.A. real estate section, everything is "Beverly Hills-adjacent." So there is this tendency to inflate. There have been empirical studies of reputation ratings at eBay that suggest that just going by reputation ratings at eBay is not an indication that you're not going to get a fraudulent transaction.

Tell me about the AURA (Advanced User Resource Annotation) project.
AURA is about extending NetScan: "What if you could use NetScan with a pocket computer and attach threads to things?" We use the Toshiba e740 and a Compact Flash bar-code reader, run AURA software, and can walk up to any bar-coded object, any ISBN-coded object, scan it, and the device brings back information about that object. We imagine being able to walk up and down the aisle of a grocery store and have a handheld computer rate everything with a green light, a red light, a skull and crossbones.

In Hong Kong, during the height of the SARS outbreak, there was a system that could tell you which buildings had had confirmed SARS cases. Now that's a reputation system.

It's easier to do this with products than with, say, people.
People are one thing, but objects--all the books on my shelves, all the food in my kitchen, the artworks in the hallway--we at Microsoft have bar-coded every one of them. AURA is going to become a navigation tool. You can print a bar code for a penny and slap them on things. Which we do--and then Facilities comes along and scrapes them off.

It seems that once Microsoft starts tracking the behavior of individuals, you're asking for trouble. What about privacy?
I think it's a very important thing. And we have built NetScan to protect what I think are legitimate claims for privacy. Like a Net spider, NetScan takes publicly accessible documents off the Internet, and it respects metadata that says "Leave me alone!" There is the robots.txt file that says, "You can look at this but not that." With Usenet there is one that says "Leave my messages alone," and we respect that. We will not store your messages if you put that in them.

Couldn't a spammer just put that in his or her messages, so you wouldn't be able to identify them as a spammer?
That's a possibility, and that's something we would have to respect. But the system still would not fail, because a person with no reputation is a person who has a reputation. "Let me tell you about the people who the system has shown to have value." We're about letting the cream float to the top and not about letting the other stuff sink.

How can you reassure someone who might be concerned that it's not such a good idea for computers to be keeping track of our belongings and our whereabouts?
I'm not sure, but we're leaking data all over the place now. And on the one hand, that has utility for other people. On the other, there's a privacy risk. In some ways, consider us a form of performance art. Would you like to see you? This is potent. We accept that and hope we can offer people good prophylactics against loss of privacy. And that may mean keeping multiple IDs and e-mail addresses. Ultimately we may have to fragment our identities.
