Microsoft aims to build a better thesaurus
A team of researchers at Microsoft is looking to beat Roget at his own game.
Aiming to build a better thesaurus, the Writing Assistance project within Microsoft's research unit is tapping techniques developed to translate from one language to another.
Although thesauri are good at finding lots and lots of synonyms, they require the user to pick the right one because they aren't very good at understanding the context of what is being said. That's where the experience from doing machine translations comes in.
"We've taken the actual translation tables...and what we've done is we've taken those and said if a word in Chinese maps to two different English words maybe those two words are synonyms with some probability," said Christopher Brockett, a computational linguist and one of the Microsoft researchers leading the project.
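The idea Brockett describes can be illustrated with a small sketch. The code below is not Microsoft's implementation; it is a toy example that assumes two hypothetical translation tables (English to Chinese and Chinese to English, with made-up probabilities and pinyin stand-ins for the Chinese words) and scores two English words as likely synonyms when they share a Chinese translation.

```python
from collections import defaultdict

# Hypothetical toy translation tables with invented probabilities.
# These are illustrative assumptions, not data from Microsoft's system.
# P_ZH_GIVEN_EN[english][chinese] = P(chinese | english)
P_ZH_GIVEN_EN = {
    "big": {"da": 0.8, "zhongyao": 0.2},
    "large": {"da": 0.9, "guang": 0.1},
    "important": {"zhongyao": 1.0},
}

# P_EN_GIVEN_ZH[chinese][english] = P(english | chinese)
P_EN_GIVEN_ZH = {
    "da": {"big": 0.5, "large": 0.5},
    "zhongyao": {"important": 0.75, "big": 0.25},
    "guang": {"large": 1.0},
}


def pivot_synonyms(word):
    """Rank candidate synonyms of `word` by pivoting through Chinese.

    score(candidate) = sum over pivots of P(pivot | word) * P(candidate | pivot)
    """
    scores = defaultdict(float)
    for pivot, p_pivot in P_ZH_GIVEN_EN.get(word, {}).items():
        for candidate, p_candidate in P_EN_GIVEN_ZH.get(pivot, {}).items():
            if candidate != word:  # a word is not its own synonym
                scores[candidate] += p_pivot * p_candidate
    # Highest-scoring candidates first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for candidate, score in pivot_synonyms("big"):
        print(f"{candidate}: {score:.2f}")
```

With these invented numbers, "large" ranks above "important" as a synonym for "big" because more of the probability mass flows through the shared translation. A real system would presumably also weigh the surrounding context, which this sketch ignores.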
The approach has two key benefits over a static thesaurus. First, it can handle phrases as well as single words. Second, it can draw on the context in which a phrase is used.
Brockett plans to show off a prototype of the tool next week at TechFest, Microsoft's annual internal science fair. It's just one of dozens of projects that will be shown as part of an effort to expose Microsoft's business units to the work being done in Microsoft's research labs. (Check back next week for CNET's on-the-ground coverage of the event, which kicks off Monday night at Microsoft's campus in Redmond, Wash.)
TechFest is sort of like "The Dating Game" for Microsoft's research and product development arms. Research teams at Microsoft set up booths, somewhat like a high-school science fair, while product teams shuffle through looking for something that might give their efforts a leg up on the competition.
For the public, TechFest can also offer a glimpse at future product directions. For example, researcher Andy Wilson showed off a number of surface computing projects in the years leading up to the debut of Microsoft's Surface product.
As is the case with most of the projects, the thesaurus effort is still in its infancy.
"We're still working on the algorithms and how much work we give to the language pairs," Brockett said. "We have to get the quality up. There are usability issues that have to be looked into."
Over time, though, Brockett hopes the technique could be used to effectively translate whole sentences. Microsoft has a demonstration of that up on its Web site, but Brockett acknowledges such a treatment shows both the potential and the current limitations of the technology.
But would-be high-school plagiarists beware. Yes, the technology could someday translate the whole Wikipedia article for you, but it would likely translate the article the same way for all your classmates as well. And plagiarism detection software is evolving along with the science of machine translation.
As for the thesaurus itself, the technology would be a natural fit for Word, which already has a built-in traditional thesaurus. But the technology could also help Microsoft in another key area: search.
That's because while search engines are good at finding things like names that have just one form, they have a harder time finding expressions that can be phrased in multiple ways.
That's less of an issue when searching across the whole Web. For example, searching for "Who shot Abraham Lincoln?", "Who killed Abraham Lincoln?" and "Who assassinated Abraham Lincoln?" all direct you to a page about John Wilkes Booth.
However, when it comes to searching smaller universes, such as a company's intranet, that might not be the case.
"You might not find it if the words are different," Brockett said. In such cases, automatically searching using similar phrases might boost the likelihood of finding a result.
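A minimal sketch of that kind of query expansion follows. It is an illustration only, not how Microsoft's search products work: the paraphrase table, the two "intranet" documents, and the function names are all hypothetical, and matching is a simple all-terms-present check.

```python
import re

# Hypothetical paraphrase table; a real system would derive this from
# translation data rather than a hand-written dictionary.
PARAPHRASES = {
    "shot": ["killed", "assassinated"],
    "killed": ["shot", "assassinated"],
    "assassinated": ["shot", "killed"],
}

# Hypothetical small document collection standing in for an intranet.
DOCUMENTS = {
    "memo-1": "John Wilkes Booth assassinated Abraham Lincoln in 1865.",
    "memo-2": "Abraham Lincoln delivered the Gettysburg Address in 1863.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_query(query):
    """Return the original query terms plus one variant per paraphrase."""
    terms = tokenize(query)
    variants = [terms]
    for i, term in enumerate(terms):
        for alternative in PARAPHRASES.get(term, []):
            variants.append(terms[:i] + [alternative] + terms[i + 1:])
    return variants


def search(query, expand=True):
    """Return IDs of documents containing every term of any query variant."""
    variants = expand_query(query) if expand else [tokenize(query)]
    hits = []
    for doc_id, text in DOCUMENTS.items():
        words = set(tokenize(text))
        if any(all(term in words for term in variant) for variant in variants):
            hits.append(doc_id)
    return hits


if __name__ == "__main__":
    print(search("shot Lincoln", expand=False))  # no document uses "shot"
    print(search("shot Lincoln"))                # expansion finds memo-1
```

In this toy collection, the literal query "shot Lincoln" matches nothing, while the expanded query also tries "killed" and "assassinated" and so finds the memo that uses a different word for the same event.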
During her years at CNET News, Ina Fried has changed beats several times, changed genders once, and covered both of the Pirates of Silicon Valley. These days, most of her attention is focused on Microsoft.






- by TomKnorr February 25, 2009 1:21 PM PST
- It's about time that Microsoft is putting the pieces together, fashionably late as usual. We are talking ontology based machine translation here. I wonder if they find all the other gems that are in this technology.
We have been working on machine generated textual descriptions from conceptual knowledge for several years now. No Wikipedia, all presented information is machine generated and presented in any language, from the concept knowledge, not a translated word list.
We are also talking of user interfaces that "know", that are conceptually aware of what the user has selected on the screen, that "know" what information is missing, that learn when a new fact is added to a concept category and just about start asking questions themselves. We are talking about user interfaces that can be conferenced with native speakers all over the world showing the same subject pages in their native language at the same time.

The question of "who shot PA" is obviously splitting into 2 (at least) alternate translations - one will not make sense in the remainder of the story. You cannot just translate a sentence and leave it stand alone. The MT will eventually have the knowledge of the whole story as a concept - the idea of the story - and will be able to rephrase the idea in the target language. This will not create you a "literal" translation in the sense of what a human translator would create but it will grasp the essence of the original and present it in the target language - and (our system) at some point will know more about the original's subject than the originator of any story to translate.
Welcome to the club, this is what the next generation internet is all about.