MOUNTAIN VIEW, Calif.--Google's Mike Cohen won't be satisfied until anyone who wants to talk to their computer can do so without laughing at the hideous transcription or sighing in frustration.
Cohen, a leading figure in speech technology circles, heads up Google's efforts to advance the science of speech technology while applying it to as many products as possible. "Google's mission is to organize the world's information, and it turns out a lot of the world's information is spoken," Cohen said, in a recent interview with CNET about the search giant's speech ambitions.
Google is attempting to produce voice-recognition technology that fits in with its view that the computing universe is shifting toward mobile devices and browser-based applications. That is, easy-to-use software that does the heavy lifting in the data center so it can run over the Internet on mobile devices with limited hardware.
Computer speech recognition seems like it has been five to 10 years away for decades. Indeed, the electronics and computer industries have been chasing the goal of voice-directed computers for nearly 100 years, ever since a simple wooden toy dog called Radio Rex, released in 1911, first captivated children and adults by shooting out of his doghouse (at least some of the time) when his owners called for "Rex!" (Cohen owns one of the few remaining toys.)
Huge advances have obviously been made since Rex's day, yet few of us use our computers like HAL in "2001: A Space Odyssey" or KITT, the computerized car in "Knight Rider." Cohen, however, believes the industry is about to silence the jokes about amusingly garbled voice mails as speech recognition models grow more sophisticated, engineers pack mobile computing devices with more powerful hardware, and users start to realize that performance has made great strides.
"The goal is complete ubiquity of spoken input and output," Cohen said. "Wherever it makes sense, we want it to be available with very high performance."
They can hear you now
Cohen, who founded speech technology company Nuance Communications before coming to Google in 2004, has been working in this field for 26 years. At Google, his job has been to apply cutting-edge speech recognition and synthesis technology to Google services, starting with GOOG-411 in 2007 and voice search in 2008.
At this point, most leading speech-technology systems have settled on a basic architecture, Cohen said. The first step involves analyzing incoming sound waves in 10-millisecond batches, identifying subtleties in pitch and range to create a digital representation of those sounds. Then comes the hard part: taking those batches and attempting to match them against the billions of combinations of sounds that make up words in the English language. (The process is the same for other languages, but the number of sound combinations differs from language to language.)
"It's fundamentally a big statistical model," Cohen said. Google's method and other speech-recognition models analyze the sounds for their acoustic quality to identify "phonemes," (a basic sound unit of a word, such as "ooo" in "Google"), how those phonemes form individual words, and how grammar informs the construction of those words into sentences.
In terms of its basic approach, Google's not doing anything differently from others who implement speech technologies. Nuance's Dragon NaturallySpeaking enjoys quite a following among those interested in this area. Microsoft and Apple have spent years and plenty of money researching voice-recognition technology for their desktop operating systems. Start-ups like Vlingo are putting such technology on mobile computers.
Naturally, however, Cohen thinks Google has a few advantages.
Time and data
Speech recognition is an extremely compute-intensive problem, with a lot of resources required to decode even simple voice commands or requests in seconds. Fortunately for him, Cohen happens to work for a company with one of the world's largest reservoirs of computing resources.
And as everyone knows, Google has accumulated a vast amount of data on human speech patterns, both from the queries people type into its search engine every day as well as the more than 10 million books it has digitized as part of its Google Books Search project.
The combination allows Google to manipulate very large data sets when it is processing speech recognition queries, and that is "one of the reasons that we've made some big advances," Cohen said. He thinks this ability to crunch huge amounts of new data and verify it against older data lets Google deliver more accurate results more quickly.
Google's most visible work has shown up in its Android mobile operating system, where Android users can click on a little microphone button on the home search page to use their voices to search the Web or launch certain applications. At an event earlier this month, Google mobile product managers said Android users now place about one of every four search queries using the microphone.
But Google has also released technology that gives YouTube users a way to automatically caption their videos. Its Google Voice application transcribes voice mails left on Google Voice accounts into text, occasionally with hilarious results. And Google told The Times in the U.K. that it's working on a "translator phone" that would let users speak a sentence into the phone and have a translated version repeated over a speaker.
Few would argue, however, that Google or anyone in the industry has achieved truly reliable speech-recognition technology. What's holding the company back?
The most basic issue at the moment is simple background noise, Cohen said. Mobile users on the go face interference from wind, background conversations, or traffic noise that can distort the sounds captured in that first part of the recognition systems. Better microphones could help, but the systems have to get better about dealing with such interference, Cohen said.
Another major problem is the complexity of anticipating what people might say and accurately transcribing it into text. This isn't just about accents or dialects (Cohen, in a wry Brooklyn accent, recalled a speech technology professor who warned him that no one speaks correctly), but that nicknames, slang, and rushed or incomplete sentences can confuse the smartest algorithms.
Google has noticed that people use voice search the way they search on Google, speaking in keywords and phrases like "restaurants in Palo Alto." That makes it easier to predict what a collection of sounds means in a search context, since it can cross-reference the speech it has transcribed against a database of search queries. Voice mails, on the other hand, are almost completely unpredictable, especially because Google does not maintain a similar database of voice mails due to privacy concerns, he said.
So while plenty of challenges remain, there's a sense both inside and outside Google that speech technology is on the cusp of becoming something people expect rather than a feature that a few devotees covet. It may take some getting used to, but we're already seeing people abandon computer input methods designed for another era--the keyboard and mouse--in favor of touch screens and voice commands.
It's not about "killing" an older input method; it's about providing alternatives. "You just want people to assume that if they feel like talking, they can, and if they feel like typing, they can," Cohen said.