Roger Ebert's search to recapture his lost voice uncovered a company with a unique technology.
When the famed film critic needed to find a way to communicate after losing his voice to cancer surgery, he turned to text-to-speech (TTS) software that speaks whatever he types. But the TTS software he initially tried sounded too robotic and computerized. He wanted a voice that sounded like him. That's when he discovered CereProc, a Scottish company that builds electronic voices. Using someone's audio recordings, CereProc's technology can stitch together an entire digital voice that sounds like the actual person.
To create a full and accurate digital voice, CereProc typically brings people into its professional recording studio to read specific voice scripts for several hours. That audio is carefully recorded and controlled to make sure it's as clean and consistent as possible. But Ebert had only the audio from commentaries he made for several films on DVD. The challenge CereProc faced was stitching together his voice from audio that was limited in length and poor in quality.
Ebert's new voice made its first TV appearance on Tuesday's Oprah Winfrey show where the film critic and his wife Chaz spoke with Oprah and appeared in a taped segment revealing their life at home. Hearing her husband's voice for the first time in several years brought tears to Chaz and smiles to Roger.
CereProc creates and sells a variety of different voices with various accents, dialects, and personalities. People use CereProc's voices and text-to-speech software for a variety of reasons. Some, like Roger Ebert, have lost their own ability to speak. Many people use it to learn English and other languages. Some want to capture a local dialect before it dies out. I use TTS software as a proofreading and editing tool to listen aloud to my own writing.
To learn more about CereProc's software, I recently spoke with Chris Pidcock, the company's chief voice engineer.
Q: Chris, how do you actually create someone's voice using your technology?
Pidcock: When we build a voice for ourselves, we have a special script that covers lots of the sounds of English. It's quite rich and detailed. We get people into a studio and spend 15 hours or so recording them. But afterwards, the voice creation process can be performed on audio and text from anywhere. That's how the Roger Ebert project came about. He got in touch with us because he saw our little George Bush talking head. That was a good example because obviously we couldn't get [Bush] to sit in our recording studio for 15 hours. So we used his weekly radio address, which had both the text and audio on the White House Web site. We downloaded it and put it together, and out came the synthetic George Bush.
We take that audio data and send it off for transcription and then segment it into very small pieces. The technique is similar to the one used by AT&T's NaturalVoices of selecting different pieces, or phonemes, of the audio and stitching them back together in clever ways. The trick is stitching them back together so they don't sound like they came from different contexts and different words. And obviously now Roger can say anything he likes--he's not restricted to the words he used in his DVD commentaries.
Q: It sounds like that process would be fairly time-consuming--taking apart the words into their individual phonemes and putting them back together to form a whole vocabulary. How challenging is that?
Pidcock: It's pretty tricky to do the initial segmentation, to chop things up into the right pieces. And we've put a lot of work into automating that as much as possible. In the old days, we used to have to check many hundreds of thousands of boundaries between sounds to make sure they were correct. We now do nearly all of it automatically. So once we have the audio and the text, we can pretty much put a voice together overnight. Generating the speech goes quite quickly; there are clever algorithms we use. We usually have 50 to hundreds of examples of each sound, so you have to pick an optimal path through many hundreds of thousands or millions of different options. But that's the common technique used in speech recognition as well.
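The path-picking Pidcock describes is the core of unit-selection synthesis. As a rough sketch (an illustration of the general technique, not CereProc's actual code), each target sound has many recorded candidate units, each scored by a target cost (how well the unit fits the context) and a join cost (how smoothly it concatenates with its neighbor), and dynamic programming finds the cheapest path through the candidate lattice:

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one unit per target sound, minimizing total target + join cost.

    targets: list of phoneme specs; candidates[i]: recorded units for targets[i].
    target_cost(t, u): how badly unit u fits target t.
    join_cost(u1, u2): how roughly u1 concatenates into u2.
    """
    # best[i][j] = (cost of cheapest path ending at candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            prev_cost, prev_j = min(
                (best[i - 1][k][0] + join_cost(candidates[i - 1][k], u), k)
                for k in range(len(candidates[i - 1]))
            )
            row.append((target_cost(targets[i], u) + prev_cost, prev_j))
        best.append(row)
    # backtrack from the cheapest final unit
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```

With toy numeric "units" and simple distance costs, the search favors candidates that both match their target and blend with their neighbors, which is exactly the tradeoff that keeps stitched speech from sounding like it came from different words.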
With Roger Ebert's voice, it's been more tricky because we tend not to use material that's more conversational. In normal conversation, people tend to stop and repeat themselves and laugh and cough. That's actually quite difficult because we have to painstakingly extract all the uhs and ums so that the acoustic material we get is real speech. With our recording talent in the voice studio, if they say um in the middle of a sentence, we make them do it again, which we can't do with Roger.
Q: What were some of the other challenges in creating Roger's voice?
Pidcock: Getting the audio has been a bit of a challenge, more for him than for us. For the DVD commentaries, we needed a version that was quite clean with just his speech. Obviously, the version on the DVD is usually mixed with the background audio of the film. So he's been trying to get people to dig out the original audio tracks, because there really isn't any way of stripping out the film soundtrack without also degrading the speech.
Once we got hold of his audio, we sent it out for transcription. That was quite tricky because it's difficult to spot all the ums and other audio disfluencies. So one of our challenges has been to try and automatically find these disfluencies. Also, a big problem with the audio we're getting is they're from different recording environments. One of the things we do when we record our own voices is keep the environment very, very consistent. We always use the same studio and microphone. We take photos of all the equipment so we can make sure all our levels are the same. But with Roger's DVD commentaries, they could have been recorded years apart in different environments--some recorded in his house, some recorded in a professional studio. It's a challenge to blend that audio in a consistent way.
Also, the way he speaks is more conversational in these commentaries than what we're used to. That means his speech is more varied. It's possible in the synthesis that we might try to stick in a little vowel from a totally different studio recorded 10 years later. So smoothing all that together is a lot more challenging than it is for the voices we record ourselves.
Q: Do you have typical text that people in your studio read from, similar to what we might see in a voice dictation program?
Pidcock: Yes, we have a big database of text, which we basically mine for combinations of sounds that are fairly rare in English. We end up with a lot of phrases with the word "oil" because in British English there aren't many contexts with "oy" in them. We end up with strange sentences like "The Omaha Oil company went down by 10 points today" because "ah" and "oy" don't go together very much. Over the years, we've been able to develop a really good script that tries to cover all the sounds we need in as much richness as possible in as little time as possible.
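Mining a text database for rare sound combinations, as Pidcock describes, can be thought of as a set-cover problem. A minimal sketch of one plausible approach (a greedy illustration under assumed data structures, not CereProc's actual tooling): repeatedly pick the sentence that adds the most not-yet-covered sound pairs (diphones), so the final script covers the needed sounds in as few sentences as possible:

```python
def diphones(phonemes):
    """Return the set of adjacent sound pairs in a phoneme sequence."""
    return set(zip(phonemes, phonemes[1:]))

def pick_script(corpus, max_sentences):
    """Greedily choose sentences that maximize new diphone coverage.

    corpus: list of (sentence, phoneme_list) pairs.
    """
    covered, script = set(), []
    for _ in range(max_sentences):
        # choose the sentence adding the most not-yet-covered diphones
        sentence, phones = max(corpus, key=lambda sp: len(diphones(sp[1]) - covered))
        gain = diphones(phones) - covered
        if not gain:
            break  # nothing left to gain; stop early
        covered |= gain
        script.append(sentence)
    return script
```

This is why the resulting scripts read strangely: a sentence earns its place by packing in rare pairs like "ah" followed by "oy", not by sounding natural.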
Q: How much recorded audio would you typically need from someone to create a voice?
Pidcock: For our voices, we use a minimum of 15 hours in the studio, compared with the Roger Ebert voice, which I think is only about four hours in the version we got. But we don't make them do all 15 hours at once. Usually, we do about three hours a day of recordings for a week.
Q: Do you hire professional voiceover people or are these just average people?
Pidcock: It depends. We produce custom voices for people as well. Sometimes if a customer wants a quite young-sounding voice, it's quite hard to get a professional, so we try to find talented amateurs. And often people who do amateur dramatics are quite good at voiceover work. We have a voice on the Web site called Sue who's from a particular area in Central England where they have a very strong accent. It's in Birmingham. That was actually a competition. They were trying to pick an example of this accent because it's kind of dying out. They went out on the streets of the town with a microphone and recorded people and played that on the local radio station and chose the voices they liked. And then we got the voice from that.
Q: So you're getting local dialects?
Pidcock: Yeah. With our Irish voice [Caitlin], we put some work into getting Irish place names local to Ireland. We did the same with the Birmingham voice and our Scottish voices. We try to find some local color.
Q: Are you mainly focused on English voices or have you branched out into any other languages?
Pidcock: We have partners in Germany and Spain. So we have Spanish and Catalan and German voices and also an Austrian/German dialect. And we're just finishing up French and Italian. So we've been proactive in developing more languages. We also recorded our French voiceover reading a lot of English. So her voice can be made fairly multilingual. Her English is actually so good that we're thinking of adding an English voice with a French voiceover, kind of a sexy French accent, which I think might be quite popular.
Q: How did the product get off the ground? How did your company start?
Pidcock: Edinburgh University is one of the top places in Europe for speech technology. And quite a long time ago, a speech synthesis system was written there, which is kind of embedded in Linux. It's called Festival. And that led to a spinoff from Edinburgh University called Rhetorical Systems, a company that kind of flowered briefly in the Internet boom and then crashed down. A few of us who work for CereProc worked at Rhetorical. And after that company folded, we started again with a more tightknit idea of building up the technology more gradually and hopefully more sustainably.
Q: Where do you see the technology going and what do you hope to achieve with it?
Pidcock: One thing we'd like to do in line with the system I've been talking about is have a Web service where anyone could log on and use their computer microphone to read a small number of sentences. And then it would give them a downloadable voice they can install on their computer that sounds exactly like them. That kind of thing would only be possible when this technology, called parametric synthesis, is onstream. We've also been working on trying to get a more emotional output into the speech. We have a project to create little animations or talking heads for in-car use. We put some work into creating an American voice that could sound happy or sad or irritated. And we were interested in seeing how that might affect an interaction between a person and their car. Although I'm not sure if you'd want your car to be angry with you.
Q: Roger Ebert's example is interesting if you relate it to other people who have lost their voice. But as you said, the challenge is finding enough recorded audio from them from the past.
Pidcock: Yeah, at the moment it's just too difficult or too expensive. But we are working on different techniques to enable us to build a voice from a smaller amount of audio. There are new text-to-speech techniques coming up that would actually make a model of the speaker. They don't work by chopping up bits of speech into small pieces and stitching them together. You actually train the model on the speaker's sounds so it can mimic those sounds. You can take a general male model of speech, say an American general male voice built from lots of different American male speakers. And you can adapt it to sound like Roger Ebert with quite a small amount of material, maybe half an hour or so.
The problem with these voices is that they don't sound very natural at the moment. They're a bit noisy and a bit buzzy, and they don't have the variation. In the intonation, they don't sound as natural. But potentially, they could be good for people who only have a small amount of audio. It'll still produce something that sounds like them. That's not ready for production. That's still something we're experimenting with.
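The adaptation Pidcock describes, in its simplest form, nudges a general voice model toward a target speaker in proportion to how much of that speaker's audio is available. A minimal sketch (an assumed MAP-style illustration, not CereProc's method): the general voice stores an average acoustic feature vector per sound, and each speaker sample pulls that average toward the speaker:

```python
def adapt(general_means, speaker_samples, weight=10.0):
    """Shift a general voice model toward a target speaker.

    general_means: {sound: feature vector} from many pooled speakers.
    speaker_samples: {sound: list of feature vectors} from the target speaker.
    weight: how strongly the general model resists being moved.
    """
    adapted = {}
    for sound, mean in general_means.items():
        samples = speaker_samples.get(sound, [])
        n = len(samples)
        if n == 0:
            adapted[sound] = list(mean)  # no speaker data: keep the general model
            continue
        sample_mean = [sum(col) / n for col in zip(*samples)]
        # interpolate: more speaker data pulls the result closer to the speaker
        adapted[sound] = [
            (weight * g + n * s) / (weight + n)
            for g, s in zip(mean, sample_mean)
        ]
    return adapted
```

With only a handful of samples the adapted voice stays close to the general model, which mirrors the tradeoff in the interview: half an hour of audio yields something that sounds like the speaker, but not yet as natural as a voice built from 15 studio hours.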
Updated 2:15 PST to correct spelling of Edinburgh and change company's location to Scotland.