September 29, 2004 11:58 AM PDT
IBM's 'Marvel' to scour Net for video, audio
Researchers at Big Blue are attempting to create a search engine, code-named Marvel, that will retrieve video and audio clips that, for the most part, can't easily be found on the Internet today.
Ideally, a person in the future will be able to click on a sample shot of, say, a presidential debate, or describe a scene ("two guys, podiums"), and get back relevant clips from the thousands of hours of audio and video that are generated by broadcasters, film studios and, conceivably, individuals every year.
Marvel, a prototype search technology from IBM, can pluck sought-after scenes from thousands of hours of video.
Search is going way beyond text to images, audio and film. This will greatly expand the information that can be found on the Internet.
Though current search engines like Google and Yahoo can serve up video clips or images, they really aren't searching on the images contained in the files. Instead, they rely on the text attached to the bottom of the files, and thus they search only the small number of files that have been properly identified.
"To be able to index the content now requires manual labeling of the content," said John R. Smith, senior manager of intelligent information management at IBM Research. "We're trying to index content without using text or manual annotations."
Manual labeling simply takes too much effort. Thirty minutes of video footage might take five hours to parse and classify.
Worse, the information that needs to be classified is exploding. The How Much Information? survey conducted by the University of California at Berkeley determined that television stations worldwide produced about 123 million hours of total programming in 2002. Of the total, 31 million hours represented original programming, which translates to 70,000 terabytes of data. That doesn't include video from security cameras or home movies.
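A back-of-the-envelope calculation puts those figures in perspective. The sketch below combines the labeling rate quoted above (five hours per 30 minutes of footage) with the Berkeley numbers. The 2,000-hour working year and the decimal terabyte (1,000 gigabytes) are assumptions made for illustration, not figures from IBM or the survey.

```python
# Figures quoted in the article.
LABEL_HOURS_PER_VIDEO_HOUR = 5 / 0.5   # five hours of labeling per half hour of footage
ORIGINAL_HOURS_2002 = 31_000_000       # hours of original TV programming in 2002
ORIGINAL_TERABYTES = 70_000            # data volume attributed to those hours

# Assumptions for illustration only (not from the article).
ANNOTATOR_HOURS_PER_YEAR = 2_000       # one full-time annotator
GIGABYTES_PER_TERABYTE = 1_000         # decimal terabytes

# Effort to hand-label one year of original programming.
labeling_hours = ORIGINAL_HOURS_2002 * LABEL_HOURS_PER_VIDEO_HOUR
annotator_years = labeling_hours / ANNOTATOR_HOURS_PER_YEAR

# Average data rate implied by the survey's totals.
gigabytes_per_hour = ORIGINAL_TERABYTES * GIGABYTES_PER_TERABYTE / ORIGINAL_HOURS_2002

print(f"labeling hours needed: {labeling_hours:,.0f}")
print(f"annotator-years:       {annotator_years:,.0f}")
print(f"implied data rate:     {gigabytes_per_hour:.2f} GB per hour")
```

Under those assumptions, hand-labeling a single year of original programming would take roughly 310 million hours, or about 155,000 annotator-years.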
In contrast to manual labeling, Marvel is designed to automatically categorize (and subsequently retrieve) clips using modifiers like "outdoor," "indoor," "cityscape" or "engine noise" that describe the action in the clip.
The Marvel research team, which is working on the project with libraries and a few select news organizations, such as CNN, showed off the first prototype at a conference at Cambridge University in late August. The prototype system can scan through a database of more than 200 hours of broadcast news video and use 100 different descriptive terms to classify and identify scenes. IBM hopes to come up with a list of 1,000 descriptive labels by April.
A query takes about two to three seconds. Marvel is based on the MPEG-7 data format, but it can search on any standard video format. (IBM has posted examples of some search results online.) IBM has not discussed how Marvel could be turned into a product, but releasing it for use inside the television industry seems a more likely first step than promoting it for consumers.
Big Blue is one of a number of institutions attempting to push the boundaries on retrieval technology. Purdue University showed off a search engine earlier this year that will search on a 3D sketch. Others, meanwhile, are working on software that will more efficiently search on items in a limited range of topics, such as art and antiques.
Marvel largely relies on a technology called support vector machines, pioneered by Vladimir Vapnik at AT&T about a decade ago. In this type of artificial intelligence, a computer learns to assign the equivalent of a yes or no value to a piece of data. In other words, if the computer is supposed to distinguish between an indoor and an outdoor scene, trees in a shot could well prompt the computer to put the clip in the outdoor bucket.
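The yes-or-no learning described above can be illustrated with a toy example. What follows is a minimal sketch of a linear support vector machine trained by subgradient descent on the hinge loss. It is not IBM's code: the two features (a "greenery" score and a "sky" score, both centered on zero), the hand-made training clips and the function names are all hypothetical.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit weights w and bias b by batch subgradient descent on the
    L2-regularized hinge loss. Labels must be +1 or -1."""
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]   # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                # misclassified or too close to the boundary
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def classify(w, b, x):
    """Assign a hard yes/no label: which side of the boundary is x on?"""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "outdoor" if score >= 0 else "indoor"


# Hypothetical training clips: (greenery score, sky score), centered on zero.
outdoor = [(0.8, 0.6), (0.6, 0.9), (0.9, 0.4)]
indoor = [(-0.8, -0.6), (-0.6, -0.9), (-0.9, -0.4)]
samples = outdoor + indoor
labels = [1] * len(outdoor) + [-1] * len(indoor)

w, b = train_linear_svm(samples, labels)
print(classify(w, b, (0.7, 0.7)))    # a clip with lots of greenery and sky
print(classify(w, b, (-0.7, -0.7)))  # a clip with little of either
```

The classifier returns only a bucket, not a degree of belief, which is the behavior the article attributes to this family of techniques. A real system would work with far more than two features.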
"It is a statistical technique that looks to define boundaries between concepts," Smith explained.
Most other search engines use a form of Bayesian networks that provide a spectrum of probable to improbable answers. Hypothetically, a Bayesian image search would retrieve a shot of a hotel lobby that contained potted trees, whereas a search based on support vector machines might overlook it, having relegated trees to the outside category.
Unfortunately, even short video clips are crammed with data. The Marvel group has identified 166 different dimensions for searching by color, and that list was honed from a much larger list of possible color dimensions. "The traditional approach would have a hard time in this space," said Smith.
To compensate, Marvel will perform a multimodal search, scanning both the audio and video tracks. Some of the early results are promising. Smith and his group, for instance, conducted a search for rocket launches. By searching only on video, the system retrieved rocket scenes, but also sky shots, planes, helicopters and similar clips. Searching on audio brought up airplanes and crowd noise.
When relying on both audio and video, the system retrieved 70 news clips on rocket launches.
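IBM has not said how Marvel combines the two tracks, but the effect Smith describes can be sketched with a simple "late fusion" rule: score each clip separately on video and on audio, then multiply the scores so that a clip ranks highly only when both agree. The clip names, the confidence scores and the 0.6 cutoff below are invented for illustration and are not Marvel results.

```python
# Hypothetical per-clip confidence scores (0 to 1) for the concept "rocket launch".
clips = {
    "rocket_launch": {"video": 0.9, "audio": 0.9},
    "clear_sky":     {"video": 0.7, "audio": 0.1},
    "helicopter":    {"video": 0.7, "audio": 0.3},
    "airplane":      {"video": 0.7, "audio": 0.7},
    "crowd_noise":   {"video": 0.1, "audio": 0.7},
}


def retrieve(clips, score_fn, threshold=0.6):
    """Return the names of clips whose score meets the threshold, sorted by name."""
    return sorted(name for name, s in clips.items() if score_fn(s) >= threshold)


# Single-modality searches pull in look-alikes and sound-alikes.
video_only = retrieve(clips, lambda s: s["video"])
audio_only = retrieve(clips, lambda s: s["audio"])

# Assumed fusion rule: the product of the two scores, so both must be high.
fused = retrieve(clips, lambda s: s["video"] * s["audio"])

print("video only:", video_only)
print("audio only:", audio_only)
print("fused:     ", fused)
```

With these made-up scores, the video-only search returns the rocket along with sky, helicopter and airplane clips, the audio-only search returns the rocket along with airplanes and crowd noise, and only the rocket launch survives the fused search.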
Although Marvel will ideally be able to automatically classify video and audio files as it sees fit, the project is currently in a kind of manual labor phase. IBM has formed a committee with CNN, the BBC and organizations like the Getty library to gather a corpus of video files. By April, the group hopes to have a list of 1,000 terms for classifying clips. Some will be relatively generic descriptors ("landscape") while many could be specific ("tennis," "basketball" and so on).
"One thousand will not exhaust the whole semantic space," Smith said.
A full-fledged, functional Marvel-based search engine may not come out for another three to five years.