July 16, 2008 4:00 AM PDT

ReCaptcha: Reusing your 'wasted' time online

by Stefanie Olsen

ZURICH, Switzerland--Chances are that if you've solved one of those distorted-word tests to secure an account with Facebook, Craigslist, or Ticketmaster, you've helped The New York Times inch a little closer to digitizing its entire print newspaper archive from 1851 to 1980.

How have you unwittingly helped the Gray Lady by wasting 10 seconds on a computer-generated word challenge? It's thanks to a year-old initiative called ReCaptcha, a play on the antispam tests known as Captchas (Completely Automated Public Turing Test To Tell Computers and Humans Apart): tests that people can pass but machines cannot.

People typically fill out Captchas so Web sites can verify that a human, rather than a spam bot, is behind the request for a new e-mail address, log-in, or membership. But with ReCaptchas, which are double-word tests, humans are also helping machines better recognize faded-ink or blurry words that have been digitally scanned from old newspapers or books--text that's difficult for a computer to recognize optically. That way, people will eventually be able to sift through print archives with a more intelligent search engine.

Luis von Ahn, assistant professor in the computer science department at Carnegie Mellon University, created ReCaptcha.

(Credit: Stefanie Olsen/CNET News)

In the last year, as many as 600 million people have completed at least one ReCaptcha on sites such as Twitter, LastFM, and Ticketmaster, which use the technology for free, according to ReCaptcha creator and Carnegie Mellon University assistant professor Luis von Ahn.

With all those helping hands, von Ahn expects that The New York Times digitization project will be finished by the end of 2009, at the latest. (About five months ago, The New York Times paid an undisclosed sum to von Ahn's CMU team to complete its project.)

"We're reusing wasted human cycles," von Ahn, 28, said while speaking at a robotics conference here recently.

The venture involves putting millions of eyes on words printed in roughly 47,000 newspapers, with various counts of pages. For example, before the turn of the century, The New York Times was about one-fourth the breadth it is today. It's doubled in size about every 50 years or so since its beginning in the 1850s, when it was published every day except Sunday. (The New York Times did not immediately respond to a request for comment for this story.)

Von Ahn's team is also helping the Internet Archive with the digitization of books through ReCaptcha, but it's doing that project gratis.

In fact, von Ahn, a recipient of the MacArthur Fellowship (or "genius award") in 2006 for his work as a computer scientist, only wants to aid projects that work for the good of humanity. His main work-related guilt, it seems, is that he helped invent Captchas in the first place (in 2000, so that Yahoo could fend off spammers). And that's only because he's calculated how much time people have wasted on the four- to six-character tests. He's estimated that people type 200 million Captchas every day around the world, for a collective total of roughly 500,000 human hours a day (at 10 seconds per puzzle).
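
As a rough check on that figure, the arithmetic can be sketched in a few lines of Python. The two inputs are von Ahn's estimates as quoted above; the function name is made up for illustration.

```python
# Inputs are the estimates quoted in the article.
CAPTCHAS_PER_DAY = 200_000_000
SECONDS_PER_CAPTCHA = 10


def daily_human_hours(captchas_per_day, seconds_each):
    """Convert a daily count of puzzles into hours of human effort."""
    return captchas_per_day * seconds_each / 3600  # 3,600 seconds per hour


hours = daily_human_hours(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
print(f"{hours:,.0f} human hours per day")  # prints "555,556 human hours per day"
```

The result lands a little above 500,000, which suggests the quoted number is a rounded one.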

But that lost time is nothing compared with the amount spent on games--another key focus for von Ahn. By the time the average American has turned 21, researchers estimate that he or she has spent about 10,000 hours playing video games--that's the equivalent of holding down a full-time job for five years. In 2003, players collectively spent 9 billion human hours on the game Solitaire. In contrast, building the Empire State Building took only 7 million human hours, or about as many hours as Solitaire players collectively racked up every 6.8 hours that year.
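
Both comparisons in that paragraph come down to simple division. The sketch below works them through under two assumptions that are not in the article: Solitaire play was spread evenly across the 8,760 hours of 2003, and a full-time job means 40 hours a week for 50 weeks a year.

```python
# Figures quoted in the article.
SOLITAIRE_HOURS_2003 = 9_000_000_000   # human hours played in 2003
EMPIRE_STATE_HOURS = 7_000_000         # human hours to build the tower
GAMING_HOURS_BY_21 = 10_000            # video game hours by age 21

# Assumptions made for this sketch only.
HOURS_IN_2003 = 365 * 24               # 2003 was not a leap year
FULL_TIME_HOURS_PER_YEAR = 40 * 50     # 40-hour weeks, 50 weeks a year

# Human hours of Solitaire played worldwide per clock hour.
solitaire_rate = SOLITAIRE_HOURS_2003 / HOURS_IN_2003

# Clock hours of worldwide Solitaire needed to match the building effort.
empire_state_in_solitaire_hours = EMPIRE_STATE_HOURS / solitaire_rate

# Years of full-time work equal to 10,000 hours of gaming.
full_time_years = GAMING_HOURS_BY_21 / FULL_TIME_HOURS_PER_YEAR

print(f"{empire_state_in_solitaire_hours:.1f} hours of worldwide Solitaire")  # 6.8
print(f"{full_time_years:.0f} years of full-time work")                       # 5
```

Read this way, the 6.8 figure is clock time: the world's Solitaire players spent an Empire State Building's worth of labor roughly every 6.8 hours.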

A slide from von Ahn compares the time people spend on games vs. the time spent constructing major physical structures.

(Credit: Stefanie Olsen/CNET News)

Such thoughts spurred von Ahn to create Games with a Purpose, or Gwap.com, a project designed to harness the time people spend having fun to solve bigger computational problems. (The field is known as human computation.) He developed the first of those games, the ESP Game, several years ago to tackle image labeling to improve Web search. The game asks two randomly paired people (on different computers) to describe the same image without any way to communicate. Within a time limit, the players must arrive at the same word for an image before moving on to another image.
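
The matching rule at the heart of the game can be sketched roughly as follows. This is a simplified illustration and not the game's actual code: it assumes the two players guess in alternating rounds, it ignores the time limit, and the function name is hypothetical.

```python
from itertools import zip_longest


def first_agreed_label(guesses_a, guesses_b):
    """Return the first word both players have typed, or None if they never agree.

    Guesses are processed round by round. A match counts as soon as one
    player's guess appears anywhere in the other player's guesses so far.
    """
    seen_a, seen_b = set(), set()
    for a, b in zip_longest(guesses_a, guesses_b):
        if a is not None:
            seen_a.add(a.strip().lower())
        if b is not None:
            seen_b.add(b.strip().lower())
        common = seen_a & seen_b
        if common:
            # If two words match in the same round, pick one deterministically.
            return sorted(common)[0]
    return None


# Neither player sees the other's guesses; "dog" becomes the image label.
print(first_agreed_label(["dog", "grass", "puppy"], ["animal", "Dog"]))  # dog
```

Because the players cannot communicate, a word they both reach independently is likely to be a fair description of the image, which is what makes the agreed word usable as a label.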

It's infectious. As many as 200,000 players have provided 50 million labels for images since the game was created, according to von Ahn. Some people play as much as 20 hours a week.

Normally, companies like Google or Yahoo would need to hire people to label the millions of images in their archives. But with only 5,000 people playing the ESP Game simultaneously, they could label all of Google's image archive within two months, he said. That must be why Google licensed the ESP Game from von Ahn and Carnegie Mellon University in 2006 to label its images.

Even though it would seem Google has completed its image labeling, it's really a never-ending project because of a constant influx of photos and people's changing perceptions.

For example, people's perceptions of celebrities like Britney Spears or political figures like George Bush morph over time. Just two years ago, labels for Britney Spears were as simple as "Britney" and "hot." But recently, they turned into "crazy," "shaved head," and "rehab." President Bush's tags have gone from "George" and "President," to "dumb" and "yuck."

Thanks in part to the success of the ESP Game, von Ahn and a team of 10 computer scientists at CMU have launched four new games to solve different artificial-intelligence problems. Gwap.com, introduced in May, is the umbrella site for all five games, which include the new Verbosity, Tag a Tune, Squigl, and Matchin. Since May, the site has attracted about 85,000 registered users.

Tag a Tune, for example, is much like the ESP Game, but for audio recordings. A player must figure out if he or she is listening to the same song as an opposing player by watching their descriptive guesses and making guesses of their own.

There's a 50 percent chance players are listening to the same song. That game would help describe the contents of audio recordings so that someone could eventually ask a search engine for a "happy song about rainy days," rather than using the exact song title. Squigl asks players to outline an object they see in a photo--a task meant to eventually further the field of computer vision.

Next up: von Ahn plans within the next three months to introduce a game that deals with labeling video clips, which would improve search over video archives. His team currently doesn't have any other licensees for its games, although it's easy to see a host of interested parties for audio, music, and video labels.

In a bit of procrastination of his own, von Ahn had been thinking about how not to waste time with games, and then Captchas, at least two years before he acted on a project to recoup energy spent on word tests. He's certainly seen some weird things since he helped get them started on Yahoo in 2000.

A slide from von Ahn illustrates the estimated time an average person spends on various activities. If you calculate the time an average American has spent solving Captchas, it might work out to be 1.9 seconds per day, according to von Ahn.

(Credit: Stefanie Olsen/CNET News)

HotorNot.com, for example, has shown prospective account holders images of nine women, from which they must pick the three who are "hot." Von Ahn said that through this exercise, a man met his wife on the site.

Spammers have also created so-called Captcha sweatshops to get around the tests. He said that they will hire people for an hourly wage of $2.50 and the average worker will solve about six word puzzles per minute. Even though Captcha sweatshops generate new jobs, von Ahn said he would rather put people's time to better use.
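
Those two numbers imply a price per solved puzzle. A quick sketch, using only the wage and solving rate von Ahn cited (the variable names are illustrative):

```python
# Figures von Ahn cited for Captcha sweatshops.
HOURLY_WAGE_USD = 2.50
PUZZLES_PER_MINUTE = 6

puzzles_per_hour = PUZZLES_PER_MINUTE * 60            # 360 puzzles an hour
cost_per_puzzle = HOURLY_WAGE_USD / puzzles_per_hour  # in dollars
cost_per_thousand = cost_per_puzzle * 1000

print(f"${cost_per_puzzle:.4f} per puzzle")             # $0.0069 per puzzle
print(f"${cost_per_thousand:.2f} per thousand puzzles")  # $6.94 per thousand puzzles
```

At well under a cent per puzzle, paying people to solve the tests is cheap enough to explain why such operations exist.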

"I started thinking about how you could direct people's efforts in a way that's good for humanity," he said.

Last year, von Ahn introduced the ReCaptcha free antispam system with a double-word test (six to eight characters each), which, it turns out, doesn't take people any longer than solving many single-word tests that mix characters, he said. With two words, the system can develop a confidence rating for the human by serving up one word the computer doesn't know, with another it does know.
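
The two-word idea can be sketched roughly as follows. This is a simplified illustration, not ReCaptcha's actual implementation: the article does not describe how answers are scored, so the exact-match check, the majority vote, and the `min_votes` threshold are all assumptions, and every name here is hypothetical.

```python
from collections import Counter, defaultdict

# Transcriptions collected so far for each word the computer could not read.
votes = defaultdict(Counter)


def check_response(control_answer, typed_control, unknown_id, typed_unknown):
    """Grade the known (control) word; if it is right, record the other answer.

    A correct control word suggests a human is typing, so the answer for
    the unknown word is kept as one vote toward its transcription.
    """
    passed = typed_control.strip().lower() == control_answer.lower()
    if passed:
        votes[unknown_id][typed_unknown.strip().lower()] += 1
    return passed


def best_guess(unknown_id, min_votes=3):
    """Return the leading transcription once enough people agree on it."""
    counts = votes[unknown_id]
    if not counts:
        return None
    word, n = counts.most_common(1)[0]
    return word if n >= min_votes else None


# Three people pass the control word and agree on the unknown one.
for _ in range(3):
    check_response("morning", "Morning", "img-1", "upon")
print(best_guess("img-1"))  # upon
```

The known word does the antispam work, while the unknown word only becomes trusted text after several people who passed the check have typed the same thing.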

Digitizing books or old newsprint is a worthy chore for von Ahn. Typically, if you print something, then scan it, the computer's optical character recognition (OCR) would be able to "see" the text with 100 percent accuracy. But for older works, with faded ink or warped letters, OCR cannot reliably detect the words. ReCaptcha, which literally shows words scanned from old New York Times newsprint or books in the queue for the Internet Archive, uses people's intelligence in this process.

Through blogs on WordPress and sites like Craigslist, ReCaptcha is digitizing between 15 million and 16 million words a day. Sometimes, however, the automated system generates offbeat combinations of words, such as "bad" and "Christians," or "damn" and "liberal."

As for clients other than The New York Times? Von Ahn said he's been approached by at least one bank that wanted to digitize checks, but he turned that offer down.

"We want to do stuff with the preservation of important material," he said.

8 Comments
by sadchild July 16, 2008 5:30 AM PDT
"But with only 5,000 people playing the ESP Game simultaneously, they could label all of Google's image archive within two months." This statement does not clarify how many hours a day these people would be playing the game. Are they playing for an hour a day? Was this calculated based on "there are 24 hours in a day"? Of course, nobody's going to play this game for 24 hours a day for two months straight. This statement has incomplete information and therefore is meaningless.
by pjhenry1216 July 16, 2008 6:07 AM PDT
It's not completely meaningless because you can get an idea of how much more efficient this is than hiring people to label images themselves. Obviously, they wouldn't hire 5000 people, so it would take them much, MUCH longer than that. It's just to give you a general idea of how long it'd take. You can get an estimated order of magnitude so to speak. Beyond that, I don't think many people are worrying about this statement anyway. Take it for what it says. It's an estimate.
by silenthorn July 16, 2008 7:33 AM PDT
What is meant by this statement is that at any given time, 5000 people are playing the ESP game. It does not mean that each person is playing 24/7, just that there are always at least 5k people playing. The statement might not be perfectly clear to everyone, but it is neither incomplete nor meaningless.
by Penguinisto July 16, 2008 7:15 AM PDT
FWIW, vBulletin comes with reCAPTCHA as an option now.
by dyakoubian July 16, 2008 8:09 AM PDT
I'm amazed and appreciate how Von Ahn is putting what seems like a very good brain and his time and energy to projects that do seem to benefit mankind. Who would ever have thought .....
by Jimmu411 July 16, 2008 1:25 PM PDT
How about an "IsThisSpam?" game to figure out how people can recognize 97 different ways of mangling the word Viagra when spam filters can't?
by umanpowered January 24, 2009 6:11 PM PST
In the slide, you stated that "9 Billion of Human-Hours were played in 2003" and that it took "7 million Human-Hours" to build the Empire State Building. Then you stated that this is equivalent to "6.8 hours of Solitaire." Dividing 7 million by 9 billion (the Empire State Building hours into the Solitaire hours) gives a tiny fraction. This yields less than 1 hour of Solitaire a year (0.000777777778). How did you reach 6.8 hours?
by thedammac May 8, 2009 5:30 AM PDT
Dear Mr. Von, I am a human and I sure am not a hacker. I have run into a problem with Captcha: I cannot seem to put the verification code in the box at all. It will not print or acknowledge what I am typing. I am so-so with computers and have found this to be a nuisance. I am giving you my email, a dangerous move, in hopes you will give me a great solution. Thank you, from Miss Lola
thedammac@yahoo.com