September 5, 2006 6:58 AM PDT
Google comes to HP's aid
Google engineers have apparently been at work reviving an old text-recognition engine developed and left to rust by Hewlett-Packard.
The search giant announced that it has helped fix software bugs in the two-decade-old Tesseract, an optical character recognition (OCR) engine originally built by HP Labs. HP retired the engine in 1995 and released the code to the open-source community in recent months.
Why is Google interested in OCR? According to the company, which posted the news Thursday on its code page: "In a nutshell, we are all about making information available to users, and when this information is in a paper document, OCR is the process by which we can convert the pages of this document into text that can then be used for indexing."
The project dovetails with Google's overall goal to index and organize the world's information--everything from campy high school videos to academic papers that have yet to be digitized. With open-source technology like Tesseract, other engineers or institutions could help digitize more information in the form of papers.
Google helped with the project at the behest of engineers at the University of Nevada, Las Vegas (UNLV), who have been working with HP for the last two years to clear the dust off Tesseract. UNLV turned to Google to help fix several bugs in the old software, which in its day was one of the most accurate character recognition engines.
Tesseract was judged to be highly accurate in reading paper documents in a UNLV contest in 1995, before HP retreated from the OCR business and put the software into storage.
"Fortunately some of our esteemed HP colleagues realized a year or two ago that rather than sit on this engine, it would be better for the world if they brought it back to life by open sourcing it," Google said.
For the record, "bit rot" is computing jargon for the gradual decay of storage media or of software, according to Wikipedia. In literal terms, there's no rust involved.