August 18, 2003 9:15 AM PDT
Project searches for open-source niche
Called Nutch, the project is developing open-source software for locating documents online. But unlike major search providers, it won't cloak its formulas for matching relevant results to visitors' queries. Rather, it will provide an open window into its calculations with links to explanations on how it determined each result, according to lead architect Doug Cutting.
"All of the existing search engines have secret methods for deciding which documents are the best documents," said Cutting, whose resume includes research and development stints at Excite, Grand Central and the Palo Alto Research Center. "Search is something that's a basic need for users of the Internet--it's a valuable tool, and yet it's controlled secretly, and that seems like a bad setup. People have the right to know how their search engine works, so they can trust it."
Nutch itself has been operating secretly for roughly a year, gathering support from developers and funding from one of the biggest commercial players in search: Overture Services.
Two researchers from Overture--an advertising-supported search service that Yahoo is in the process of acquiring--approached Cutting last year with interest in providing funding for an open-source search system for academic research. Already itching to work on another search engine, Cutting spearheaded the effort from there, bringing on three founding developers and forming a board of directors that includes Mitch Kapor, founder of Lotus and co-founder of the Electronic Frontier Foundation, and Tim O'Reilly, founder and president of tech book publisher O'Reilly & Associates.
Despite its connection to Overture, the project is not-for-profit and aims to advance search by supplying a technology for experimentation. Academic researchers or developers will be able to download the software and adapt it without having to reinvent the wheel, Cutting said. Foreign governments could use Nutch to develop a noncommercial search site for citizens rather than licensing a proprietary, ad-supported technology, he said. Or corporate entities could build a for-profit business around the technology.
"If this is Linux, we're hoping there would be Red Hat," Cutting said, drawing a comparison with the open-source operating system and one of the leading companies offering it.
Danny Sullivan, editor of industry newsletter Search Engine Watch, said the effort will likely benefit people who want to develop a category or site-specific search engine such as a golf-themed search. But for broad use, an open-source engine would likely be at the mercy of spammers, Sullivan said, adding that this is a key reason indexes such as Google opt for secrecy.
"It's a great idea to have an open-source search engine--much like an open-source directory" such as the Open Directory Project used by America Online and Google, Sullivan said. "But if you do expose all this information (on how it works), it won't be useful, simply because people will spam it."
"It's the equivalent of being a newsroom and saying, 'Anybody can write an article for publication'--no filters," he said. Still, the effort may put some pressure on the commercial engines to keep their practices above board, he added.
Searching for the next big thing
Search has become a hotbed for innovation in the last year, as marketers have poured money into ad campaigns that tie their products to specific search terms. Overture and Google have built billion-dollar businesses around ad-supported search, and all the major portals have recommitted themselves to Web navigation as a result. Top computer scientists at the major portals and some academic researchers are devising ways to improve on search for the Internet and a host of applications.
The industry has also undergone much consolidation in the last year, and only a few companies--Google, Yahoo and MSN--are fielding the majority of search traffic worldwide. (Yahoo, for example, last month agreed to spend nearly $1.7 billion to buy Overture.) With fewer and fewer players, the industry has little room for checks and balances, industry watchers say. Sites such as Public Information Research's Google Watch have emerged to try to lend transparency to, or raise questions about, Google's growing importance in Web search.
Google did not respond directly to criticism that search engine formulas are too secret. But a company representative said Nutch is "yet another effort that demonstrates the value and global interest in search engine technology."
Nutch has already taken the wraps off its downloadable software for research, which is suitable for testing by other developers but likely too arcane for the average Web surfer. It is aiming to have a public site by October that will allow people to search 100 million documents to be used as a measure against indexes such as Google.
For example, a Web surfer could pull up search results from Nutch, along with the mathematical calculations behind them, and compare them with those from Google, which does not publicize its formula for calculating search results. Nutch is actively seeking funding for hardware that would support traffic from Web surfers, but for now, its systems do not have the capacity to handle an influx of visitors.
Overture would not detail the amount of money it has donated to Nutch. But it said the effort was part of a desire to better "understand the current issues surrounding search and innovative solutions in that area," Overture spokeswoman Jennifer Stephens said.
Shortly after Overture last year founded its own research group, run by Gary Flake, it invested in the open-source search engine for academic research and to enhance its own learning, Stephens said. But since Overture acquired AltaVista and Web search technology from Norway-based Fast Search & Transfer, those technologies have come to be the core of its Web search technology and testing. Nutch is an alternative test bed for the company's use, she said.
Nutch, like other popular Web names, is a nonsense word, and this one originated from Cutting's 2-year-old son, Henry. While searching for a domain name last year, Cutting heard his son pronounce "lunch" as "nutch."
The engine is written in Java and is based on Lucene, a software library that developers can use to add search capabilities to technologies such as e-mail. Nutch builds upon Lucene, also developed in part by Cutting, and uses the technology as its search library and indexing tool. But where Lucene searches whatever an application hands it, Nutch is designed to crawl and index the entire Web.
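For readers curious what an indexing library does at its core, and what a result that explains itself might look like, here is a minimal sketch. It is written in Python for brevity rather than the Java that Nutch uses, and it does not reflect the actual code or API of Nutch or Lucene. The `TinyIndex` class, its method names and the scoring formula (term count times a smoothed inverse document frequency) are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
import math


class TinyIndex:
    """Toy inverted index. Hypothetical; not the Nutch or Lucene API."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term count}
        self.doc_count = 0

    def add(self, doc_id, text):
        """Split text on whitespace and record per-term counts for the document."""
        self.doc_count += 1
        for term, count in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = count

    def search(self, query):
        """Return (doc_id, score, explanation) tuples, best score first."""
        scores = defaultdict(float)
        reasons = defaultdict(list)
        for term in query.lower().split():
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Assumed weighting: terms found in fewer documents count for more.
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, tf in docs.items():
                part = tf * idf
                scores[doc_id] += part
                # Keep a human-readable record of how each term contributed.
                reasons[doc_id].append(
                    f"{term}: tf={tf} * idf={idf:.3f} = {part:.3f}"
                )
        ranked = sorted(scores, key=lambda d: scores[d], reverse=True)
        return [(d, scores[d], reasons[d]) for d in ranked]


index = TinyIndex()
index.add("a", "open source search engine")
index.add("b", "golf search tips")
index.add("c", "open source operating system")

results = index.search("open search")
for doc_id, score, why in results:
    print(doc_id, round(score, 3), why)
```

Each result carries the arithmetic behind its score, which is the kind of visibility the project describes. A Web-scale engine would also need a crawler, link analysis and defenses against the spam Sullivan warns about, none of which this sketch attempts.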
Cutting is particularly concerned about the effects of advertising-heavy search providers. As the engines become laden with links to products and services, that cargo could sway a search for noncommercial data. He's also concerned about U.S. search companies becoming dominant overseas.
"It would be nice if there were an open-source search engine owned by the world."