July 25, 2008 1:21 PM PDT

Google reveals scope of Web-crawling task

by Stephen Shankland

It's a pity the National Security Agency can't talk about its computational challenges, because it's leaving a lot of the boasting rights to Google.

(Credit: Paul Ford)

In a blog posting on Friday, the company shared some details about the challenges of one aspect of its search operation: the Web indexing and processing that must take place before results are delivered to users. The short version: Google has no choice but to think big.

First comes surfing. "We start at a set of well-connected initial pages and follow each of their links to new pages. Then we follow the links on those new pages to even more pages and so on, until we have a huge list of links," said software engineers Jesse Alpert and Nissan Hajaj. "Even after removing...exact duplicates, we saw a trillion unique URLs, and the number of individual web pages out there is growing by several billion pages per day."
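To make the crawl step concrete, here is a minimal sketch of the process the engineers describe: start from seed pages, follow links outward, and keep each URL only once. It is an illustration under stated assumptions, not Google's crawler. The `crawl` function, the `get_links` callback, and the `TOY_WEB` link table are all hypothetical, and an in-memory dictionary stands in for fetching real pages.

```python
from collections import deque


def crawl(seeds, get_links):
    """Breadth-first link discovery that drops exact-duplicate URLs.

    seeds:     iterable of starting URLs (hypothetical "well-connected" pages)
    get_links: callable returning the outgoing links of a URL; a stand-in
               for fetching and parsing a real page
    """
    queue = deque(dict.fromkeys(seeds))  # de-duplicate seeds, keep order
    seen = set(queue)
    while queue:
        url = queue.popleft()
        for link in get_links(url):
            if link not in seen:  # exact-duplicate check only
                seen.add(link)
                queue.append(link)
    return seen


# Hypothetical four-page web used only for illustration.
TOY_WEB = {
    "a.example/": ["b.example/", "c.example/"],
    "b.example/": ["c.example/", "d.example/"],
    "c.example/": ["a.example/"],
    "d.example/": [],
}

found = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
print(sorted(found))
```

At a trillion URLs the `seen` set alone would not fit on one machine, which hints at why the task is a barrier to entry.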

Next comes analyzing the "link graph"--the mathematical representation of what links to what. That's a key foundation of Google's PageRank algorithm, which brought the company's search engine to prominence by assigning importance to those pages that other important pages point toward.
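The idea that important pages are those that other important pages point toward can be sketched as a simple iterative calculation over the link graph. The code below is a textbook-style power iteration, not Google's production system. The `pagerank` function, the damping value of 0.85, the iteration count, and the `TOY_LINKS` graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a dict of page -> outgoing links.

    damping and iterations are illustrative defaults, not values from the article.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Pages with no outgoing links spread their rank evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: "c" is linked to by every other page.
TOY_LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

ranks = pagerank(TOY_LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page "c" ends up with the highest score because every other page links to it, and page "a" scores nearly as high because its one inbound link comes from "c".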

In the early days of Google, computing PageRank for the company's collection of a mere 26 million pages took a workstation "a couple hours," and the results would be used for some unspecified period of time. Today, Google surfs the Web continuously and recalculates the link graph "several times per day."

"This graph of one trillion URLs is similar to a map made up of one trillion intersections. So multiple times every day, we do the computational equivalent of fully exploring every intersection of every road in the United States. Except it'd be a map about 50,000 times as big as the U.S., with 50,000 times as many roads and intersections," the engineers said.

Google likes to talk about how users have choice and competition just one click away, and that's a fair point. But the blog post also makes it even clearer just how high barriers to entry are in the search market. That's one of the reasons Yahoo's BOSS (build your own search service) program is intriguing: it lets search start-ups take advantage of Yahoo's crawling, indexing, and search technology in exchange for advertising or revenue-sharing partnerships.

Originally posted at Digital Media
Stephen Shankland writes about a wide range of technology and products, but has a particular focus on browsers and digital photography. He joined CNET News in 1998 and since then also has covered Google, Yahoo, servers, supercomputing, Linux and open-source software, and science. E-mail Stephen, or follow him on Twitter at http://www.twitter.com/stshank.
by n2d2 July 25, 2008 10:58 PM PDT
How many of these are spam-link filled bot-generated blogspot.com sites? That's what I'd like to know.
by madflacker July 26, 2008 8:31 AM PDT
I also wonder how much human touch is required for Google to recognize new pages. It's always spun up as fully automated / machine learning / AI type of fodder. But there are many instances where I've found huge holes in subject matter (in a niche, granted) that aren't on Google's radar. There are plenty of areas - like the free webmaster tools - that foster PEOPLE giving Google new info on new sites and so forth. It would be interesting to know the areas where Google is still heavily reliant on human touch / human oversight. Find those areas and develop more automated approaches, and that seems like another good path for a young co. to get snapped up by Google.