July 25, 2008 1:21 PM PDT

Google reveals scope of Web-crawling task

by Stephen Shankland

It's a pity the National Security Agency can't talk about its computational challenges, because it's leaving a lot of the boasting rights to Google.

In a blog posting on Friday, the company shared some detail about one aspect of its search operation: the Web indexing and processing that must take place before results are delivered to users. The short version: Google has no choice but to think big.

First comes surfing. "We start at a set of well-connected initial pages and follow each of their links to new pages. Then we follow the links on those new pages to even more pages and so on, until we have a huge list of links," said software engineers Jesse Alpert and Nissan Hajaj. "Even after removing...exact duplicates, we saw a trillion unique URLs, and the number of individual web pages out there is growing by several billion pages per day."
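
The link-following process the engineers describe resembles a breadth-first traversal that remembers which URLs it has already seen. The following is a minimal sketch under that assumption, not Google's actual crawler: the `crawl` function, the `get_links` callback, and the in-memory `TOY_WEB` dictionary are all hypothetical stand-ins for fetching and parsing real pages.

```python
from collections import deque

def crawl(seed_pages, get_links):
    """Discover URLs breadth-first, starting from a set of seed pages.

    get_links is a hypothetical callback returning the outgoing links of a page.
    """
    seen = set(seed_pages)      # every URL discovered so far (exact duplicates dropped)
    queue = deque(seed_pages)   # pages whose links have not been followed yet
    while queue:
        page = queue.popleft()
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

# A toy "Web" held in memory; a real crawler would fetch and parse each page.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
    "e.example": ["a.example"],  # nothing links here, so it is never discovered
}

found = crawl(["a.example"], lambda page: TOY_WEB.get(page, []))
print(sorted(found))
```

In this toy example the crawl reaches four of the five pages. The fifth is never found because no discovered page links to it, which hints at why the choice of well-connected starting pages matters.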

Next comes analyzing the "link graph"--the mathematical representation of what links to what. That's a key foundation of Google's PageRank algorithm, which brought the company's search engine to prominence by assigning importance to those pages that other important pages point toward.
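
To make the idea of "important pages pointing at important pages" concrete, here is a small sketch of the textbook iterative form of PageRank. It is an illustration only, not Google's production code: the `pagerank` function, the damping factor of 0.85 (the conventional textbook value, not a figure from the blog post), the iteration count, and the four-page `toy_graph` are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeated redistribution of scores.

    links maps each page to the list of pages it links to; every link
    target must also appear as a key. damping and iterations are
    illustrative defaults, not values taken from the article.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal importance
    for _ in range(iterations):
        # Every page receives a small baseline share of importance.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no outgoing links spreads its score over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page link graph.
toy_graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

ranks = pagerank(toy_graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page "c" ends up with the highest score because three other pages link to it, while "d", which nothing links to, ends up with the lowest. The scores always sum to 1, so they can be read as shares of total importance.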

In the early days of Google, computing PageRank for the company's collection of a mere 26 million pages took a workstation "a couple hours," and the results would be used for some unspecified period of time. Today, Google surfs the Web continuously and recalculates the link graph "several times per day."

"This graph of one trillion URLs is similar to a map made up of one trillion intersections. So multiple times every day, we do the computational equivalent of fully exploring every intersection of every road in the United States. Except it'd be a map about 50,000 times as big as the U.S., with 50,000 times as many roads and intersections," the engineers said.

Google likes to talk about how users have choice and competition just one click away, and that's a fair point. But the blog post also makes it even clearer just how high barriers to entry are in the search market. That's one of the reasons Yahoo's BOSS (build your own search service) program is intriguing: it lets search start-ups take advantage of Yahoo's crawling, indexing, and search technology in exchange for advertising or revenue-sharing partnerships.

Originally posted at Digital Media
Stephen Shankland writes about a wide range of technology and products, but has a particular focus on browsers and digital photography. He joined CNET News in 1998 and since then also has covered Google, Yahoo, servers, supercomputing, Linux and open-source software, and science. E-mail Stephen, or follow him on Twitter at http://www.twitter.com/stshank.
by n2d2 July 25, 2008 10:58 PM PDT
How many of these are spam-link filled bot-generated blogspot.com sites? That's what I'd like to know.
by madflacker July 26, 2008 8:31 AM PDT
I also wonder how much human touch is required for Google to recognize new pages. It's always spun up as fully automated / machine learning / AI type of fodder. But there are many instances where I've found huge holes in subject matter (in a niche, granted) that isn't on Google's radar. There are plenty of areas - like the free webmaster tools - that foster PEOPLE giving Google new info on new sites and so forth. It would be interesting to know the areas where Google is still heavily reliant on human touch / human oversight. Find those areas and develop more automated approaches, and that seems like another good path for a young co. to get snapped up by Google.