July 25, 2008 1:21 PM PDT

Google reveals scope of Web-crawling task

by Stephen Shankland

It's a pity the National Security Agency can't talk about its computational challenges, because its silence leaves a lot of the boasting rights to Google.

(Credit: Paul Ford)

In a blog posting on Friday, the company shared some details about the challenges of one aspect of its search operation: the Web indexing and processing that must take place before results are delivered to users. The short version: Google has no choice but to think big.

First comes surfing. "We start at a set of well-connected initial pages and follow each of their links to new pages. Then we follow the links on those new pages to even more pages and so on, until we have a huge list of links," said software engineers Jesse Alpert and Nissan Hajaj. "Even after removing...exact duplicates, we saw a trillion unique URLs, and the number of individual web pages out there is growing by several billion pages per day."
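For readers who want to see the shape of that process, here is a minimal sketch of breadth-first link discovery with duplicate removal. It is an illustration only, not Google's crawler: the `TOY_WEB` dictionary, the `crawl` function, and the `example` URLs are all hypothetical stand-ins, and the "web" is a small in-memory table rather than real pages fetched over the network.

```python
from collections import deque

# Hypothetical toy web: each URL maps to the links found on that page.
TOY_WEB = {
    "a.example/": ["b.example/", "c.example/"],
    "b.example/": ["c.example/", "d.example/"],
    "c.example/": ["a.example/"],
    "d.example/": [],
}


def crawl(seeds, get_links):
    """Follow links breadth-first from the seed pages.

    Returns every unique URL exactly once, in the order it was discovered.
    """
    start = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(start)
    queue = deque(start)
    discovered = []
    while queue:
        url = queue.popleft()
        discovered.append(url)
        for link in get_links(url):
            if link not in seen:  # skip exact duplicates
                seen.add(link)
                queue.append(link)
    return discovered


discovered = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
print(discovered)
```

The `seen` set is what keeps the list from growing without bound when pages link back to each other, as `c.example/` does here.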

Next comes analyzing the "link graph"--the mathematical representation of what links to what. That's a key foundation of Google's PageRank algorithm, which brought the company's search engine to prominence by assigning importance to those pages that other important pages point toward.
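The idea can be sketched in a few lines using the textbook power-iteration form of PageRank. This is an assumption-laden illustration, not Google's production method: the damping factor of 0.85 is the value commonly cited in published descriptions, and the `pagerank` function, the iteration count, and the four-page `LINKS` graph are hypothetical choices made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a small link graph.

    `links` maps each page to the list of pages it links to. Every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: three pages point at "hub", which points back at two.
LINKS = {"hub": ["a", "b"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, "hub" ends up with the highest score because every other page links to it, and "a" and "b" outrank "c" because the important page links to them and nothing links to "c".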

In the early days of Google, computing PageRank for the company's collection of a mere 26 million pages took a workstation "a couple hours," and the results would be used for some unspecified period of time. Today, Google surfs the Web continuously and recalculates the link graph "several times per day."

"This graph of one trillion URLs is similar to a map made up of one trillion intersections. So multiple times every day, we do the computational equivalent of fully exploring every intersection of every road in the United States. Except it'd be a map about 50,000 times as big as the U.S., with 50,000 times as many roads and intersections," the engineers said.

Google likes to talk about how users have choice and competition just one click away, and that's a fair point. But the blog post also makes it even clearer just how high barriers to entry are in the search market. That's one of the reasons Yahoo's BOSS (build your own search service) program is intriguing: it lets search start-ups take advantage of Yahoo's crawling, indexing, and search technology in exchange for advertising or revenue-sharing partnerships.

Stephen Shankland writes about a wide range of technology and products, but has a particular focus on browsers and digital photography. He joined CNET News in 1998 and since then also has covered Google, Yahoo, servers, supercomputing, Linux and open-source software, and science. E-mail Stephen, or follow him on Twitter at http://www.twitter.com/stshank.
by n2d2 July 25, 2008 10:58 PM PDT
How many of these are spam-link filled bot-generated blogspot.com sites? That's what I'd like to know.
by madflacker July 26, 2008 8:31 AM PDT
I also wonder how much human touch is required for Google to recognize new pages. It's always spun up as fully automated / machine learning / AI type of fodder. But there are many instances where I've found huge holes in subject matter (in a niche, granted) that isn't on Google's radar. There are plenty of areas - like the free webmaster tools - that foster PEOPLE giving Google new info on new sites and so forth. It would be interesting to know the areas where Google is still heavily reliant on human touch / human oversight. Find those areas and develop more automated approaches, and that seems like another good path for a young co. to get snapped up by Google.