Google dips toes into 'deep Web' search
Google's search bots, which constantly scour the Web for new pages, have begun a new, more active phase of their indexing work.
In a blog post Friday, Jayant Madhavan and Alon Halevy of Google's crawling and indexing team said the company has begun an experiment in which its indexing software enters text into Web site forms to see what previously undiscovered pages appear.
"In the past few months, we have been exploring some HTML forms to try to discover new Web pages and URLs that we otherwise couldn't find and index for users who search on Google," they wrote. "This experiment is part of Google's broader effort to increase its coverage of the Web. In fact, HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines."
The new Google indexing practice involves only "high quality" Web sites and doesn't run on sites that use "robots.txt" files or other standard mechanisms to ward off indexing software.
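For readers unfamiliar with the mechanism, here is a minimal sketch of how a crawler can consult a site's robots.txt rules before requesting a URL. It uses Python's standard `urllib.robotparser` module. The `may_crawl` helper, the `ExampleBot` user agent, and the example.com URLs are assumptions made for illustration; this is not Google's code.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url.

    Hypothetical helper for illustration; a real crawler would first
    download robots.txt from the site it is about to visit.
    """
    parser = RobotFileParser()
    # parse() accepts the file's contents as an iterable of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules: every crawler is asked to stay out of /search
EXAMPLE_RULES = "User-agent: *\nDisallow: /search\n"

if __name__ == "__main__":
    for url in ("https://example.com/search?q=cars", "https://example.com/about"):
        print(url, may_crawl(EXAMPLE_RULES, "ExampleBot", url))
```

A site owner who does not want form results crawled could, under these assumptions, disallow the path that the form submits to.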
To decide what words to "type" into the forms, the indexing software samples words from the Web page that contains the form, Google said.
The technology appears to be related to Transformic, a company Google acquired, according to a blog post by Anand Rajaraman, who was involved with the technology earlier in his career while working for Halevy.
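Google did not publish the details of how it picks those words. As a rough sketch only, the idea of sampling candidate form inputs from a page's own text might look like the following in Python. The `candidate_terms` function, the small stopword list, and the frequency weighting are all assumptions made for illustration, not Google's method.

```python
import random
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger ones)
STOPWORDS = {"the", "and", "for", "with", "this", "that", "from"}


def candidate_terms(page_text, k=5, seed=None):
    """Pick up to k distinct words from page_text to try as form inputs.

    Words that occur more often on the page are more likely to be chosen.
    """
    words = re.findall(r"[a-z]{3,}", page_text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    rng = random.Random(seed)
    vocab = sorted(counts)               # stable order so a seed is repeatable
    weights = [counts[w] for w in vocab]
    chosen = []
    # Weighted sampling without replacement: draw one word, then remove it
    while vocab and len(chosen) < k:
        pick = rng.choices(vocab, weights=weights, k=1)[0]
        i = vocab.index(pick)
        chosen.append(pick)
        vocab.pop(i)
        weights.pop(i)
    return chosen


if __name__ == "__main__":
    page = "Search used cars by make and model. Cars listed daily: sedans, trucks, hybrids."
    print(candidate_terms(page, k=3, seed=0))
```

Each chosen word would then be submitted through the form, and any result pages that come back could be considered for indexing.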
Stephen Shankland writes about a wide range of technology and products, but has a particular focus on browsers and digital photography. He joined CNET News in 1998 and since then also has covered Google, Yahoo, servers, supercomputing, Linux and open-source software, and science. E-mail Stephen, or follow him on Twitter at http://www.twitter.com/stshank. 

- Sketchy...
- by 47project April 15, 2008 9:53 AM PDT
- Just think about how many HTML forms do not check their referrer or protect their data effectively against injection by spammers/bots. It's kinda scary to think about how Google can potentially get into data that may not have been intended for the public eye but was accessed by them because of its insecure implementation. Make sure you have web developers that know what they're doing! Eeeek.