February 1, 2006 4:21 PM PST

How to evade Google search

Dell apparently learned the hard way this week that companies must take care that information they store on the Internet but want to keep hidden is not automatically added to a search engine index for everyone on the Web to see.

Specifications for future Dell notebooks were accessible via Google's search site before the content was pulled from a Dell FTP site and from Google's cache.

Google, like the other major search engines, runs an automated system that sends software robots called "spiders" out to crawl the Web and find sites to add to the index it maintains. Because the spiders follow links running from one Web site to others, they pick up sites on their own without Webmasters having to manually submit them to search engines.
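To make the link-following idea concrete, here is a minimal sketch in Python. It is not how Google's crawler actually works. It only shows how a page that nobody submitted can still be discovered once another page links to it. The `PAGES` dictionary, the example.com-style URLs and the `crawl` helper are all hypothetical stand-ins for real network fetches.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(pages, seed):
    """Breadth-first walk over an in-memory 'web' (hypothetical helper).

    pages maps URL -> HTML text and stands in for real HTTP fetches.
    Returns the URLs in the order they were discovered.
    """
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        extractor = LinkExtractor()
        extractor.feed(pages.get(url, ""))
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical three-page web: only the first URL is known to the crawler.
PAGES = {
    "http://a.example/": '<a href="http://b.example/">B</a>',
    "http://b.example/": (
        '<a href="http://b.example/unlisted.html">specs</a>'
        '<a href="http://a.example/">A</a>'
    ),
    "http://b.example/unlisted.html": "future notebook specs",
}

discovered = crawl(PAGES, "http://a.example/")
print(discovered)
```

The "unlisted" page is never submitted anywhere, yet it ends up in `discovered` because one other page links to it.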

Webmasters also can provide the URL, or Web address, for pages they want crawled, and they can submit detailed site maps to Google, according to Google's "information for Webmasters" pages.

Webmasters who want to keep some or all of their site private from the Googlebot can put a standard document called "robots.txt" at the root of the server that instructs the crawler not to download content. If content has already been indexed and its removal is urgent, the Webmaster can submit a request via Google's automatic URL removal system, but must first provide an e-mail address and password.
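As a rough illustration of the robots.txt mechanism, the sketch below feeds a sample file to Python's standard `urllib.robotparser` module and asks which URLs a crawler may fetch. The `/private/` path, the example.com URLs and the "OtherBot" name are made up for the example. Compliance is voluntary, so this only models what a well-behaved crawler would do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot is kept out of /private/ only,
# while every other crawler is asked to stay away from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

googlebot_private = parser.can_fetch("Googlebot", "http://example.com/private/specs.pdf")
googlebot_public = parser.can_fetch("Googlebot", "http://example.com/index.html")
other_public = parser.can_fetch("OtherBot", "http://example.com/index.html")

print(googlebot_private, googlebot_public, other_public)
```

Note that the file itself is publicly readable at the server root, so it lists the very paths it asks crawlers to skip. It is a request to crawlers, not an access control.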

Content that has been removed from a site can still be viewed through Google's cache, which is a "snapshot" and archive of each page crawled. Webmasters can prevent pages from being cached by inserting specific code in them.
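The article does not say what that code is. The commonly documented mechanism is a robots meta tag in the page's head, such as `<meta name="robots" content="noarchive">`, and the sketch below assumes that is what is meant. It uses Python's standard `html.parser` to check whether a page carries the directive. The `robots_directives` helper and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Gathers directives from <meta name="robots"> or <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            # The content attribute is a comma-separated list of directives.
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page that asks search engines not to keep a cached copy.
PAGE = (
    '<html><head><meta name="robots" content="noarchive"></head>'
    "<body>Specs</body></html>"
)

directives = robots_directives(PAGE)
print("noarchive" in directives)
```

A check like this could be run over a site's pages before publishing to confirm that sensitive ones carry the tag.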

Webmasters must remember that Google's is not the only search engine crawler they have to worry about. Removing content from Google's cache does not mean that other search engines won't index and cache it.

6 comments

Nice for Dell
They got the cache cleared in a day. Took me a month. Now I know my place in the universe.
Posted by gggg sssss (2292 comments )
in order to get it done fast..
you need to call.. or visit google's headquarters, which dell undoubtedly did
Posted by assman (966 comments )
Google's own tips don't even work
My site got hit by being #1 for the "Iraq flag" Google image search. So I followed Google's webmaster tips for preventing images from being indexed by using a robots.txt file. It was supposed to happen the next time they indexed my site (which happens very frequently for my site - every few days). Instead my site remained #1 for three months afterwards.

While this is not the same as trying to not get indexed in the first place, it leads me to think that I probably wouldn't trust all their tips to work as advertised. Who knows, maybe they index and cache everyone's content anyway regardless of the "don't index me" hints. Then once in a while stuff accidentally gets into the index, or doesn't get removed quickly. Plus if your stuff is in their cache, who knows how long it will stay there for the Department of Justice or anyone else to subpoena them for?

I once had someone contact me about my resume, which was on my web server but which I knew I had never linked to from my site or shared the link to with anyone. Clearly the person found the file just by guessing the URL. I wouldn't be at all surprised if Google and others do this kind of thing to uncover hidden content.

For everyone, the working assumption should be, if you don't password protect access to your content then it will probably end up indexed on Google or on some other search engine index one day.
Posted by whogrant (32 comments )
dangerous?
well... the file must be robots.txt, and some people think this can be dangerous, 'cause other ppl can know in an easy way (just retrieving the robots file) where your valuable content is.
Posted by gerardocb (1 comment )
tech republic locksmith
And, don't forget about archive.org
Posted by (1 comment )
USE robots.txt (NOT robot.txt)!!!
This article misstates basic SEO 101 subject matter regarding robots.txt; this site has accurate info:

http://www.searchengineworld.com/robots/robots_tutorial.htm
Posted by (2 comments )