August 24, 2007 5:00 AM PDT

Feds use robots.txt files to stay invisible online. Lame.

by Declan McCullagh

I noticed, when writing a story on Thursday about the bizarre claims by National Intelligence Director Mike McConnell, that the DNI is trying to hide from search engines. Its robots.txt file says, simply:

User-agent: *
Disallow: /

That blocks all search engines, including Google, MSN, Yahoo, and so on, from indexing any files at the Office of the Director of National Intelligence's Web site. (Here's some background on the Robots Exclusion Protocol if you're rusty.)
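To see the effect of that two-line file, here is a minimal sketch using Python's standard-library robots.txt parser. The crawler names and the page path are illustrative assumptions, not taken from any agency's logs, and the sketch parses the rules locally rather than contacting dni.gov.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt quoted above, one list item per line.
BLANKET_RULES = ["User-agent: *", "Disallow: /"]


def allowed(rules, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse in memory; no network request is made
    return parser.can_fetch(agent, url)


# Hypothetical page path, used only for illustration.
PAGE = "http://www.dni.gov/any/page.html"

for agent in ("Googlebot", "msnbot", "Slurp"):
    # Every well-behaved crawler gets the same answer: not allowed.
    print(agent, allowed(BLANKET_RULES, agent, PAGE))
```

Each line prints False, because the wildcard user-agent matches every crawler and the lone "/" rule matches every path on the site.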

So I figured it would be interesting to see which other fedgov sites do the same. I wrote a quick Perl program to connect to federal government Web sites, check for the presence of a broad robots.txt exclusion, and report the results. By way of disclaimer, the program draws on the same database I used in an article from early 2006, so the list is probably a bit out-of-date.
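The original Perl program is not reproduced here. As a rough idea of what such a check might look like, the following Python sketch tests whether a robots.txt file shuts out a crawler it has never heard of. The names `PROBE_AGENT`, `has_blanket_exclusion`, and `fetch_robots` are hypothetical, and the heuristic is an assumption: it only asks whether the site root is off-limits, so a file that blocks "/" but carves out a few allowed paths would still be flagged.

```python
import urllib.request
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name that no robots.txt is expected to mention, so the
# parser falls back to whatever the "User-agent: *" group says.
PROBE_AGENT = "survey-probe"


def has_blanket_exclusion(robots_text):
    """Return True if an unnamed crawler is barred from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return not parser.can_fetch(PROBE_AGENT, "/")


def fetch_robots(site_url, timeout=10):
    """Download a site's robots.txt; return '' if it cannot be retrieved."""
    robots_url = site_url.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(robots_url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers HTTP errors, DNS failures, and timeouts alike.
        return ""
```

A survey would then loop over a list of site URLs and report those for which `has_blanket_exclusion(fetch_robots(url))` is true. A missing or unreachable robots.txt counts as "no exclusion" in this sketch, which matches how crawlers conventionally treat a missing file.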

The government sites that mark themselves as entirely off-limits via robots.txt:

http://www.dni.gov/robots.txt
https://gits-sec.treas.gov/robots.txt
http://thomas.loc.gov/robots.txt
http://www.erl.noaa.gov/robots.txt
http://www.nwd.usace.army.mil/robots.txt
http://www.tricare.mil/robots.txt

Some government sites favor one search engine over another (Customs and Border Protection bans all non-governmental search engines except Google; one Army Corps of Engineers site bans Alexa's spider; the Ginnie Mae agency bans Google's image search bot but not, say, Altavista's; the Minority Business Development Agency completely bans all crawlers but Google's; and one Bureau of Reclamation site bans Googlebot v2.1 but allows MSN's bot):

http://cbp.gov/robots.txt
http://www.nad.usace.army.mil/robots.txt
http://www.ginniemae.gov/robots.txt
http://www.mbda.gov/robots.txt
http://www.mp.usbr.gov/
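For readers unfamiliar with how a site can play favorites, here is a sketch of a robots.txt that admits one crawler and bans the rest, checked with Python's standard-library parser. The file contents are a hypothetical example written for illustration, not a copy of any agency's actual file, and `crawl_permissions` is a made-up helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: an empty Disallow line lets the named crawler in,
# while the wildcard group shuts everyone else out.
SELECTIVE_RULES = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def crawl_permissions(robots_text, agents, path="/"):
    """Map each crawler name to whether it may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


# ia_archiver is the user-agent name Alexa's crawler has used.
print(crawl_permissions(SELECTIVE_RULES, ["Googlebot", "msnbot", "ia_archiver"]))
```

Under these rules only Googlebot is reported as allowed. Swapping the two user-agent names in the file would produce the opposite kind of favoritism, banning one named crawler while admitting the others.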

And here are some sites that seem to have had trouble with misbehaving Web crawlers in the past:

http://www.cdc.gov/robots.txt
http://www.glerl.noaa.gov/robots.txt
http://www.usbr.gov/robots.txt
http://www.onr.navy.mil/robots.txt
http://www.senate.gov/robots.txt
http://www.usdoj.gov/robots.txt

Now, I'm the last person to suggest that using robots.txt to cordon off subsets of your Web site is somehow evil. At News.com, we use it to tell search engines not to index our "email story" pages, for instance, and on my own Web site I use it as well. Blocking misbehaving Web crawlers is important and necessary. And robots.txt may be appropriate when a Web site's address changes, which seems to have happened in the case of the National Oceanic and Atmospheric Administration's site in the first chunk of examples above, or when it becomes defunct, which seems to have happened with the Treasury Department's "gits-sec" Web site above.

But why should entire federal offices like the Director of National Intelligence want to remain invisible online? I can think of two reasons: (a) avoiding the situation of posting a report that turned out to be embarrassing and was discovered by Google and (b) letting the Feds modify a file such as a transcript without anyone noticing. (There have been allegations of the Bush administration altering, or at least creatively interpreting, transcripts before. And I've documented how a transcript of a public meeting was surreptitiously deleted -- and then restored.)

Neither situation benefits the public. In fact, I'd say it calls for a friendly amendment to the Robots Exclusion Protocol: Search engines should ignore robots.txt when a government agency is trying to use it to keep its entire Web site hidden from the public.

Declan McCullagh, CNET News' chief political correspondent, chronicles the intersection of politics and technology. He has covered politics, technology, and Washington, D.C., for more than a decade, which has turned him into an iconoclast and a skeptic of anyone who says, "We oughta have a new federal law against this."

9 Comments

Isn't Google...
by devbost August 24, 2007 5:28 AM PDT
...in the practice of ignoring robots.txt from government web sites? I seem to recall a few occasions where bloggers were able to compare live versions of pages at the White House and elsewhere with what was in the Google cache and highlight differences in things like event transcripts, where you'd think they'd have no legitimate reason to make edits after posting.
Crazy Talk vs. Reality
by mstrclark August 24, 2007 5:50 AM PDT
In reality, individual developers probably put the robots.txt file there.

I doubt there is any policy in place regarding search engines and how government sites should prevent being listed.

Your little jab at the government is just crazy talk and that is lame.
Require Nothing
by ballssalty August 24, 2007 6:41 AM PDT
I think it's a little ridiculous to start creating a protocol that picks and chooses which websites it can ignore a block on. If a website operator deems it necessary to block web crawlers from indexing a site, be it a government site or a business, that is their prerogative.

Making the claim that the government websites are hiding is deceitful. The websites are not hidden; they're on the web. Just because it's not convenient for someone to search them does not equate to hiding. There are plenty of legitimate reasons to block spiders, ranging from performance to security.

I agree you can't just trust the government to do no evil, but to assume an evil motive behind every little thing they do borders on paranoia.
I agree to an extent
by jelloburn August 24, 2007 7:58 AM PDT
I also believe that transparency in government is a must and that all government information should be provided to its citizens in the easiest and most accessible way possible.

Hiding information from search engines is hampering citizens' attempts at accessing government information. I don't necessarily believe that the government is hiding information so that they can change it later; I just think they should be more accessible and transparent in their actions.
Evil motive?
by declan00 September 4, 2007 12:27 PM PDT
Oh, I'm hardly assuming there's an evil motive. It could be simple incompetence.
Tin Foil Hat
by attilad August 24, 2007 7:06 AM PDT
C|Net News Blogs = Tech Tabloids, apparently. What a waste of bandwidth.
Search engines should ignore robots.txt on .GOV websites
by fluffytheecat August 28, 2007 7:09 PM PDT
Obviously someone at each of those government entities has tried to block search engines from their content. Since the content is public information, it should be available to all of the public, especially thru search engines.

What is lame is that the government has tried to restrict our access to the information.

And I am guessing that those of you who reacted negatively to this story have ties to the feds. And if you deny it, you confirm it. And if you don't deny it, the allegation stands.

Somehow that reminds me of Monty Python and the witch drownings. Oh well.

:D
I can understand the reasoning for banning certain bots.
by inachu September 5, 2007 5:51 AM PDT
Many have banned Google's bots because they are too aggressive. But in this SEO industry many things appear to be hit or miss.
Department of History Correction
by disco-legend-zeke September 7, 2007 8:48 AM PDT
The right of the public to watch our government is a basis of American Freedoms.

Except in cases of national security, there is no benefit to the People by hiding web content from search engines.

Maybe a few Freedom of Information Act lawsuits are necessary to get these bureaucrats to see the light of day.

I agree, .GOV sites should be spiderable in spite of robots.txt files. Our tax dollars created the sites, we should have easy access to the information, including searchability via our favorite engine.

This is a page straight out of George Orwell's "1984."