Feds use robots.txt files to stay invisible online. Lame.
I noticed, when writing a story on Thursday about the bizarre claims by National Intelligence Director Mike McConnell, that the DNI is trying to hide from search engines. Its robots.txt file says, simply:
User-agent: *
Disallow: /
That blocks all search engines, including Google, MSN, Yahoo, and so on, from indexing any files at the Office of the Director of National Intelligence's Web site. (Here's some background on the Robots Exclusion Protocol if you're rusty.)
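To see how a crawler that honors the protocol would read those two lines, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The crawler names and the `index.html` path are illustrative assumptions, not taken from the DNI site, and real search engines use their own parsers.

```python
from urllib.robotparser import RobotFileParser

# The two-line file quoted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative crawler names; the page URL is a made-up example.
for agent in ("Googlebot", "msnbot", "Slurp"):
    print(agent, allowed(ROBOTS_TXT, agent, "http://www.dni.gov/index.html"))
```

Because the wildcard group applies to any crawler that has no group of its own, every well-behaved bot gets the same answer for every path.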
So I figured it would be interesting to see what other fedgov sites did the same. I wrote a quick Perl program to connect to federal government Web sites, check for the presence of a broad robots.txt exclusion, and report the results. By way of disclaimer, the list of sites comes from the same database I used in an article from early 2006, so it's probably a bit out of date.
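The Perl program itself isn't reproduced here. As a rough idea of what such a check could look like, here is a sketch in Python rather than Perl; the function names `has_blanket_exclusion` and `fetch_robots_txt` are hypothetical, and the rule for what counts as a "broad exclusion" (a `Disallow: /` line inside a `User-agent: *` group) is an assumption about the test, not a description of the original script.

```python
import urllib.request


def has_blanket_exclusion(robots_txt: str) -> bool:
    """Return True if a `User-agent: *` group contains `Disallow: /`."""
    agents = []        # user agents named by the current group
    in_rules = False   # True once the current group has listed a rule
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new group begins
                agents, in_rules = [], False
            agents.append(value)
        else:
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


def fetch_robots_txt(site: str, timeout: float = 10.0) -> str:
    """Download robots.txt from a site root such as 'http://www.dni.gov'."""
    url = site.rstrip("/") + "/robots.txt"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

A survey would loop over the list of sites and call `has_blanket_exclusion(fetch_robots_txt(site))`, catching network errors for hosts that no longer respond. This simple check ignores `Allow:` lines, so a file that bars everything and then carves out exceptions would still be flagged.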
The government sites that mark themselves as entirely off-limits via robots.txt:
http://www.dni.gov/robots.txt
https://gits-sec.treas.gov/robots.txt
http://thomas.loc.gov/robots.txt
http://www.erl.noaa.gov/robots.txt
http://www.nwd.usace.army.mil/robots.txt
http://www.tricare.mil/robots.txt
Some government sites favor one search engine over another (Customs and Border Protection bans all non-governmental search engines except Google; one Army Corps of Engineers site bans Alexa's spider; the Ginnie Mae agency bans Google's image search bot but not, say, Altavista's; the Minority Business Development Agency completely bans all crawlers but Google's; and one Bureau of Reclamation site bans Googlebot v2.1 but allows MSN's bot):
http://cbp.gov/robots.txt
http://www.nad.usace.army.mil/robots.txt
http://www.ginniemae.gov/robots.txt
http://www.mbda.gov/robots.txt
http://www.mp.usbr.gov/
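For readers who haven't seen a selective robots.txt, the sketch below shows the general shape of a file that bars every crawler except Google's. The file contents are a hypothetical example written for illustration, not copied from any of the sites listed, and the check uses Python's standard-library `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: an empty Disallow lets Googlebot in, and the
# wildcard group bars everyone else.
SELECTIVE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def crawl_permissions(robots_txt, agents, url="/"):
    """Map each crawler name to whether it may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# ia_archiver is the name Alexa's spider uses; msnbot is MSN's.
print(crawl_permissions(SELECTIVE_ROBOTS_TXT, ["Googlebot", "msnbot", "ia_archiver"]))
```

Under this file only Googlebot is permitted; any crawler without its own group falls through to the wildcard rule.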
And here are some sites that seem to have had trouble with misbehaving Web crawlers in the past:
http://www.cdc.gov/robots.txt
http://www.glerl.noaa.gov/robots.txt
http://www.usbr.gov/robots.txt
http://www.onr.navy.mil/robots.txt
http://www.senate.gov/robots.txt
http://www.usdoj.gov/robots.txt
Now, I'm the last person to suggest that using robots.txt to cordon off subsets of your Web site is somehow evil. At News.com, we use it to tell search engines not to index our "email story" pages, for instance, and on my own Web site I use it as well. Blocking misbehaving Web crawlers is important and necessary. And robots.txt may be appropriate when a Web site's address changes, which seems to have happened in the case of the National Oceanic and Atmospheric Administration's site in the first chunk of examples above, or when it becomes defunct, which seems to have happened with the Treasury Department's "gits-sec" Web site above.
But why should entire federal offices like the Director of National Intelligence want to remain invisible online? I can think of two reasons: (a) avoiding the situation in which a posted report turns out to be embarrassing and is discovered through Google and (b) letting the Feds modify a file such as a transcript without anyone noticing. (There have been allegations of the Bush administration altering, or at least creatively interpreting, transcripts before. And I've documented how a transcript of a public meeting was surreptitiously deleted -- and then restored.)
Neither situation benefits the public. In fact, I'd say it calls for a friendly amendment to the Robots Exclusion Protocol: Search engines should ignore robots.txt when a government agency is trying to use it to keep its entire Web site hidden from the public.
Declan McCullagh, CNET News' chief political correspondent, chronicles the intersection of politics and technology. He has covered politics, technology, and Washington, D.C., for more than a decade, which has turned him into an iconoclast and a skeptic of anyone who says, "We oughta have a new federal law against this."





I doubt there is any policy in place regarding search engines and how government sites should prevent being listed.
Your little jab at the government is just crazy talk and that is lame.
Making the claim that government websites are hiding is deceitful. The websites are not hidden, they're on the web. Just because it's not convenient for someone to search them does not equate to hiding. There are plenty of legitimate reasons to block spiders, ranging from performance to security.
I agree you can't just trust the government to do no evil, but to assume an evil motive behind every little thing they do borders on paranoia.
that all government information should be provided to its
citizens in the easiest and most accessible way possible.
Hiding information from search engines is hampering citizens'
attempts at accessing government information. I don't
necessarily believe that the government is hiding information so
that they can change it later, I just think they should be more
accessible and transparent in their actions.
What is lame is that the government has tried to restrict our access to the information.
And I am guessing that those of you who reacted negatively to this story have ties to the feds. And if you deny it, you confirm it. And if you don't deny it, the allegation stands.
Somehow that reminds me of Monty Python and the witch drownings. Oh well.
:D
- Department of History Correction
- by disco-legend-zeke September 7, 2007 8:48 AM PDT
- The right of the public to watch our government is a basis of American Freedoms.
Except in cases of national security, there is no benefit to the People by hiding web content from search engines.
Maybe a few Freedom of Information Act lawsuits are necessary to get these bureaucrats to see the light of day.
I agree, .GOV sites should be spiderable in spite of robots.txt files. Our tax dollars created the sites, we should have easy access to the information, including searchability via our favorite engine.
This is a page straight out of George Orwell's "1984."