January 21, 2009 6:00 AM PST

Much ado about Whitehouse.gov's new openness

by Declan McCullagh

Fans of President Barack Obama, or perhaps just those who dislike former President George W. Bush, seem to think there's something notable about the way the new White House Web site is configured to deal with search engines.

That configuration file is called robots.txt. It's designed to let Webmasters ask search engine robots not to include certain areas of a Web site in their index. Well-behaved robots will comply.
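For readers who have not seen one, here is a minimal sketch of how a well-behaved robot interprets such a file, using Python's standard `urllib.robotparser` module. The file contents, the `www.example.gov` host, and the `is_allowed` helper are invented for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /text/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if a compliant crawler may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network access is involved.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/news/briefing.html", "/search/results", "/text/index.html"):
        url = "http://www.example.gov" + path
        verdict = "allowed" if is_allowed(EXAMPLE_ROBOTS_TXT, url) else "disallowed"
        print(path, "->", verdict)
```

Note that the file is only a request: nothing in it stops a robot that chooses to ignore it.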

The Obama revamp of Whitehouse.gov included a shorter robots.txt file, which Thenextweb.com called "a sign of greater transparency and change." A BoingBoing poster claimed that now "people can find information that was restricted before." And so on.

There's just one problem with these comments. They're wrong. As of Tuesday morning, the Bush administration's robots.txt file did only two things: first, it pointed search engines to the high-graphics versions of the page, as opposed to the text-only versions, and second, it tried to keep type-in-your-search-query pages from being indexed.

Those are legitimate reasons to list those pages in robots.txt, which is why CNET's own file is relatively long and complicated too. (Sites that have been around for eight years or longer tend to get that way.) We ask search engines not to index an "/Ads" directory, e-mail-this-story pages, and dozens of others. The Democrat-controlled House and Senate have--gasp!--substantial robots.txt files too.

It's true that in 2007, the Bush White House did block some files they should not have, which they fixed once I brought it to their attention. They also fixed a more serious problem with the Director of National Intelligence's Web site, and an earlier problem in 2003. (A better solution would be for search engines to ignore overly broad robots.txt files on .gov and .mil sites, including Thomas.loc.gov.)

If anything, Obama's robots.txt file is too short. It doesn't currently block search pages, meaning they'll show up on search engines--something that most site operators don't want and which runs afoul of Google's Webmaster guidelines. Those guidelines say: "Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines."
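To make the point concrete, the sketch below compares two made-up robots.txt files and lists which paths a compliant crawler could still fetch under each. Both files, the paths, and the `crawlable` helper are assumptions for illustration; neither file is the actual Bush or Obama version.

```python
from urllib.robotparser import RobotFileParser

# Both files are invented for illustration; neither is a real White House file.
LONGER_FILE = """\
User-agent: *
Disallow: /search/
Disallow: /email-this-story/
"""

SHORTER_FILE = """\
User-agent: *
Disallow: /includes/
"""


def crawlable(robots_txt, paths, site="http://www.example.gov", user_agent="ExampleBot"):
    """Return the subset of `paths` a compliant crawler may fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if parser.can_fetch(user_agent, site + p)]


if __name__ == "__main__":
    paths = ["/briefings/", "/search/results"]
    print("longer file :", crawlable(LONGER_FILE, paths))
    print("shorter file:", crawlable(SHORTER_FILE, paths))
```

In this toy comparison the shorter file exposes no additional content pages; the only extra thing it lets crawlers index is the auto-generated search results page.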

And here's something sure to upset Obama-praising geeks: the new White House site doesn't pass the litmus test of good HTML design. Alas, according to the W3C, not all pages successfully validate. Those are your tax dollars at work.

P.S.: The White House seems to be using Akamai's Edge Platform for scalable Web hosting:

sh-2.05b$ host whitehouse.gov
whitehouse.gov has address 96.6.250.135
whitehouse.gov mail is handled by 105 mailhub-wh3.whitehouse.gov.
whitehouse.gov mail is handled by 100 mailhub-wh2.whitehouse.gov.
sh-2.05b$ host www.whitehouse.gov
www.whitehouse.gov is an alias for www.whitehouse.gov.edgekey.net.
www.whitehouse.gov.edgekey.net is an alias for e2561.b.akamaiedge.net.
e2561.b.akamaiedge.net has address 96.16.218.135
sh-2.05b$ 
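The inference from that transcript can be sketched in a few lines of Python: follow the "is an alias for" lines from the starting name and look at where the chain ends. The `alias_chain` and `looks_like_akamai` helpers are hypothetical, the input is just the text printed by `host` above (no DNS lookup is performed), and treating the `edgekey.net` and `akamaiedge.net` suffixes as a sign of Akamai hosting is an assumption.

```python
# Lines copied from the `host www.whitehouse.gov` transcript above.
HOST_OUTPUT = """\
www.whitehouse.gov is an alias for www.whitehouse.gov.edgekey.net.
www.whitehouse.gov.edgekey.net is an alias for e2561.b.akamaiedge.net.
e2561.b.akamaiedge.net has address 96.16.218.135
"""

# Assumption: these suffixes indicate Akamai-operated hostnames.
AKAMAI_SUFFIXES = (".edgekey.net", ".akamaiedge.net")


def alias_chain(host_output: str, start: str) -> list:
    """Follow 'X is an alias for Y.' lines from `start` and return the chain of names."""
    aliases = {}
    for line in host_output.splitlines():
        name, sep, target = line.strip().partition(" is an alias for ")
        if sep:
            # Drop the trailing dot that `host` prints after the target name.
            aliases[name] = target.rstrip(".")
    chain = [start]
    # Stop at the end of the chain, or if a name repeats (guards against loops).
    while chain[-1] in aliases and aliases[chain[-1]] not in chain:
        chain.append(aliases[chain[-1]])
    return chain


def looks_like_akamai(chain: list) -> bool:
    """Return True if the final name in the chain ends with an assumed Akamai suffix."""
    return chain[-1].endswith(AKAMAI_SUFFIXES)


if __name__ == "__main__":
    chain = alias_chain(HOST_OUTPUT, "www.whitehouse.gov")
    print(" -> ".join(chain))
    print("Akamai?", looks_like_akamai(chain))
```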

Declan McCullagh, CNET News' chief political correspondent, chronicles the intersection of politics and technology. He has covered politics, technology, and Washington, D.C., for more than a decade, which has turned him into an iconoclast and a skeptic of anyone who says, "We oughta have a new federal law against this."

Showing 1 of 2 pages (46 Comments)
by citizencontact January 21, 2009 7:49 AM PST
Although I agree very strongly that government web sites should be very open, it is unfair to confuse information that is human readable with information that is machine processable. I agree that all web pages and sites should be better designed so that they can be used by software by using sitemap standards (robots, sitemap, URL discovery tools generally) and that all pages should be in valid XHTML or XML. However, most web designers, not to mention the teeming masses, do not understand that a web page looking good in web browsers is not enough. Furthermore, most tools do a lousy job of forcing good web practices. The CIA uses Plone for its content management, which is one of the few tools that generally enforces good practices.

So let's hope that the US Government pushes for only allowing tools that fully comply with Section 508 and all those practices that make web sites human readable and machine processable. Italy has cracked down on bad tools and bad sites for government web sites. Here in the US, GSA should enforce the same rules and not allow any government tools that allow for bad practices.

You didn't mention not allowing server-specific URLs, which is another very bad practice (e.g. pages that expose the .php, .aspx, .asp, .cfm, etc. within URLs). Or having XML files without stylesheets for human readability (including RSS feeds). Or moving forward with new standards, like microformats. I hope that this new administration understands the possibilities of having all sites human and machine readable, and understands how this provides the maximum openness.
Daniel Bennett
http://www.advocatehope.org/
by codesmith January 21, 2009 7:51 AM PST
Well good grief, aren't you just a cheery fellow! The tone of your article makes me think you were just scavenging around looking for bones to pick. What you say may be true, but it's hardly worth an article.
by declan00 January 21, 2009 8:11 AM PST
Nope. I wouldn't have written about this (I wrote an article about whitehouse.gov yesterday that didn't mention robots.txt, though I had looked at the file) if there hadn't been misinformation floating around.
by jahf January 21, 2009 8:11 AM PST
One of the most nitpicky subjects I've seen in a long time.
by MSSlayer January 21, 2009 8:47 AM PST
Not only nitpicky but laughably ironic.

You might want to get on those inept morons CNET hired to design their webpages.

This page full of blather and nonsense isn't even valid XHTML transitional, much less valid strict.

335 errors, 137 warnings for direct URI testing. Many of those errors are a sign that whoever wrote it gets HTML 4.01 and XHTML 1 confused. Lots of old attributes that are no longer valid are used.

You need to complain about how your website sucks.
by bj1126 January 21, 2009 9:03 AM PST
The difference is that a government website has, or is supposed to have (rules don't seem to apply to Obama anyways), much stricter requirements for design compliance.
by MSSlayer January 21, 2009 9:06 AM PST
All sites should pass the W3C validators.

The difference is CNET's websites are old and should have all the errors worked out.

Declan should know better than to throw stones while living in a glass house.

Let's see what whitehouse.gov "developers" do over the next month.
by PhaseDMA January 21, 2009 9:12 AM PST
I'm curious. Would you have checked if the author had not pointed it out? It doesn't much matter because he did point it out so I'm pretty sure he is aware of the issue.

All the author is doing is pointing out people are saying one thing, and in reality don't have a clue as to the merit of what they have to say.
by MSSlayer January 21, 2009 9:17 AM PST
No, but the results aren't surprising.

So what if I wouldn't have?

Does that let Declan off the hook?
by mikeburek January 21, 2009 9:01 AM PST
Declan is a reporter and he reports the facts on the subject. He wrote a good article comparing the previous and current robots.txt files, and making a comparison to other sites.

People are already talking about this, so there is nothing wrong with him searching for the facts on rumors. It is a good thing.

This is not nit picking. People are paid lots of money to go through the same decision process that Declan just laid out.

TV's Mythbusters is built on listening to rumors and finding the facts. Declan just didn't have a film crew.

Thanks for the article.
by pentest January 21, 2009 7:41 PM PST
A text file is irrelevant.

It is a humorous article, but somehow I don't think that was his goal.
by michaelo1966 January 21, 2009 9:13 AM PST
Declan, for those who don't know, is the source of the "Al Gore invented the Internet" rumor (real quote that's not nearly as interesting: he took the initiative in preserving federal funding of it, which Declan likely understood at the time but reported incorrectly anyway). Many believe there's a strong right-wing bias to these "reports" and, unfortunately, they're reported as news without context, though sometimes they're skewed enough they'll use an opinion disclaimer.

Obama's site, as any casual observer recognizes, has more information than Bush's. Period. They'll no doubt adjust their crawling strategies to optimize, just like any site administrator does. Bush clearly just stuck out a minimal template robots file then ignored it.

There's no news here except that one can imply Obama has more up-to-date, hands-on engineers running the White House website. Hopefully we won't see "shocking" news stories for the next four years that read something like "White House operatives using government computers to receive email lists of groceries from their spouses to pick up on their drive home!"
by Thought Nozzle January 21, 2009 9:50 AM PST
When a reader sees the headline about openness, then gets slammed with "There's just one problem with these comments. They're wrong." ...We're led to think that the Obama administration's version of the site is "less open" than the Bush version, or at least "only as open as". So, naturally, the reader thinks this has something to do with the content posted on the site.

But no... It turns out that Declan has another burr under his saddle about robots.txt. Not about a real story like a true comparison of the information that the old and new administrations have posted.

What's worse is the tease in the CNET Morning News Dispatch email, which led me here:

"President Obama's new White House Web site has been lauded for being more open than former President Bush's. There's just one problem with that theory: it's wrong"

So, naturally, I wanted to know what they meant, thinking that CNET was accusing the Obama administration of hiding documents, or limiting the amount of information that they put up on the site, or obfuscating and stonewalling.

There's only one problem with the tease, and with the lede of the story: They're deceptive.

The underlying problem is that Declan and/or his editors have decided to use teasers and ledes to pump up a tiny nit-pick over robots.txt into something that sounds like an allegation of fascist control of the truth. Tabloid-style journalism [sic] won't help CNET's credibility, and in the tech community -- rife with skeptics and those interested in actual facts -- it will cost you readers.

If there's a life to be gotten, Declan and CNET clearly need to be pointed in its direction.
by pentest January 21, 2009 7:18 PM PST
Thank you for pointing out the obvious.

Declan and CNET should be ashamed.
by Thought Nozzle January 22, 2009 12:06 AM PST
Pentest - You're welcome for pointing out the obvious... While you and I easily see it as such, it seems that CNET and the author need to have the obvious pointed out to them.
by shralpmeister January 21, 2009 9:50 AM PST
Sheesh, it figures CNET would have a neocon as its chief political correspondent.

Just visit http://whitehouse.gov. Every American should be proud of this site.

CNET, I think it's time to get with the times and refresh your political correspondent.
by fool4jesus January 21, 2009 10:25 AM PST
Did you even read the article before going off on your anti-conservative screed? He was specifically responding to the other "news" stories fawning about how open Obama was because of his robots.txt file.
by dream_fly January 21, 2009 9:59 AM PST
I don't think openness is defined by what's in robots.txt, but rather by the actual accessible contents that the site contains. A good comparison should be on the actual contents before and after. Otherwise it's just geek talk.
by Apolune January 21, 2009 10:06 AM PST
It's only January, yet McCullagh is already the favorite to win Yutz of the Year for this piece. (And that haircut.)
by secretvan January 21, 2009 10:33 AM PST
OK, Obama was just put into power yesterday and you are expecting miracles overnight? Give it a few weeks and then revisit this article and see if it is valid.

One thing I worry about is the American people are expecting Obama to change the World. I admire the goals he has put forward and I think he is going to do great things but the Americans have to have realistic expectations of what can be done.
by January 21, 2009 10:45 AM PST
Excellent reporting. Exhilaration is great, but there's a fine line between being happy and kidding yourself. And although the site redesign is pretty, I see little difference, content-wise, between today and last week at whitehouse.gov--except that a lot of documents (current and archival) are no longer available. I'll give the administration a short-term break on that, since I know everything from any administration is archived during the transition.
But Al Gore started the rumor about inventing the internet, not any reporter. I've seen the video of him saying it, and I find it hard to believe anyone posting here hasn't seen it--and he didn't say he'd funded it, he said he and a colleague "invented a little thing called the internet."
by Pete Bardo January 21, 2009 11:49 AM PST
You've seen the video? Where? I'd really like to see that. Was that on a tv show or what? I'm no liberal, and certainly no conservative Republican, but I was under the impression that Al never said he invented the internet. Maybe you could post a link to the video you claim to have seen.
by pentest January 21, 2009 7:20 PM PST
Al Gore never said that.

He took credit for pushing funding for it.

I don't know if Declan was the one responsible for pushing the lie, but it was someone in the media.
by yamanoor January 21, 2009 11:05 AM PST
I am sure there are cooties in some White House toilet that weren't fixed. Let's go sit on those, or rather not...the article was probably okay up until the robots.txt file.
by jrjiminy January 21, 2009 11:47 AM PST
Yawn!
by TogetherinParis January 21, 2009 11:49 AM PST
www.change.gov was the transition website and they seem to be utilizing many of those features on the www.whitehouse.gov website, so it is indeed a far better portal already. Improvement after one day is nice. Give them credit, the Obama team seems bent on making further improvements.

Similar improvements were made to the White House switchboard early in the Clinton administration, but the Republicans hired people to use modems to call in constantly and ruined it for everyone. Since the slush funds have since dried up for that sort of petty harassment (dirty tricks, they were called), we can hope for a better government mechanism where more than the powerful can get their views expressed.
by joshma January 21, 2009 12:24 PM PST
Although HTML validation is usually important from a web designer's point of view, I don't see how it's a big deal when it comes to "spending tax money." The page is readable, and they're not using HTML frames or marquees - it's quite acceptable. This page on CNET seems to have 271 errors when validated - what a waste of investor money!
by JACAS16 January 21, 2009 12:36 PM PST
So what does it matter if W3C shows some errors for the White House website? That proves nothing.

I just don't think there's one perfect website.

Check out what the W3C has to say about CNET.COM and other websites.

http://validator.w3.org/check?uri=http%3A%2F%2Fwww.CNET.COM&charset=(detect+automatically)&doctype=Inline&group=0&user-agent=W3C_Validator%2F1.606

MICROSOFT.COM

http://validator.w3.org/check?uri=http%3A%2F%2Fwww.MICROSOFT.com%2F&charset=(detect+automatically)&doctype=Inline&group=0&user-agent=W3C_Validator%2F1.606

MIT.EDU

http://validator.w3.org/check?uri=http%3A%2F%2Fwww.mit.edu&charset=(detect+automatically)&doctype=Inline&group=0&user-agent=W3C_Validator%2F1.606
by pentest January 21, 2009 7:26 PM PST
MS is certainly not an example of a standards-compliant organization, and a school is a school.

There are web designers who can create dynamic pages that validate 100% every time. It really isn't that hard and should be standard, but too many "web developers" aren't qualified.

IMO, a browser should put up an error page whenever it gets a poorly written html file. Quirks mode is the worst idea in tech ever. It encourages and excuses sloppy work.

Imagine if a compiler or interpreter just guessed at what was meant when it comes across a syntax error. You think software is bad now?
by karpenterskids January 21, 2009 12:55 PM PST
I've never even heard of robots.txt before...but now I feel like my eyes are being opened to something huge. Very interesting indeed.
by pentest January 21, 2009 7:26 PM PST
It is a text file. Nothing more, nothing less.
by doshomik January 21, 2009 12:56 PM PST
http://validator.w3.org/check?uri=http%3A%2F%2Fwww.whitehouse.gov%2F&charset=%28detect+automatically%29&doctype=Inline&group=0
by tkarmadragon January 21, 2009 1:04 PM PST
Somebody was comparing a shorter robots.txt file to greater political transparency? Whaaat?!

Man, the Obama fanaticism is insane. Even Jesus is starting to feel jealous.
by pentest January 21, 2009 7:27 PM PST
You have it backwards.

Declan seems to be implying that transparency is worse today than last week, based purely on the size of a text file.

I do agree with your sentiment: "Whaaat?!"