Fans of President Barack Obama, or perhaps just those who dislike former President George W. Bush, seem to think there's something notable about the way the new White House Web site is configured to deal with search engines.
That configuration file is called robots.txt. It's designed to let Webmasters ask search engine robots not to include certain areas of a Web site in their index. Well-behaved robots will comply.
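The format is simple: a well-behaved crawler fetches /robots.txt before indexing a site and skips any path the file disallows. Here is a minimal, hypothetical example (the paths are illustrative, not any site's actual rules):

```
# Rules for every crawler
User-agent: *
# Ask robots to skip these directories
Disallow: /cgi-bin/
Disallow: /private/
```

Note that this is a request, not an access control: the file keeps polite robots out of those paths, but does nothing to stop a crawler that ignores it.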
The Obama revamp of Whitehouse.gov included a shorter robots.txt file, which Thenextweb.com called "a sign of greater transparency and change." A BoingBoing poster claimed that now "people can find information that was restricted before." And so on.
There's just one problem with these comments. They're wrong. As of Tuesday morning, the Bush administration's robots.txt file did only two things: first, it pointed search engines to the high-graphics versions of the page, as opposed to the text-only versions, and second, it tried to keep type-in-your-search-query pages from being indexed.
Those are legitimate reasons to list those pages in robots.txt, which is why CNET's own file is relatively long and complicated too. (Sites that have been around for eight years or longer tend to get that way.) We ask search engines not to index an "/Ads" directory, e-mail-this-story pages, and dozens of others. The Democrat-controlled House and Senate have--gasp!--substantial robots.txt files too.
It's true that in 2007, the Bush White House did block some files they should not have, which they fixed once I brought it to their attention. They also fixed a more serious problem with the Director of National Intelligence's Web site, and an earlier problem in 2003. (A better solution would be for search engines to ignore overly broad robots.txt files on .gov and .mil sites, including Thomas.loc.gov.)
If anything, Obama's robots.txt file is too short. It doesn't currently block search pages, meaning they'll show up on search engines--something that most site operators don't want and which runs afoul of Google's Webmaster guidelines. Those guidelines say: "Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines."
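Fixing that would take one Disallow line. The sketch below, using Python's standard-library urllib.robotparser, shows how a compliant crawler would treat such a rule; the rule itself is a suggestion, not the site's actual file:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks auto-generated search pages,
# per Google's Webmaster guidelines. (Illustrative, not the real file.)
rules = [
    "User-agent: *",
    "Disallow: /search",
]

parser = RobotFileParser()
parser.parse(rules)

# A well-behaved crawler checks each URL before fetching it.
print(parser.can_fetch("*", "https://www.whitehouse.gov/search?q=budget"))  # False
print(parser.can_fetch("*", "https://www.whitehouse.gov/blog"))             # True
```

With that one rule in place, search-result pages would drop out of crawlers' queues while ordinary content pages remain indexable.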
And here's something sure to upset Obama-praising geeks: the new White House site doesn't pass the litmus test of good HTML design. Alas, according to the W3C's markup validator, not all of its pages validate successfully. Those are your tax dollars at work.
P.S.: The White House seems to be using Akamai's Edge Platform for scalable Web hosting:
sh-2.05b$ host whitehouse.gov
whitehouse.gov has address 18.104.22.168
whitehouse.gov mail is handled by 105 mailhub-wh3.whitehouse.gov.
whitehouse.gov mail is handled by 100 mailhub-wh2.whitehouse.gov.
sh-2.05b$ host www.whitehouse.gov
www.whitehouse.gov is an alias for www.whitehouse.gov.edgekey.net.
www.whitehouse.gov.edgekey.net is an alias for e2561.b.akamaiedge.net.
e2561.b.akamaiedge.net has address 22.214.171.124
sh-2.05b$