March 29, 2008 11:52 AM PDT

Analyze, create robots.txt files in Google

by Brian R. Brown

Google's Webmaster Central has become a very important resource for anyone who has a Web site, works on a Web site, or, like SEO practitioners, helps others with their Web sites.

Google continues to roll out new features and improve existing ones, and it has just done a little of both by adding a Generate robots.txt function.

Google had previously added a robots.txt analyzer, which at this point is still the more useful of the two tools. For those who aren't aware, the robots exclusion protocol tells search engines how to interact with a Web site. There are a number of directives available, but the main purpose of the robots.txt file is to tell search engines which content a site owner doesn't want their robots to crawl.

Why in the world would you not want search engines to crawl any of your content? You may have content that, for whatever reason, you don't want others to find through search results. Note, however, that this is not the same as secure information that requires authentication through a log-in.

Your site may have its own search function that creates "search results" for your site. Search engines generally do not want to include search results within their own search results, so this content may not be returned for searches on the engines anyway. Disallowing it lets you focus the crawlers elsewhere and use their visits more efficiently.

Or you may have duplicate content issues that you could use robots.txt to filter out. This is especially common with a content management system (CMS) that creates a separate printer-friendly page.
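To make the two cases above concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The paths /search and /print/ and the example.com URLs are assumptions for illustration only; your site search or CMS will likely use different ones.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical paths: /search for on-site search results,
# /print/ for CMS-generated printer-friendly pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /print/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if the robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With these rules, is_crawlable(ROBOTS_TXT, "https://www.example.com/print/story-1") should come back False, while an ordinary article URL stays crawlable. Keep in mind that Python's parser does simple prefix matching, so it may not mirror every engine's pattern-matching extensions.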

Regardless of your specific needs, having a robots.txt file can be important to a site. Rarely is there a site that can't benefit from disallowing at least some content. Even if you have nothing to disallow, you may want to take advantage of the auto-discovery feature for your XML sitemap. Finally, depending on your server log system or analytics package, not having a robots.txt file can be problematic if it inflates your "404 File Not Found" error reporting, which can happen because search engine spiders will request the robots.txt file automatically when they come to your site.
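As a rough illustration of sitemap auto-discovery, the sketch below pairs an allow-everything rule with a Sitemap line and reads it back with Python's standard urllib.robotparser (the site_maps() method needs Python 3.8 or later, and the type hint assumes 3.9 or later). The sitemap URL is a placeholder, not a real file.

```python
from urllib.robotparser import RobotFileParser

# Nothing is disallowed; the file exists only to advertise the sitemap.
# The URL below is a hypothetical placeholder.
ROBOTS_WITH_SITEMAP = """\
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml
"""


def declared_sitemaps(robots_txt: str) -> list[str]:
    """Return the sitemap URLs declared in the robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []
```

A file like this also gives spiders something to fetch when they request robots.txt, which is what keeps those requests out of your 404 reports.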

Right now, the robots.txt generator is rather basic, and I hope that Google will add more features to it going forward. Currently, site owners have to paste in URLs and URL patterns to build the file. It would be great if the tool provided a list of URLs or patterns extracted from the site, which would help automate the procedure for anyone not familiar with the protocol.
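For readers who would rather script the file than paste patterns into a form, here is a small sketch of the same idea. It is not how Google's generator works; build_robots_txt is a hypothetical helper, and the paths passed to it are whatever you decide to block.

```python
def build_robots_txt(disallow_paths, sitemap_url=None, agent="*"):
    """Assemble robots.txt text from a list of path prefixes to block."""
    lines = [f"User-agent: {agent}"]
    if disallow_paths:
        lines += [f"Disallow: {path}" for path in disallow_paths]
    else:
        # An empty Disallow value means nothing is blocked.
        lines.append("Disallow:")
    if sitemap_url:
        # Optional Sitemap line, separated from the rules by a blank line.
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"
```

Calling build_robots_txt(["/search", "/print/"]) returns a three-line file you can save as robots.txt at the root of your site, ideally after running it through the analyzer first.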

There is more information about the protocol, though a bit more on the technical side, at the robotstxt.org site, and you can find more engine-specific information on crawling and robots.txt from Google, Yahoo, MSN, and Ask.com.

One important tip is that the following directive tells all spiders they are allowed to go anywhere:

User-agent: *
Disallow:

More importantly, watch out for the following directive, which I sometimes see when I think people really wanted the one above:

User-agent: *
Disallow: /

The latter tells the spiders to stay out of the entire site--clearly two very different results, so be sure you understand which does what.
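If you want to check which is which before publishing, a quick sketch with Python's standard urllib.robotparser shows the difference. The crawler name ExampleBot and the example.com URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty value: crawl anything
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # single slash: crawl nothing


def can_crawl(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Here can_crawl(ALLOW_ALL, "https://www.example.com/any/page") is True, and the same call with BLOCK_ALL is False. A single character separates the two files.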

Brian Brown is a Consultant & Natural Search Marketing Strategist for Netconcepts. He is a member of the CNET Blog Network, and is not an employee of CNET. Disclosure.
About Searchlight

Search engine optimization expert Stephan Spencer and analysts from Netconcepts share late-breaking SEO tools, tips, trends, resources, news and insights. Stephan is the founder and president of Netconcepts, a web agency specializing in search engine optimized ecommerce. Clients include Discovery Channel, AOL, Home Shopping Network, Verizon SuperPages.com, and REI, to name a few. Stephan is a frequent speaker at Internet conferences around the globe. He is also a Senior Contributor to MarketingProfs.com, a monthly columnist for Practical Ecommerce, and he's been a contributor to DM News, Multichannel Merchant, Catalog Success, Catalog Age, and others. The blog is part of the CNET Blog Network and the authors are not employees of CNET. Disclosure.
