SEO: Robots.txt Files - A Beginner's Guide



Sebastian was nice enough to help me create an idiot’s guide to Robots.txt…

Well, the “idiot’s version” will lack some interesting details, but it will get you started. Robots.txt is a plain text file. You must not edit it with HTML editors, word processors, or any application other than a plain text editor like vi (OK, notepad.exe is allowed too). You shouldn’t embed images and such, and any other HTML code is strictly forbidden.

Why shouldn’t I edit it with my Dreamweaver FTP client, for instance?

Because all those fancy apps insert useless crap like formatting, HTML code and whatnot. Most probably, search engines aren’t capable of interpreting a robots.txt file like:

DOCTYPE text/plain PUBLIC 
"-//W3C//DTD TEXT 1.0 Transitional//Swahili" 
"http://www.w3.org/TR/text/DTD/plain1-transitional.dtd"> 
{\b\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid11089941 
User-agent: Googlebot}
{ \lang2057\langfe1031\langnp2057\insrsid6911344\charrsid11089941 \line 
Disallow: / \line Allow: }{\cs15\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095
{\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095 content}{ \cs15\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095 /} ...

 

 

(Ok Ok, I’ve made up this example, but it represents the raw contents of text files saved with HTML editors and word processors.)

Where do I put robots.txt?

Robots.txt resides in the root directory of your Web space, which is either a domain or a subdomain, for example

"/web/user/htdocs/example.com/robots.txt"

resolving to

http://example.com/robots.txt.

Can I use robots.txt in subdirectories?

Of course you’re free to create robots.txt files in all your subdirectories, but you shouldn’t expect search engines to request or obey those. If you for some weird reason use subdomains like crap.example.com, then example.com/robots.txt is not a suitable instrument to steer crawling of those subdomains, so ensure each subdomain serves its own robots.txt (e.g. http://crap.example.com/robots.txt).

When you upload your robots.txt, make sure to do it in ASCII mode. Your FTP client usually offers “ASCII|Auto|Binary” – choose “ASCII” even when you’ve used an ANSI editor to create it.

Why?

Because plain text files contain ASCII content only. Sometimes standards that say “upload *.htm, *.php, *.txt, .htaccess and *.xml files in ASCII mode to prevent them from being inadvertently corrupted during the transfer, stored with invalid EOL codes, and so on” do make sense. (You’ve asked for the idiot version, didn’t you?)

What if I am on a free host?

If you’re on a free host, robots.txt is not for you. Your hosting service will create a read-only robots.txt “file” that’s suited to steal even more of your traffic than the ads you can’t remove from your headers and footers. Now, if you’re still interested in the topic, you must learn how search engines work to understand what you can achieve with a robots.txt file and what’s just myths posted on your favorite forum.

Sebastian, do you know how search engines work, then?

Yep, to some degree. ;) Basically, a search engine has three major components: A crawler that burns your bandwidth fetching your unchanged files over and over until you’re belly up. An indexer that buries your stuff unless you’re Matt Cutts or blog on a server that gained search engine love making use of the cruelest black hat tactics you can think of. A query engine that accepts search queries and pulls results from the search index but ignores your stuff coz you’re neither me nor Matt Cutts.

What goes into the robots.txt file?

Your robots.txt file contains useful but pretty much ignored statements like

 # Please don't crawl this site during our business hours!

(the crawler is not aware of your time zone and doesn’t grab your office hours from your site), as well as actual crawler directives. In other words, everything you write in your robots.txt is a directive for crawlers (dumb Web robots that can fetch your contents but nothing more), not for indexers (highly sophisticated algorithms that rank only brain farts from Matt and me).

Currently, besides the User-agent: line that opens each section, there are only three statements you can use in robots.txt:

Disallow: /path
Allow: /path
Sitemap: http://example.com/sitemap.xml

Some search engines support other directives like “crawl-delay”, but that’s utter nonsense, hence you can safely ignore those.

The content of a robots.txt file consists of sections dedicated to particular crawlers. If you’ve nothing to hide, then your robots.txt file looks like:


 User-agent: *
 Disallow:
 Allow: /
 Sitemap: http://example.com/sitemap.xml 

 

If you’re comfortable with Google but MSN scares you, then write:


 User-agent: *
 Disallow:

 User-agent: Googlebot
 Disallow:

 User-agent: msnbot
 Disallow: /

 

Please note that you must terminate every crawler section with an empty line. You can gather the names of crawlers by visiting a search engine’s Webmaster section.

From the examples above you’ve learned that each search engine has its own section (at least if you want to hide anything from a particular SE), that each section starts with a


 User-agent: [crawler name]

line, and that each section is terminated with a blank line. The user agent name “*” stands for the universal Web robot; that means if your robots.txt lacks a section for a particular crawler, that crawler will follow the “*” directives, and when you have a section for a particular crawler, it will ignore the “*” section. In other words, if you create a section for a crawler, you must duplicate all statements from the “all crawlers” (“User-agent: *”) section before you edit that copy, as shown below.
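
For example, a minimal sketch (the /cgi-bin/ and /images/ paths are made up for illustration): if everybody is banned from /cgi-bin/ and you want Googlebot to additionally stay out of /images/, you must repeat the /cgi-bin/ line, because Googlebot reads only its own section:

 User-agent: *
 Disallow: /cgi-bin/

 User-agent: Googlebot
 Disallow: /cgi-bin/
 Disallow: /images/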

Now to the directives. The most important crawler directive is


 Disallow: /path

“Disallow” means that a crawler must not fetch contents from URIs that match “/path”. “/path” is either a relative URI or a URI pattern (“*” matches any string and “$” marks the end of a URI). Not all search engines support wildcards; for example, MSN lacks any wildcard support (they might grow up some day).

URIs are always relative to the Web space’s root, so if you copy and paste URLs, remove the http://example.com part, but not the leading slash.
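
Here’s a short sketch putting wildcards and relative URIs together (the paths are invented for illustration, and remember that not every engine understands the wildcards):

 User-agent: Googlebot
 Disallow: /private/
 Disallow: /*?
 Disallow: /*.pdf$

The first line blocks everything under /private/, the second blocks every URI that contains a question mark, and the third blocks every URI that ends in “.pdf”.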

 

 Allow: /path

refines Disallow: statements, for example

 User-agent: Googlebot 
 Disallow: / 
 Allow: /content/

allows crawling only within http://example.com/content/

 

Sitemap: http://example.com/sitemap.xml

points search engines that support the sitemaps protocol to your sitemap submission file.

Please note that all robots.txt directives are crawler directives that don’t affect indexing. Search engines do index disallow’ed URLs, pulling title and snippet from foreign sources, for example ODP (DMOZ, the Open Directory) listings or the Yahoo directory. Some search engines provide a method to remove disallow’ed contents from their SERPs on request.

Say I want to keep a file or folder out of Google. Exactly what would I need to do?

You’d check each HTTP request for Googlebot and serve it a 403 or 410 HTTP response code, or put a “noindex,noarchive” Googlebot meta tag on the page:

 <meta name="Googlebot" content="noindex,noarchive" />

Robots.txt blocks with Disallow: don’t prevent indexing. Don’t block crawling of pages that you want to have deindexed, unless you want to use Google’s robots.txt based URL terminator every six months.
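
If you go the HTTP response route, the check might look something like this Python CGI-style sketch (a hypothetical, minimal version; matching on the User-Agent header alone is spoofable, so a real setup should also verify the crawler’s IP, as described further down):

 import os

 def serve_forbidden_to_googlebot():
     # Minimal sketch: answer a CGI request with a 403 when the
     # User-Agent header claims to be Googlebot. Header checks are
     # spoofable, so verify the crawler's IP in any real setup.
     user_agent = os.environ.get("HTTP_USER_AGENT", "")
     if "Googlebot" in user_agent:
         print("Status: 403 Forbidden")
         print("Content-Type: text/plain")
         print()
         print("403 Forbidden")
         return True
     return False  # caller serves the page as usual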

If someone wants to know more about robots.txt, where do they go?

Honestly, I don’t know a better resource than my brain, partly dumped here. I even developed a few new robots.txt directives and posted a request for comments a few days ago. I hope that Google, the one and only search engine that seriously invests in evolving REP, won’t ignore that post because of the sneakily embedded “Google bashing”. I plan to write a few more posts, not that technical and with real-world examples.

Can I ask you how you auto-generate and mask robots.txt, or is that not for idiots? Is that even ethical?

Of course you can ask, and yes, it’s for everybody and 100% ethical. It’s a very simple task; in fact, it’s plain cloaking. The trick is to make the robots.txt file a server-side script, then check all requests for verified crawlers and serve the right contents to each search engine. A smart robots.txt even maintains crawler IP lists and stores raw data for reports. I recently wrote a manual on cloaked robots.txt files at the request of a loyal reader.
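
The crawler verification part of such a script could look roughly like this Python sketch (a hypothetical illustration of the reverse/forward DNS check the engines recommend, without the caching, logging and reporting a smart version would add):

 import socket

 def is_verified_googlebot(ip):
     # Reverse DNS: the PTR name of a genuine Googlebot IP ends in
     # googlebot.com or google.com.
     try:
         host = socket.gethostbyaddr(ip)[0]
         if not host.endswith((".googlebot.com", ".google.com")):
             return False
         # Forward DNS: the name must resolve back to the same IP,
         # otherwise the PTR record could be faked.
         return socket.gethostbyname(host) == ip
     except socket.error:
         return False

A cloaked robots.txt script branches on checks like this one, for instance serving a crawler-specific file to verified crawlers and a generic one to everybody else, and storing the raw request data for reports.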

Think Disney will come after you for your avatar, now that you’re famous after being interviewed on the Hobo blog?

I’m sure they will try, since your blog will become an authority on grumpy red crabs called Sebastian. I’m not too afraid though, because as icon/avatar I use only a tiny thumbnailed version of an image created by a designer who – hopefully – didn’t scrape it from Disney. If they become nasty, I’ll just pay a license fee and change my avatar on all social media sites, but I doubt that’s necessary. To avoid such hassles I bought an individually drawn red crab from an awesome cartoonist last year. That’s what you see on my blog, and I use it as my avatar as well, at least with new profiles.

Who do you work for?

I’m a freelancer loosely affiliated with a company that sells IT consulting services in several industries. I do Web developer training, software design / engineering (mostly the architectural tasks), and grab development / (technical) SEO projects myself to educate yours truly. I’m a dad of three little monsters, working at home. If you want to hire me, drop me a line. ;)

Sebastian, a big thanks for slapping me about Robots.txt and indeed for helping me craft the Idiot’s Guide To Robots.txt. I certainly learned a lot from talking to you for a day, and I hope some others can learn from this snippet article. You’re a gentleman spammer. :)

If you enjoyed this step-by-step guide for beginners, you can take your knowledge to the next level at http://sebastians-pamphlets.com/

What Google says about robots.txt files

A robots.txt file restricts access to your site by search engine robots that crawl the web. These bots are automated, and before they access pages of a site, they check to see if a robots.txt file exists that prevents them from accessing certain pages. (All respectable robots will respect the directives in a robots.txt file, although some may interpret them differently. However, a robots.txt is not enforceable, and some spammers and other troublemakers may ignore it. For this reason, we recommend password protecting confidential information.)

To see which URLs Google has been blocked from crawling, visit the Blocked URLs page of the Health section of Webmaster Tools.

You need a robots.txt file only if your site includes content that you don’t want search engines to index. If you want search engines to index everything in your site, you don’t need a robots.txt file (not even an empty one).

While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.


8 Responses

  1. Maki says:

    Stone Roses is awesome! Definitely one of the best Brit Pop bands ever. Ian Brown’s solo stuff is not bad either. Sorry I can’t comment about the robots.txt stuff… too geeky for me :)

  2. Tim Nash says:

    I hope you feel totally educated :) It’s worth pointing out that a) robots.txt is not always followed by all crawl engines, and b) it can cause some interesting problems with dangling pages. Like all things, they should be used with care, or at least with understanding, and make sure that if you are using multiple methods to control crawls, they are not conflicting with each other.

  3. sebastian says:

    Shaun, that was a lot of fun and well worth a sleepless night. :) Tim, of course you’re right, there are lots of pitfalls (I could fill a book or two) when it comes to steering of search engine crawling and indexing, but this idiot guide will get folks started. From my Webmaster support activities I know that basic tutorials like this one are double edged swords, but hey, who will pay our bills if all advice is free and nicely gathered so that every freeloader can grab it? Good link by the way. :)

  4. JohnMu says:

    That was a fun and interesting interview! Thanks for putting that together, guys.

  5. Shaun Anderson says:

    Sebastian, the pleasure was all mine and the story’s about to go hot on Sphinn, so it was well worth it… I hope you and your wife forgive me for keeping you up all night, and that you made your appointment this morning with the kids :) John, cheers :) Apologies about the typos in this article, but every time I edit it in WP it screws up the code examples Sebastian thoughtfully supplied, so I’m letting it fly as is.

  6. sebastian says:

    Shaun, it is totally impossible to sleep when a crowd of yelling monsters dances in your bed – I grabbed at least a short nap by the way so it’s all fine with me. Of course I got up to feed, dress and distribute them to school and kindergarten. Nothing to forgive, I’ve enjoyed it. As for the typos, I know the secret procedure to edit them out. Hint: it involves a plain text editor. ;)

  7. I wanna be adored says:

    Good interview, I’d been struggling to compartmentalise REP and noindex but this interview finally nailed it for me. Nice one. I hadn’t thought about generating robots.txt but now I’ve read Sebastian’s article I’ll definitely bear it in mind. One problem is that only the friendly, well-behaved crawlers obey robots.txt; the destructive spam-bots, scrapers, referrer-spammers and bug-ridden college projects typically ignore it. This means it’s a case of diminishing returns investing in a generator.

  8. Search Engine Land: News About Search Engines & Search Marketing says:

    SearchCap: The Day In Search, January 10, 2008… Below is what happened in search today, as reported on Search Engine Land and from other places across the web…


