Beginners Guide To Robots.txt Files | Sebastian X Sebastians Pamphlets

Hobo – Right, Sebastian! What do you think you are doing, calling me out on a slight bit of “misinformation” in a post I made for a bit of branding? Just who do you think you are, spamming my content with useful, original and interesting content?

Don’t you realize that about 1,500 stumblers and Twitter users visited my site as a result of this slapping? You trying to discredit me? :)

Sebastian – Howdy Shaun, I’m so sorry that I discredited you, that was really not my intention.

I couldn’t resist coz robots.txt is kind of a pet peeve of mine. Thanks for the opportunity to spam your neat blog with my links, er, thoughts, though. :)

Hobo – That post was about how expert SEO people were using Robots.txt. I should have put a disclaimer at the bottom saying I didn’t know a thing about Robots.txt files and that I had nicked mine some time ago from Michael Gray and forgot about it. And spam my blog all you like with that kind of content, although I’ve got Lucia’s Linky Love installed so generally Spam doesn’t get much of a foothold about these parts (actually I am not even sure if that is working properly).

OK – you seem to know what you’re on about when it comes to robots.txt. Fancy educating me and the Hobo team as to what you’ve learned and know about these often misunderstood files? You know, all that stuff that took you years to learn? Let me have it….now!

Hobo – WTF is a Robots.txt file, Sebastian, in simple idiot’s terms?

Well, the “idiot’s version” will lack interesting details, but it will get you started.

Robots.txt is a plain text file. You must not edit it with HTML editors, word processors, or any application other than a plain text editor like vi (OK, notepad.exe is allowed too). You shouldn’t embed images and such, and any HTML code is strictly forbidden.

Hobo – Why shouldn’t I edit it with my Dreamweaver FTP client, for instance?

Because all those fancy apps insert useless crap like formatting, HTML code and whatnot. Most probably search engines aren’t capable of interpreting a robots.txt file like:
<!DOCTYPE text/plain PUBLIC "-//W3C//DTD TEXT 1.0 Transitional//Swahili" "http://www.w3.org/TR/text/DTD/plain1-transitional.dtd">
{\b\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid11089941 User-agent: Googlebot}{ \lang2057\langfe1031\langnp2057\insrsid6911344\charrsid11089941 \line Disallow: / \line Allow: }{\cs15\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095 /}{\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095 content}{ \cs15\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095 /} ...

(Ok Ok, I’ve made up this example, but it represents the raw contents of text files saved with HTML editors and word processors.)

Hobo – Where Do I put the damn thing?

Robots.txt resides in the root directory of your Web space, that’s either a domain or a subdomain, for example “/web/user/htdocs/example.com/robots.txt” resolving to http://example.com/robots.txt.

Hobo – Can I use Robots.txt in subdirectories?

Of course you’re free to create robots.txt files in all your subdirectories, but you shouldn’t expect search engines to request or obey those. If you for some weird reason use subdomains like crap.example.com, then example.com/robots.txt is not exactly a suitable instrument to steer crawling of the subdomains, hence ensure each subdomain serves its own robots.txt.
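To make the “one robots.txt per domain or subdomain” rule concrete, here is a minimal Python sketch that works out which robots.txt governs a given page. The helper name robots_txt_url is made up for this illustration, and it assumes you feed it an absolute URL.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that governs page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any subdomain); drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://example.com/some/dir/page.html"))
print(robots_txt_url("http://crap.example.com/blog/?p=1"))
```

Whatever subdirectory the page lives in, the answer always points at the root of its own host, which is why a robots.txt placed in a subdirectory is never consulted.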

When you upload your robots.txt, make sure to do it in ASCII mode. Your FTP client usually offers “ASCII|Auto|Binary”; choose “ASCII” even when you’ve used an ANSI editor to create it.

Hobo – Why?

Because plain text files contain ASCII content only. Sometimes standards that say “upload *.htm *.php *.txt .htaccess *.xml files in ASCII mode to prevent them from inadvertent corruption during the transfer, storing with invalid EOL codes, etc.” do make sense. (You asked for the idiot version, didn’t you?)
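If you want to double check a robots.txt before or after uploading, a rough sanity check along these lines might help. This is only a sketch: the function name plain_text_problems and the particular checks (byte order mark, non-ASCII bytes, odd line endings, leading markup) are my assumptions about what “not plain text” looks like, not a rule any search engine publishes.

```python
def plain_text_problems(data):
    """List reasons why raw robots.txt bytes may not be clean plain text (hypothetical helper)."""
    problems = []
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("starts with a UTF-8 byte order mark")
    if any(byte > 127 for byte in data):
        problems.append("contains non-ASCII bytes")
    # Whatever is left after removing CRLF pairs shows the stray line endings.
    without_crlf = data.replace(b"\r\n", b"")
    if b"\r" in without_crlf:
        problems.append("contains bare CR line endings")
    if b"\r\n" in data and b"\n" in without_crlf:
        problems.append("mixes CRLF and LF line endings")
    # Files saved by HTML editors or word processors tend to start with markup.
    if data.lstrip().startswith((b"<", b"{\\")):
        problems.append("looks like HTML or RTF markup")
    return problems


print(plain_text_problems(b"User-agent: *\nDisallow:\n"))
print(plain_text_problems(b"<!DOCTYPE html>\r\nUser-agent: *\n"))
```

Pass it the raw bytes, for example from open("robots.txt", "rb").read(); an empty list means none of these particular problems were found.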

Hobo – What about if I am on a Free Host?

If you’re on a free host, robots.txt is not for you. Your hosting service will create a read-only robots.txt “file” that’s suitable to steal even more traffic than its ads that you can’t remove from your headers and footers.

Now, if you’re still interested in the topic, you must learn how search engines work to understand what you can achieve with a robots.txt file and what’s just a myth posted on your favorite forum.

Hobo – Sebastian, Do you know how search engines work, then?

Yep, to some degree. ;) Basically, a search engine has three major components:

  1. A crawler that burns your bandwidth fetching your unchanged files over and over until you’re belly up.
  2. An indexer that buries your stuff unless you’re Matt Cutts or blog on a server that gained search engine love making use of the cruelest black hat tactics you can think of.
  3. A query engine that accepts search queries and pulls results from the search index but ignores your stuff coz you’re neither me nor Matt Cutts.

Hobo – What goes into the robots.txt file?

Your robots.txt file contains useful but pretty much ignored statements like
# Please don't crawl this site during our business hours!

(the crawler is not aware of your time zone and doesn’t grab your office hours from your site), as well as actual crawler directives. In other words, everything you write in your robots.txt is a directive for crawlers (dumb Web robots that can fetch your contents but nothing more), not indexers (highly sophisticated algorithms that rank only brain farts from Matt and me).

Hobo – I say index, you say crawl. You say tomato, I say….ah! I see!

Currently, besides the “User-agent:” line that opens each crawler section, there are only three statements you can use in robots.txt:

  1. Disallow: /path
  2. Allow: /path
  3. Sitemap: http://example.com/sitemap.xml

Some search engines support other directives like “crawl-delay”, but that’s utter nonsense, hence you can safely ignore those.

The content of a robots.txt file consists of sections dedicated to particular crawlers. If you’ve nothing to hide, then your robots.txt file looks like:
User-agent: *
Disallow:
Allow: /
Sitemap: http://example.com/sitemap.xml

If you’re comfortable with Google but MSN scares you, then write:
User-agent: *
Disallow:

User-agent: Googlebot
Disallow:

User-agent: msnbot
Disallow: /

Please note that you must terminate every crawler section with an empty line. You can gather the names of crawlers by visiting a search engine’s Webmaster section.

From the examples above you’ve learned that each search engine has its own section (at least if you want to hide anything from a particular SE), that each section starts with a
User-agent: [crawler name]

line, and that each section is terminated with a blank line. The user agent name “*” stands for the universal Web robot, that means that if your robots.txt lacks a section for a particular crawler, it will use the “*” directives, and that when you’ve a section for a particular crawler, it will ignore the “*” section. In other words, if you create a section for a crawler, you must duplicate all statements from the “all crawlers” (“User-agent: *”) section before you edit the code.
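The “own section beats the * section” rule can be tried out with Python’s standard urllib.robotparser module. Take this as a sketch only: it shows how that one parser reads a file, which is not necessarily how every search engine’s crawler behaves, and the robots.txt content and the crawler name SomeOtherBot are invented for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one section for everybody, one for Googlebot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot has its own section, so the "*" rule for /private/ does not apply to it.
print(parser.can_fetch("Googlebot", "http://example.com/private/page.html"))     # True
print(parser.can_fetch("Googlebot", "http://example.com/tmp/page.html"))         # False
# A crawler without its own section falls back to the "*" section.
print(parser.can_fetch("SomeOtherBot", "http://example.com/private/page.html"))  # False
```

Note how Googlebot may fetch /private/ here because the Googlebot section does not repeat that Disallow line. That is exactly why you need to copy the “all crawlers” statements into every crawler specific section.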

Now to the directives. The most important crawler directive is
Disallow: /path

“Disallow” means that a crawler must not fetch contents from URIs that match “/path”. “/path” is either a relative URI or a URI pattern (“*” matches any string and “$” marks the end of a URI). Not all search engines support wildcards, for example MSN lacks any wildcard support (they might grow up some day).

URIs are always relative to the Web space’s root, so if you copy and paste URLs then remove the http://example.com part but not the leading slash.
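Here is a minimal Python sketch of both ideas: cutting a copied URL down to its root relative path, and testing a path against a pattern where “*” matches any string and “$” marks the end. The function names url_to_path and pattern_matches are made up for this example, and the regular expression translation is my assumption of how a wildcard capable crawler might do it, not any engine’s actual code.

```python
import re
from urllib.parse import urlsplit


def url_to_path(url):
    """Strip scheme and host, keep the leading slash (hypothetical helper)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return path + ("?" + parts.query if parts.query else "")


def pattern_matches(pattern, path):
    """True if path matches a robots.txt style pattern with "*" and "$"."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each "*" match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += r"\Z"  # "$" in the pattern means the URI must end here
    # re.match anchors at the start, so a plain "/path" acts as a prefix match.
    return re.match(regex, path) is not None


print(url_to_path("http://example.com/content/page.html"))
print(pattern_matches("/*.pdf$", "/docs/file.pdf"))
print(pattern_matches("/*.pdf$", "/docs/file.pdf?download=1"))
```

A pattern without wildcards simply behaves as a prefix, so “/private” matches “/private/page.html” as well as “/private-stuff.html”.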

Allow: /path
refines Disallow: statements, for example
User-agent: Googlebot
Disallow: /
Allow: /content/

allows crawling only within http://example.com/content/
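How a crawler decides between a matching Disallow and a matching Allow is worth a small sketch. The Python below assumes the “most specific rule wins” approach, where the longest matching path decides and Allow wins a tie. That is the precedence Google documents and the one RFC 9309 describes; some parsers instead apply the first matching rule in file order, so treat this as one possible reading. The function is_allowed and the rule tuples are invented for the illustration, and wildcards are left out to keep it short.

```python
def is_allowed(rules, path):
    """rules: ("allow" | "disallow", path_prefix) pairs from one crawler section."""
    best_kind, best_len = "allow", -1  # no matching rule at all means crawling is allowed
    for kind, prefix in rules:
        if prefix == "":
            continue  # an empty "Disallow:" line blocks nothing
        if not path.startswith(prefix):
            continue
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and kind == "allow"
        if longer or tie_for_allow:
            best_kind, best_len = kind, len(prefix)
    return best_kind == "allow"


# The Googlebot section from the example above.
googlebot_rules = [("disallow", "/"), ("allow", "/content/")]
print(is_allowed(googlebot_rules, "/content/page.html"))  # True
print(is_allowed(googlebot_rules, "/other/page.html"))    # False
```

Under this reading, “/content/” is longer than “/”, so the Allow line carves the content directory out of the blanket Disallow.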

Sitemap: http://example.com/sitemap.xml
points search engines that support the sitemaps protocol to the submission files.

Please note that all robots.txt directives are crawler directives that don’t affect indexing. Search engines do index disallow’ed URLs pulling title and snippet from foreign sources, for example ODP (DMOZ – The Open Directory) listings or the Yahoo directory. Some search engines provide a method to remove disallow’ed contents from their SERPs on request.

Hobo – Say I want to keep a file / folder out of Google. Exactly what would I need to do?

You’d check each HTTP request for Googlebot and serve it a 403 or 410 HTTP response code. Or put a “noindex,noarchive” Googlebot meta tag on the page:
<meta name="Googlebot" content="noindex,noarchive" />
Robots.txt blocks with Disallow: don’t prevent indexing. Don’t block crawling of pages that you want to have deindexed, unless you want to use Google’s robots.txt based URL terminator every six months.
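The first option, serving Googlebot a 403, could look roughly like this as a tiny Python WSGI application. It is a sketch under loud assumptions: the /secret/ folder and the page body are invented, and the crawler is detected with a plain substring test on the User-Agent header, which anybody can fake, so a real setup would want to verify the crawler as well.

```python
BLOCKED_PREFIX = "/secret/"  # hypothetical folder to keep out of Google


def app(environ, start_response):
    """Minimal WSGI app: 403 for Googlebot inside BLOCKED_PREFIX, 200 otherwise."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    path = environ.get("PATH_INFO", "/")
    # Assumption: substring match only; User-Agent strings can be spoofed.
    if "googlebot" in user_agent.lower() and path.startswith(BLOCKED_PREFIX):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden\n"]
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<html><body>Stuff for humans</body></html>\n"]
```

To try it locally you could serve it with wsgiref.simple_server.make_server("", 8000, app).serve_forever() from the standard library. Note that this only works when the folder is not disallowed in robots.txt, because the crawler has to request the URL to see the 403.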

Hobo – Where online do you hang out?

At Sphinn and Google’s Webmaster Help Group. For the latter some folks call me a slimy Google groupie, but I can perfectly live with that. Google’s SEO forum is a nice place to help noobs and discuss interesting topics as well.

Hobo – Who do you read every day/week?

Oh well. That’s a very long list. Probably the OPML file would be too large to email. I read (sometimes skim) my friends’ posts daily, or at least weekly when I’m swamped. I guess the best way to get a grip on my reading preferences is my shared feed, my list of stumbles, bookmarks, and sphinns.

Hobo – Tell me who your favourite music band is? Mine is the Stone Roses, have you heard of them?

Today that’s Ten Years After, yesterday it was Bob Dylan. Stone Roses is not on my radar, maybe I missed out on a great band?

Hobo – What else are you interested in online?

Tough question. What can a lonely geek do online? Viewing porn of course. Seriously, I consume more technical stuff than smut.

Hobo – I’ll send you a couple of links complete with free passwords I confiscated off my Managing Director, Michael ;)

Can’t wait for this list. If it contains passwords from one of my adult sites I’ll sue Michael! ;)

Hobo – If someone wants to know more about robots.txt, where do they go?

Honestly, I don’t know a better resource than my brain, partly dumped here. I even developed a few new robots.txt directives and posted a request for comments a few days ago. I hope that Google, the one and only search engine that seriously invests in evolving the REP (Robots Exclusion Protocol), will not ignore this post because of the sneakily embedded “Google bashing”. I plan to write a few more posts, not that technical and with real world examples.

Hobo – Can I ask you how you auto generate and mask robots.txt, or is that not for idiots? Is that even ethical?

Of course you can ask, and yes, it’s for everybody and 100% ethical. It’s a very simple task, in fact it’s plain cloaking. The trick is to make the robots.txt file a server-side script. Then check all requests for verified crawlers and serve the right contents to each search engine. A smart robots.txt even maintains crawler IP lists and stores raw data for reports. I recently wrote a manual on cloaked robots.txt files at the request of a loyal reader.
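A bare bones version of such a script might look like the Python sketch below. Everything in it is an assumption made for illustration rather than the code from the manual: the per crawler contents, the function names, and the hostname suffixes used for the reverse DNS check. The verification step does a reverse lookup of the requesting IP, checks the hostname, then does a forward lookup to confirm it points back to the same IP.

```python
import socket

# Hypothetical per-crawler robots.txt bodies.
ROBOTS_BY_CRAWLER = {
    "googlebot": "User-agent: Googlebot\nDisallow:\n",
    "msnbot": "User-agent: msnbot\nDisallow: /\n",
}
DEFAULT_ROBOTS = "User-agent: *\nDisallow: /\n"

# Assumed hostname suffixes; check each engine's Webmaster docs for the real ones.
TRUSTED_SUFFIXES = {
    "googlebot": (".googlebot.com", ".google.com"),
    "msnbot": (".search.msn.com",),
}


def reverse_dns_verified(ip, suffixes):
    """Reverse lookup, suffix check, then a forward lookup that must return the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        return host.endswith(suffixes) and ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False


def robots_txt_for(user_agent, ip, verify=reverse_dns_verified):
    """Pick the robots.txt body for a request; unverified visitors get the default."""
    ua = user_agent.lower()
    for name, body in ROBOTS_BY_CRAWLER.items():
        if name in ua and verify(ip, TRUSTED_SUFFIXES[name]):
            return body
    return DEFAULT_ROBOTS
```

The verify parameter is there so the lookup can be swapped out, for example with a cached IP list, or with a stub while testing. Logging each request before returning would give you the raw data for reports.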

Hobo – Think Disney will come after you for your avatar now you are famous after being interviewed on the Hobo blog?

I’m sure they will try it, since your blog will become an authority on grumpy red crabs called Sebastian. I’m not too afraid though, because I use only a tiny thumbnailed version of an image created by a designer who –hopefully– didn’t scrape it from Disney, as icon/avatar. If they become nasty, I’ll just pay a license fee and change my avatar on all social media sites, but I doubt that’s necessary. To avoid such hassles I bought an individually drawn red crab from an awesome cartoonist last year. That’s what you see on my blog, and I use it as avatar as well, at least with new profiles.

Hobo – What’s your day job? Who do you work for?

I’m a freelancer loosely affiliated with a company that sells IT consulting services in several industries. I do Web developer training, software design / engineering (mostly the architectural tasks), and grab development / (technical) SEO projects myself to educate yours truly. I’m a dad of three little monsters, working at home. If you want to hire me, drop me a line. ;)

Sebastian, a big thanks for slapping me about Robots.txt and indeed for helping me craft the Idiot’s Guide To Robots.txt. I certainly learned a lot from talking to you for a day, and I hope some others can learn from this snippet article. You’re a gentleman spammer. :)


Sebastian is somewhat of a celebrity around the search engine marketing sphere. Check out his blog when you can while it’s free and before he does an internet marketing ninja / seomoz samurai and starts charging you for it. Hope you liked it – Shaun the Internet Marketing Hobo!

 


Written by Shaun Anderson Hobo

18 Responses to “Beginners Guide To Robots.txt Files | Sebastian X Sebastians Pamphlets”

  1. Shaun Anderson says:

    I just noticed: I cannot believe you’ve never heard of the Stone Roses?! Sebastian, just how uncool are you :)

    I’ll send you a couple of MP3s and those codes you are so interested in ;)

  2. Maki says:

    Stone Roses is awesome! Definitely one of the best Brit Pop bands ever. Ian Brown’s solo stuff is not bad as well.

    Sorry I can’t comment about the Robot.txt stuff… too geeky for me :)

  3. Shaun Anderson says:

    Hats off Maki! At last someone in this industry with a bit of good taste!

  4. Tim Nash says:

    I hope you feel totally educated :)
    It’s worth pointing out that a) robots.txt is not always followed by all crawl engines, and b) it can cause some interesting problems with dangling pages. Like all things, it should be used with care, or at least with understanding, and if you are using multiple methods to control crawls, make sure they are not conflicting with each other.

  5. sebastian says:

    Shaun, that was a lot of fun and well worth a sleepless night. :)

    Tim, of course you’re right, there are lots of pitfalls (I could fill a book or two) when it comes to steering of search engine crawling and indexing, but this idiot guide will get folks started. From my Webmaster support activities I know that basic tutorials like this one are double edged swords, but hey, who will pay our bills if all advice is free and nicely gathered so that every freeloader can grab it? Good link by the way. :)

  6. JohnMu says:

    That was a fun and interesting interview! Thanks for putting that together, guys.

  7. Shaun Anderson says:

    Sebastian, The pleasure was all mine and the story’s about to go hot on Sphinn so it was well worth it…..I hope you and your wife forgive me for keeping you up all night and you made your appointment this morning with the kids :)

    John, cheers :)

    Apologies about the typos in this article but every time I edit it in WP it screws up Sebastian’s thoughtfully supplied code examples, so I’m letting it fly as is.

  8. sebastian says:

    Shaun, it is totally impossible to sleep when a crowd of yelling monsters dances in your bed – I grabbed at least a short nap by the way so it’s all fine with me. Of course I got up to feed, dress and distribute them to school and kindergarten. Nothing to forgive, I’ve enjoyed it. As for the typos, I know the secret procedure to edit them out. Hint: it involves a plain text editor. ;)

  9. Good interview, I’d been struggling to compartmentalise REP and noindex but this interview finally nailed it for me. Nice one.

    I hadn’t thought about generating robots.txt but now I’ve read Sebastian’s article I’ll definitely bear it in mind.

    One problem is that only the friendly, well-behaved crawlers obey robots.txt; the destructive spam-bots, scrapers, referrer-spammers and bug-ridden college projects typically ignore it. This means it’s a case of diminishing returns investing in a generator.

  10. SearchCap: The Day In Search, January 10, 2008…

    Below is what happened in search today, as reported on Search Engine Land and from other places across the web…….

  11. Paul-S says:

    Great stuff Shaun and Sebastian.

    That was well put together, funny and interesting. :)

  12. Shaun Anderson says:

    Cheers Paul – Who would have thought Robots.txt could have been so entertaining? :)

  13. kadin sitesi says:

    yess great post thanks

  14. Sebastian is quite diabolical! This post made my head hurt.

  15. Great stuff Shaun and Sebastian.

    That was well put together, funny and interesting.

  16. Jared says:

    I have a question. As far as I know, search engine robots never crawl sections that require a password. So what does the robots file serve for? If there is material that you do not want to open to the public, just set a password and access rights, then ….

  17. ridhoyp says:

    i still didnt understand about robot.txt ?!@^#$^@

  18. Thanks for this post, I’m just sorry to have been so slow to pick up on it! I’ll be keeping a closer eye on my robots.txt file contents from now on for sure!
