Robots.txt Best Practices For Beginners

After more than two decades immersed in the world of SEO and website development, I know that files like robots.txt can seem intimidating. They feel like the domain of developers, full of strange syntax and the potential to break your entire website with a single misplaced character.

This topic is one I’ve been discussing for a long time.

Back in 2008, I published an interview on this blog with my friend (then well-known SEO) Sebastian’s Pamphlets, breaking down the basics of robots.txt. It was a fun piece that even received a comment from a then-lesser-known Googler named John Mueller, who said, “That was a fun and interesting interview! Thanks for putting that together, guys.”

A lot has changed since 2008, and Sebastian’s Pamphlets is no longer with us, but the core principles of robots.txt we discussed have remained remarkably stable.

My mission with this updated replacement guide is to revisit those fundamentals, incorporating some of Sebastian’s timeless advice with the strategic lessons I’ve learned from over 20 years of hands-on experience. We’re not just going to cover the “what” – the syntax and the rules. We’re going to focus on the far more important “why.” Why does this simple text file matter so much for your site’s health? How does it relate to your business goals?

Think of your robots.txt file as the very first conversation you have with search engine crawlers like Googlebot. It’s the fundamental dialogue that sets the rules of engagement. It’s how you ensure your digital storefront has clear signage and open doors for your most important customer: Google. Let’s make sure you’re giving them the right instructions.

What is a Robots.txt File (and What It Is NOT)?

At its core, a robots.txt file is a plain text file that lives at the root of your domain. You can see ours by typing https://www.hobo-web.co.uk/robots.txt into your browser. Its primary, officially stated purpose is to manage crawler traffic to your site, typically to prevent your server from being overwhelmed with requests [1].

However, the most important lesson I can teach you is about what robots.txt is NOT. Understanding this will save you from some of the most common and costly SEO mistakes. As Sebastian wisely pointed out all those years ago, “all robots.txt directives are crawler directives that don’t affect indexing.”

A robots.txt file is NOT a reliable method for:

Preventing a page from being indexed.
Securing private or sensitive content.

This is a critical distinction. The Disallow directive in a robots.txt file is a polite request, not an enforceable command. While reputable crawlers like Googlebot will honour it, many others will not [1].

More importantly, as Google’s own documentation warns, if a page you’ve disallowed is linked to from another website, Google can still find and index that URL [1]. It won’t crawl the page, so the search result will appear without a description, but the URL itself can still show up. If you truly want to keep a page out of Google’s search results, the correct method is to allow crawling and use a noindex meta tag on the page itself [1]. For securing sensitive information, you must always use proper server-side security, like password protection.
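To make the noindex approach concrete, here is a minimal Python sketch that checks whether a page's HTML carries a robots meta tag containing noindex. The sample page and the names RobotsMetaParser and has_noindex are illustrative assumptions of mine, not part of any Google tooling. Bear in mind that a crawler can only see this tag if the page is not disallowed in robots.txt.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            # content looks like "noindex, follow"; split it into tokens.
            tokens = attr_map.get("content", "").split(",")
            self.directives.extend(t.strip().lower() for t in tokens if t.strip())


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page that should stay out of search results.
sample_page = """
<html><head>
  <title>Thank you</title>
  <meta name="robots" content="noindex, follow">
</head><body>Thanks for your order.</body></html>
"""

print(has_noindex(sample_page))                                      # True
print(has_noindex("<html><head><title>Home</title></head></html>"))  # False
```

A check like this only inspects the HTML. It says nothing about whether the URL is also blocked in robots.txt, which would stop Googlebot from ever reading the tag.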

This leads to a crucial point that many beginners miss: your robots.txt file is a public document. Anyone, including your competitors or malicious actors, can view it to understand the structure of your website [2]. If you include a line like Disallow: /admin-portal/, you are not hiding that directory; you are putting up a public signpost that points directly to it. Never use robots.txt as a security measure.

Crawl Budget & Your Site’s Technical Health

So, if robots.txt isn’t for blocking indexing, what is its main strategic purpose for SEO? The answer is crawl budget optimisation.

Google allocates a finite amount of resources (time and computing power) to exploring any given website. We call this the “crawl budget”. For small websites, this is rarely an issue. But for larger sites, especially e-commerce stores with thousands of pages and complex filtering options, it’s a critical factor.

A poorly configured (or non-existent) robots.txt file can lead to catastrophic “Crawl Budget Wastage” on larger sites. This happens when Googlebot spends the majority of its time crawling thousands of low-value, duplicate, or unimportant URLs. Think of pages generated by faceted navigation (sorting by price, colour, size), internal search results, or endless tag and archive pages.

When Googlebot is busy with this digital clutter, it may never get around to crawling your most important, revenue-generating pages. Those pages then languish, uncrawled and unranked.

Your robots.txt file is your primary tool for actively managing this budget. By using it correctly, you can guide Googlebot away from the unimportant sections and direct its full attention to the content that truly matters to your business.

If you have a smaller site (one with fewer than hundreds of thousands of pages), this is much less of an issue.
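If you want a rough feel for how much of a crawl is going to low-value URLs, a short Python sketch like the one below can help. Everything specific in it is an assumption for illustration: the parameter names in FACET_PARAMS, the path prefixes in LOW_VALUE_PREFIXES and the sample URLs are hypothetical, and on a real site you would feed in URLs from your own server logs or crawl export.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical markers of low-value URLs; adjust these to your own site.
FACET_PARAMS = {"sort", "colour", "size", "price"}
LOW_VALUE_PREFIXES = ("/search", "/tag/")


def is_low_value(url: str) -> bool:
    """Heuristic: internal search, tag archives, or faceted-navigation URLs."""
    parts = urlparse(url)
    if parts.path.startswith(LOW_VALUE_PREFIXES):
        return True
    # parse_qs returns a dict keyed by query parameter name.
    return any(param in FACET_PARAMS for param in parse_qs(parts.query))


# Hypothetical sample of URLs a crawler requested.
crawled_urls = [
    "https://www.example.com/shoes/",
    "https://www.example.com/shoes/?sort=price",
    "https://www.example.com/shoes/?colour=red&size=9",
    "https://www.example.com/search?q=boots",
    "https://www.example.com/tag/sale/",
    "https://www.example.com/shoes/red-trainer.html",
]

wasted = [url for url in crawled_urls if is_low_value(url)]
share = len(wasted) / len(crawled_urls)
print(f"{len(wasted)} of {len(crawled_urls)} crawled URLs look low-value ({share:.0%})")
# 4 of 6 crawled URLs look low-value (67%)
```

If a large share of crawler requests falls into these buckets, those are the sections worth considering for Disallow rules.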

The Anatomy of a Robots.txt File: A Simple Breakdown

A robots.txt file is made up of one or more “groups.” Each group applies to a specific crawler (a User-agent) and contains a set of rules (Allow or Disallow) that tell that crawler what it can and cannot access.

Here is a simple, scannable breakdown of the most common directives you will use.

Directive	Purpose	Example
User-agent	Specifies which crawler the following rules apply to. The asterisk (*) is a wildcard for all bots.	User-agent: Googlebot
Disallow	Tells the user-agent not to crawl a specific file or directory path.	Disallow: /private/
Allow	Overrides a Disallow rule for a specific sub-directory or file. It’s used to create exceptions.	Allow: /private/public.html
Sitemap	States the location of your XML sitemap. This is optional but a highly recommended best practice.	Sitemap: https://www.example.com/sitemap.xml

A Deeper Look at the Directives

User-agent: This is the first line of any rule group. It identifies the specific bot you’re giving instructions to. While you can target individual bots like Googlebot or Bingbot, the most common entry is User-agent: *, where the asterisk acts as a wildcard that applies the rules to all crawlers.
Disallow: This is the instruction to block. The path that follows must be relative to the root domain (it must start with a /). For example, to block the folder https://yourdomain.com/notes/, the rule would be Disallow: /notes/.
Allow: This directive is less common but very powerful. It’s used to create an exception to a Disallow rule. For instance, you might disallow an entire folder but want to allow access to a single file within it.
Sitemap: While not a crawling rule, including a link to your XML sitemap is a crucial best practice. It acts as a clear map for search engines, showing them all the URLs you want them to discover and index. It’s a fundamental part of learning how to get Google to crawl and index your website.
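To see these directives in action, here is a small Python sketch using the standard library's urllib.robotparser module. The rules and URLs are made-up examples, and site_maps() needs Python 3.8 or later. One caveat: as far as I'm aware, Python's parser applies rules in file order (first match wins), which can differ from how Google resolves overlapping Allow and Disallow rules, so this example sticks to Disallow only.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt with one group and a sitemap reference.
robots_txt = """\
User-agent: *
Disallow: /notes/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The "*" group applies to any crawler that has no group of its own.
notes_ok = parser.can_fetch("Googlebot", "https://www.example.com/notes/draft.html")
about_ok = parser.can_fetch("Googlebot", "https://www.example.com/about/")

print(notes_ok)            # False: /notes/ is disallowed
print(about_ok)            # True: no rule matches /about/
print(parser.site_maps())  # ['https://www.example.com/sitemap.xml']
```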

Building Your First Robots.txt File: A Step-by-Step Guide

Ready to create your own? Some of the best advice on this is timeless.

Step 1: Create the File. As Sebastian correctly stated in our 2008 interview, this is a non-negotiable first step: “Robots.txt is a plain text file. You must not edit it with HTML editors, word processors, nor any applications other than a plain text editor”. Using a program like Microsoft Word will add formatting characters that will break the file. Stick to Notepad (on Windows) or TextEdit (on Mac, switched to plain-text mode via Format > Make Plain Text), and save the empty file with the exact name robots.txt. Ensure the encoding is set to UTF-8.
Step 2: Add Your Rules. For a standard WordPress site or a small business website, a simple and safe starting point is often best. Copy and paste the following rules into your text file:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://yourdomain.com/sitemap_index.xml

This configuration blocks crawlers from the WordPress admin area (which they don’t need to see) but specifically allows access to the admin-ajax.php file, which is sometimes needed for proper page rendering. Remember to replace the sitemap URL with your own.
Step 3: Upload the File. The location is just as important as the content. Sebastian’s advice from 2008 is still the golden rule: “Robots.txt resides in the root directory of your Web space”. This means it must be accessible at https://yourdomain.com/robots.txt. If you place it in a subfolder, search engines will not find it, and it will be completely ignored.
Step 4: Test Your File. Before and after you make any changes, you must test your file. You can use the “URL Inspection” tool within Google Search Console to see if a specific URL is blocked for Googlebot. There are also many third-party robots.txt validator tools available online that can help you check for syntax errors.
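It helps to understand why the Allow line in Step 2 wins over the broader Disallow. Google's documented behaviour is to apply the most specific matching rule, judged by the length of the rule's path, and to prefer the less restrictive rule when two conflict. The Python sketch below is my own simplified illustration of that precedence, not Google's implementation: the function name is_allowed is hypothetical, it handles plain path prefixes only (no * or $ wildcards), and it expects the rules for a single user-agent group.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Return True if `path` may be crawled under `rules`.

    `rules` holds (directive, path_prefix) pairs for ONE user-agent group.
    Simplified: prefix matching only, no wildcard support.
    """
    best_len = -1
    allowed = True  # no matching rule means crawling is allowed
    for directive, prefix in rules:
        # An empty Disallow value blocks nothing, so skip it.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # The longer (more specific) prefix wins; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The rules from Step 2.
wordpress_rules = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

for path in ["/wp-admin/", "/wp-admin/admin-ajax.php", "/blog/my-post/"]:
    verdict = "allowed" if is_allowed(path, wordpress_rules) else "blocked"
    print(path, "->", verdict)
# /wp-admin/ -> blocked
# /wp-admin/admin-ajax.php -> allowed
# /blog/my-post/ -> allowed
```

Treat a sketch like this as a learning aid. For the real answer on a live URL, the Google Search Console check described in Step 4 is the authority.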

Common Mistakes That Can Harm Your SEO (And How to Avoid Them)

Over my career, I’ve seen a few simple robots.txt errors cause devastating damage to a site’s visibility. Here are the most common ones to watch out for.

Mistake 1: The Accidental Full Block (Disallow: /)
This is the most dangerous mistake. A single line, Disallow: /, tells every crawler to stay away from your entire website. I’ve seen this happen countless times when a developer creates a robots.txt file to block a staging or development site and then accidentally pushes that same file to the live production server, making the entire site invisible overnight. Always double-check your file before uploading.
Mistake 2: Blocking CSS and JavaScript Files
In the early days of SEO, it was common to block crawlers from resource folders like /css/ or /js/ to “save” crawl budget. This is now a terrible idea. Google needs to render your pages – to see them as a user would – to properly understand their content and layout. Blocking these essential resource files can lead to Google misinterpreting your site as broken, low-quality, or not mobile-friendly.
Mistake 3: Forgetting Case Sensitivity
The paths in a robots.txt file are case-sensitive. This means a rule like Disallow: /My-Folder/ will not block access to /my-folder/. A simple typo in capitalisation can render your rule completely ineffective.
Mistake 4: Blocking Canonicalised URLs
This is a more advanced but critical error. Let’s say you have a duplicate page (page-B.html) and you add a canonical tag to it that points to the original version (page-A.html). If you then block page-B.html in your robots.txt, Google will never be able to crawl it to see the canonical instruction. This defeats the entire purpose of managing your duplicate content and canonicalisation signals effectively.
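Several of these mistakes can be caught with a quick automated check before a file goes live. The Python sketch below is a hypothetical helper of my own (lint_robots_txt is not a standard tool) that flags three of the problems above: a full-site block, blocked /css/ or /js/ folders, and paths containing capitals that may not match your real URLs. The checks are rough heuristics and will not catch everything, including Mistake 4.

```python
def lint_robots_txt(text: str) -> list[str]:
    """Return warnings for a few common robots.txt mistakes."""
    warnings = []
    agents = []             # user-agents of the group currently being read
    last_was_agent = False
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # Consecutive User-agent lines belong to the same group.
            if not last_was_agent:
                agents = []
            agents.append(value)
            last_was_agent = True
            continue
        last_was_agent = False
        if field != "disallow":
            continue
        if value == "/":
            who = ", ".join(agents) or "no user-agent"
            warnings.append(f"line {line_no}: Disallow: / blocks the entire site for {who}")
        elif value.lower().startswith(("/css/", "/js/")):
            warnings.append(f"line {line_no}: {value} blocks resources Google may need for rendering")
        if value != value.lower():
            warnings.append(f"line {line_no}: {value} contains capitals; paths are case-sensitive")
    return warnings


# Hypothetical file that was meant for a staging site.
staging_file = """\
User-agent: *
Disallow: /
Disallow: /css/
Disallow: /My-Folder/
"""

for warning in lint_robots_txt(staging_file):
    print(warning)
# line 2: Disallow: / blocks the entire site for *
# line 3: /css/ blocks resources Google may need for rendering
# line 4: /My-Folder/ contains capitals; paths are case-sensitive
```

Running a check like this as part of your deployment routine is one way to stop a staging file reaching the live server unnoticed.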

Your Next Steps in Mastering Technical SEO

Congratulations. By understanding the robots.txt file, you’ve mastered a foundational piece of technical SEO. You’re no longer just creating content; you’re actively managing the conversation between your website and the search engines that determine its success.

The perfect next step on your journey is to download my free Beginner SEO guide. It’s a comprehensive guide that builds on these technical fundamentals, covering everything from content strategy and link earning to keyword research, all grounded in a modern, “people-first” approach that works today.

Key Takeaways

To summarise the most critical points from this guide:

robots.txt manages crawling (what search engines can look at), not indexing (what they show in results). Use noindex tags to reliably keep pages out of search results.
Your file must be named exactly robots.txt and be placed in the root directory of your domain; otherwise, it will be ignored.
Never block essential resources like CSS or JavaScript files. Google needs to see your page as a user does.
Always test your robots.txt file after making changes to avoid the catastrophic mistake of accidentally blocking your entire site.
A clean, correct robots.txt file is a small but important signal of a professionally managed, trustworthy website, which is foundational to E-E-A-T.

Frequently Asked Questions (FAQ)

What’s the difference between blocking in robots.txt and using a ‘noindex’ tag?

robots.txt tells Google “don’t even look at this page.” A noindex tag tells Google “you can look at this page, but don’t show it in search results.” If you block a page in robots.txt, Google might still index its URL if it finds links to it from other sites. If you want to guarantee a page is removed from the index, you must allow Google to crawl it so it can see the noindex tag.

How do I add my sitemap to my robots.txt file?

Simply add a new line anywhere in the file (though typically at the top or bottom for clarity) with the following format: Sitemap: https://www.yourdomain.com/sitemap.xml. Be sure to replace the example URL with the actual, full URL of your sitemap.

Can a mistake in robots.txt really make my whole site invisible to Google?

Yes, absolutely. A single line, Disallow: /, will instruct all compliant search engine crawlers not to crawl any pages on your site. This is one of the most common and devastating technical SEO mistakes, often happening when a development site’s settings are accidentally moved to a live site.

Concluding Summary

Mastering the fundamentals is the key to building a resilient, long-term SEO strategy that stands the test of time. The robots.txt file is one of those fundamentals. By taking control of this simple file, you are taking a professional approach to managing your site’s health and aligning your goals with Google’s guidelines. You are telling the world’s most powerful search engine exactly how you want it to interact with your property, ensuring its resources are focused on the content you’ve worked so hard to create.

My friend Sebastian, a technically-minded peer who helped shape the original 2008 interview, had a knack for cutting through the noise. His advice then is just as critical now for anyone wanting to keep a page out of Google: “Robots.txt blocks with Disallow: don’t prevent from indexing. Don’t block crawling of pages that you want to have deindexed.”

Author Bio: Shaun Anderson is the founder of Hobo Web and has been a professional website designer, developer and SEO since 2001. With over two decades of experience, his work focuses on ethical, ‘white hat’ strategies that align with Google’s guidelines to deliver sustainable results. You can read more about his journey on his full bio page.

Works cited

Robots.txt Introduction and Guide | Google Search Central | Documentation, accessed September 23, 2025, https://developers.google.com/search/docs/crawling-indexing/robots/intro
How Search Engines Work: Crawling, Indexing, and Ranking – Beginner’s Guide to SEO, accessed September 23, 2025, https://moz.com/beginners-guide-to-seo/how-search-engines-operate
Robots.txt: Best Practices for SEO – Nightwatch.io, accessed September 23, 2025, https://nightwatch.io/blog/robots-txt-best-practices-for-seo/
Robots.txt Files: The Complete Guide to Implementation and Best Practices, accessed September 23, 2025, https://www.basesearchmarketing.com/blog/robots-txt-files-the-complete-guide-to-implementation-and-best-practices/
Crawl Budget Management For Large Sites | Google Search Central | Documentation, accessed September 23, 2025, https://developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget
How to optimize your crawl budget – Yoast, accessed September 23, 2025, https://yoast.com/crawl-budget-optimization/
Maximizing SEO in 2024: The Role of Crawl Budget Optimization | by Tomas Laurinavicius, accessed September 23, 2025, https://medium.com/@tomaslau/maximizing-seo-in-2024-the-role-of-crawl-budget-optimization-a16d51a77bf7
Technical SEO Audit using the Hobo SEO Dashboard – Hobo, accessed September 23, 2025, https://www.hobo-web.co.uk/technical-seo-audit-using-the-hobo-seo-dashboard/
What Is a Crawling Budget, and How Can You Optimize Website Scans? – Netpeak Software, accessed September 23, 2025, https://netpeaksoftware.com/blog/what-is-a-crawling-budget-and-how-can-you-optimize-website-scans
Hobo SEO Auditor, accessed September 23, 2025, https://www.hobo-web.co.uk/
Hobo SEO Dashboard Multi-Site in Google Sheets, accessed September 23, 2025, https://www.hobo-web.co.uk/seo-dashboard/
The Clients Tab in Hobo SEO Dashboard, accessed September 23, 2025, https://www.hobo-web.co.uk/hobo-seo-dashboard-clients-tab/
Create and Submit a robots.txt File | Google Search Central | Documentation, accessed September 23, 2025, https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt
How to get Google to crawl and index your website quickly, accessed September 23, 2025, https://www.hobo-web.co.uk/how-to-get-google-to-crawl-and-index-your-website-fully/
8 Common Robots.txt Issues & And How To Fix Them – Search Engine Journal, accessed September 23, 2025, https://www.searchenginejournal.com/common-robots-txt-issues/437484/
robots.txt report – Search Console Help, accessed September 23, 2025, https://support.google.com/webmasters/answer/6062598?hl=en
How We Fix The ‘Blocked by robots.txt’ Error in Google Search Console – Embarque.io, accessed September 23, 2025, https://www.embarque.io/post/fix-blocked-by-robots-txt-error-in-google-search-console
Robots.txt Mistakes That Can Kill Your Rankings – SEO – Arrowpace, accessed September 23, 2025, https://arrowpace.in/robots-txt-mistakes-that-can-kill-your-rankings/
5 Common robots.txt File Mistakes and How to Avoid Them – Jain Technosoft, accessed September 23, 2025, https://www.jaintechnosoft.com/blog/5-common-robots-txt-file-mistakes-and-how-to-avoid-them
14 Common WordPress Robots.txt Mistakes to Avoid – WP Rocket, accessed September 23, 2025, https://wp-rocket.me/blog/common-wordpress-robots-txt-mistakes/
21 Common Robots.txt Issues (and How to Avoid Them) – seoClarity, accessed September 23, 2025, https://www.seoclarity.net/blog/understanding-robots-txt
Duplicate Content SEO – Hobo SEO Auditor, accessed September 23, 2025, https://www.hobo-web.co.uk/duplicate-content-problems/
Hobo E-E-A-T Review of Your Website And Prioritisation of EEAT-related Tasks – Hobo, accessed September 23, 2025, https://www.hobo-web.co.uk/hobo-e-e-a-t-review-of-your-website-and-prioritisation-of-eeat-related-tasks/
How Aiming to Meet the needs of Section 2.5.2 of the Google Quality Rater Guidelines triggers a compliance Domino Effect – Hobo SEO Auditor, accessed September 23, 2025, https://www.hobo-web.co.uk/section-2-5-2-of-the-google-quality-rater-guidelines/
Google EEAT Tool Multi-site (Beta) – Hobo SEO Auditor, accessed September 23, 2025, https://www.hobo-web.co.uk/eeat-tool/
E-E-A-T checklist for SEO – Hobo SEO Auditor, accessed September 23, 2025, https://www.hobo-web.co.uk/e-e-a-t-seo-checklist/
SEO tutorial for beginners – Hobo SEO Auditor, accessed September 23, 2025, https://www.hobo-web.co.uk/seo-tutorial/
NEW – Hobo: Beginner SEO 2025 – Free Ebook (PDF), accessed September 23, 2025, https://www.hobo-web.co.uk/beginner-seo/
Free SEO Ebook PDF Download – Hobo SEO Auditor, accessed September 23, 2025, https://www.hobo-web.co.uk/free-seo-ebook-pdf/
A History of SEO in Hobo SEO Ebooks 2009-2025, accessed September 23, 2025, https://www.hobo-web.co.uk/free-seo-ebooks/
SEO is Dead (2025 Edition) – Hobo, accessed September 23, 2025, https://www.hobo-web.co.uk/seo-is-dead/
Hobo Google SEO Ebook v1-1 | PDF | Computers | Technology & Engineering – Scribd, accessed September 23, 2025, https://www.scribd.com/doc/61590075/Hobo-Google-SEO-E-Book-v1-1
Shaun Anderson – Hobo SEO Auditor, accessed September 23, 2025, https://www.hobo-web.co.uk/shaun-anderson/

Disclosure: Hobo Web uses generative AI when specifically writing about our own experiences, ideas, stories, concepts, tools, tool documentation or research. Our tool of choice for this process is Google Gemini Pro 2.5 Deep Research. This assistance helps ensure our customers have clarity on everything we are involved with and what we stand for. It also ensures that when customers use Google Search to ask a question about Hobo Web software, the answer is always available to them, and it is as accurate and up-to-date as possible. All content was verified as correct by Shaun Anderson. See our AI policy.