Robots.txt
A text file placed at the root of a website that tells web crawlers which pages or sections to crawl or skip.
Also known as: Robots Exclusion Protocol, robots.txt file
Category: Concepts
Tags: seo, web, technical, crawling, protocols
Explanation
Robots.txt (the Robots Exclusion Protocol, standardized as RFC 9309) is a plain text file at a site's root (e.g., example.com/robots.txt) that communicates crawling rules to web robots. It uses directives such as User-agent (which crawler the rules apply to), Disallow (paths to skip), Allow (exceptions within disallowed paths), Sitemap (location of XML sitemaps), and the non-standard Crawl-delay (minimum time between requests, honored by some crawlers but ignored by Google).
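A minimal file illustrating these directives (the paths and sitemap URL are hypothetical):

```
# Rules for all crawlers
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Crawl-delay: 10   # non-standard; not all crawlers honor it

Sitemap: https://example.com/sitemap.xml
```

Blank lines separate groups of rules; a crawler uses the group whose User-agent line best matches its own name, falling back to the `*` group.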
How it works:
1. A crawler visits a site and first checks for /robots.txt
2. It reads the directives applicable to its User-agent
3. It follows (or ignores, for non-compliant bots) the specified rules
4. It proceeds to crawl allowed pages
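The decision a compliant crawler makes in steps 2-4 can be sketched with Python's standard-library urllib.robotparser. The rules and the "MyBot" user agent below are hypothetical; note that Python's parser applies rules in file order (first match wins), unlike Google's longest-match rule, so the Allow line is listed before the broader Disallow:

```python
from urllib.robotparser import RobotFileParser

# Parse rules as a crawler would after fetching /robots.txt.
# parse() accepts the file's lines directly, so no network request is needed.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /admin/public/",   # exception, listed first so it matches first
    "Disallow: /admin/",
])

# Look up the rules applicable to our User-agent and decide per URL.
print(rp.can_fetch("MyBot", "https://example.com/admin/secret"))    # False
print(rp.can_fetch("MyBot", "https://example.com/admin/public/a"))  # True
print(rp.can_fetch("MyBot", "https://example.com/blog/post"))       # True
```

A non-compliant bot simply never performs this check, which is why robots.txt cannot serve as an access control mechanism.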
Important limitations:
- Robots.txt is advisory, not enforced; malicious or poorly written bots can simply ignore it
- It does not prevent indexing if other sites link to disallowed pages
- Disallowing a page doesn't remove it from search results (use noindex for that)
- Blocking CSS/JS files can hurt rendering and SEO
Common use cases:
- Prevent crawling of admin areas, staging environments, or internal search results
- Block crawling of duplicate content or low-value pages
- Manage crawl budget by directing bots to important content
- Point crawlers to the XML sitemap
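The use cases above translate directly into directives. A hypothetical file that hides internal search results and a staging path from all crawlers while advertising the sitemap:

```
User-agent: *
Disallow: /search      # internal search results (duplicate, low-value)
Disallow: /staging/    # work-in-progress content

Sitemap: https://example.com/sitemap.xml
```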
Best practices: keep the file simple, test changes with Google Search Console's robots.txt report, never use it to hide sensitive information (the file is publicly readable and can act as a map to private areas), and combine it with meta robots tags or X-Robots-Tag headers for full control over indexing.
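For pages that must stay out of search results, the meta robots tag (not a robots.txt Disallow) is the right tool; the page must remain crawlable so engines can actually see the tag:

```html
<!-- In the page's <head>: allow crawling, but ask engines not to index -->
<meta name="robots" content="noindex">
```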