Robots.txt vs. llms.txt: Controlling AI Crawlers in 2026
Your content might be training large language models right now. The robots.txt protocol has a way to block AI training crawlers — but it only works for bots that respect it. And a new proposal called llms.txt is trying to create a better standard for AI-readable content. Here's the current state of both, what works today, and what's still speculative.
Which AI Companies Are Crawling the Web
As of 2026, major AI companies operate web crawlers for training data and search index building. The main ones you need to know:
| Crawler | Company | User-Agent |
|---|---|---|
| GPTBot | OpenAI | GPTBot |
| ChatGPT-User | OpenAI | ChatGPT-User |
| ClaudeBot | Anthropic | ClaudeBot |
| Google-Extended | Google | Google-Extended |
| PerplexityBot | Perplexity AI | PerplexityBot |
| FacebookBot | Meta | FacebookBot |
| CCBot | Common Crawl | CCBot |
All of these claim to respect robots.txt Disallow rules. Most do. Some smaller training data crawlers don't honor the protocol at all — there's no enforcement mechanism beyond reputation.
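One practical way to see which of these crawlers actually visit your site is to scan your access logs for the user-agent tokens in the table above. Here's a minimal sketch in Python; the log lines are hypothetical examples of a combined-format access log, and the token list is taken straight from the table:

```python
from collections import Counter

# User-agent tokens from the crawler table above
AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "Google-Extended",
               "PerplexityBot", "FacebookBot", "CCBot"]

def count_ai_hits(log_lines):
    """Tally requests per AI crawler by matching the user-agent
    token anywhere in each raw access-log line."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_CRAWLERS:
            if bot in line:
                hits[bot] += 1
                break  # count each line once, under the first match
    return hits

# Hypothetical log lines for illustration
sample = [
    '1.2.3.4 - - [10/Jan/2026] "GET /post HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '5.6.7.8 - - [10/Jan/2026] "GET / HTTP/1.1" 200 "-" "CCBot/2.0"',
]
print(count_ai_hits(sample))
```

Substring matching is deliberately loose here: real user-agent strings wrap these tokens in longer version strings, and rogue crawlers may spoof them, so treat the counts as indicative rather than authoritative.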
How to Block AI Crawlers in Robots.txt Today
To block specific AI training bots, add them as named User-agents with Disallow: / (or a more targeted path):
```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /account/

Sitemap: https://yoursite.com/sitemap.xml
```
Order matters only within each User-agent block. The wildcard User-agent (*) catch-all at the end applies to all crawlers not specifically listed. Named user-agents take precedence over the wildcard for those specific bots.
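You can verify this precedence behavior locally with Python's standard-library `urllib.robotparser`, which implements the same matching rules. This sketch uses a trimmed-down version of the rules above; the URLs are placeholders:

```python
from urllib.robotparser import RobotFileParser

# A trimmed version of the robots.txt rules shown above
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# GPTBot matches its own named block, so it's blocked site-wide
print(rp.can_fetch("GPTBot", "https://yoursite.com/blog/post"))       # False

# An unlisted bot falls through to the '*' block: allowed except /admin/
print(rp.can_fetch("SomeOtherBot", "https://yoursite.com/blog/post")) # True
print(rp.can_fetch("SomeOtherBot", "https://yoursite.com/admin/"))    # False
```

This is a quick sanity check before deploying: an easy mistake is expecting the `*` rules to also apply to named agents, but a bot that matches a named block ignores the wildcard block entirely.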
What Is llms.txt and Where Did It Come From?
llms.txt is a community proposal (not an official standard) suggesting a new file format that sites could use to provide AI-readable documentation. The idea: put an llms.txt file at yoursite.com/llms.txt with curated, concise summaries of what your site contains — structured specifically for how LLMs process information.
The format looks like a simple markdown file with links and brief descriptions of your important content. The pitch is that instead of AI crawlers scraping your full site and extracting whatever they want, you give them a cleaner, intentional summary.
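As a rough illustration, a minimal llms.txt following the proposed structure (an H1 title, a blockquote summary, then H2 sections of annotated links) might look like this; the project name and URLs are hypothetical:

```markdown
# Example Project

> Example Project is an open-source widget library. This file points
> language models at the most useful pages of our documentation.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): Installation and first steps
- [API Reference](https://example.com/docs/api.md): Full public API

## Optional

- [Changelog](https://example.com/changelog.md): Release history
```

The "Optional" section is part of the proposal's convention for content a model can skip when context is limited.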
As of early 2026, llms.txt has no official adoption from major AI companies. OpenAI, Anthropic, and Google have not committed to reading or respecting it. It's an interesting proposal but not yet a real control mechanism. Some documentation-heavy sites (particularly developer tools and SaaS products) have adopted it proactively.
What Actually Works Right Now
Robots.txt is the only mechanism with confirmed, stated support from major AI labs. OpenAI, Anthropic, and Google have all published documentation confirming their crawlers respect robots.txt Disallow rules for their AI training operations.
Key caveats:
- This only blocks future crawling. If your content was already scraped and used in training before you added the Disallow rule, that's not undone.
- This blocks training crawlers, but may not affect search/retrieval components of the same products. GPTBot is separate from Bing's regular crawler that powers some AI search features.
- Rogue crawlers that don't honor the protocol ignore your robots.txt entirely. No technical enforcement exists.
If protecting your content from AI training is important (for example, if you run a paid content site or creative work repository), add the specific agent Disallow rules. It's the best available tool even with these limitations.
The Other Side: Reasons to Allow AI Crawlers
Not everyone should block AI crawlers. There are real arguments for allowing them:
- Citation and referral traffic — AI assistants that cite sources can send visitors to your site. Blocking the crawler may reduce how often you're cited.
- AI search ranking — Perplexity, ChatGPT search, and similar products crawl sites to power search features. Blocking may reduce your presence in AI-powered search.
- Training data as distribution — Some argue that having your brand mentioned in AI training data is a form of marketing. Models learn your brand exists.
The right call depends on what you're protecting. A news publisher with paywalled content should block. A free tool or software documentation site may benefit more from AI citation than it loses from training use.
Frequently Asked Questions
Does blocking AI crawlers in robots.txt remove my content from ChatGPT?
No. It prevents future crawling for training data. Content already included in training data before you added the block remains there. There's no removal mechanism equivalent to Google's deindex request.
Is llms.txt an official standard I should implement?
Not yet. It's a community proposal with no confirmed adoption from major AI companies. It's worth watching, but there's no urgency to implement it in 2026.
Which AI crawlers are most important to block?
GPTBot (OpenAI), Google-Extended (Google AI training), ClaudeBot (Anthropic), and CCBot (Common Crawl, which many AI datasets are built from). These cover the highest-volume AI training sources.
Does blocking Google-Extended affect my Google search rankings?
Google says Google-Extended only affects AI training (Gemini, etc.) and not Google Search crawling. The regular Googlebot is unaffected. But monitor your traffic after adding the block to be safe.
What user-agent does Anthropic's Claude crawler use?
ClaudeBot is the user-agent for Anthropic's web crawler. There's also Claude-SearchBot for search-related crawling. Add both if you want comprehensive blocking.

