
Robots.txt vs. llms.txt: Controlling AI Crawlers in 2026

Last updated: April 2026 · 7 min read

Table of Contents

  1. Which AI companies are crawling the web
  2. How to block AI crawlers in robots.txt today
  3. What llms.txt is and where it came from
  4. What actually works right now
  5. The other side: reasons to allow AI crawlers
  6. Frequently Asked Questions

Your content might be training large language models right now. The robots.txt protocol has a way to block AI training crawlers — but it only works for bots that respect it. And a new proposal called llms.txt is trying to create a better standard for AI-readable content. Here's the current state of both, what works today, and what's still speculative.

Which AI Companies Are Crawling the Web

As of 2026, major AI companies operate web crawlers for training data and search index building. The main ones you need to know:

Crawler           Company         User-agent
GPTBot            OpenAI          GPTBot
ChatGPT-User      OpenAI          ChatGPT-User
Claude-Web        Anthropic       ClaudeBot
Google-Extended   Google          Google-Extended
PerplexityBot     Perplexity AI   PerplexityBot
FacebookBot       Meta            FacebookBot
CCBot             Common Crawl    CCBot

All of these claim to respect robots.txt Disallow rules. Most do. Some smaller training data crawlers don't honor the protocol at all — there's no enforcement mechanism beyond reputation.
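One way to find out which of these bots actually visit your site is to scan your server access logs for their user-agent strings. A minimal sketch in Python — the log lines, log format, and crawler list are illustrative, not exhaustive:

```python
# Count hits from known AI crawler user-agents in web server access logs.
# Assumes a standard nginx/Apache combined log format where the
# user-agent appears somewhere in each line.
from collections import Counter

AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "ClaudeBot",
               "Google-Extended", "PerplexityBot", "FacebookBot", "CCBot"]

def count_ai_hits(log_lines):
    hits = Counter()
    for line in log_lines:
        for bot in AI_CRAWLERS:
            if bot in line:
                hits[bot] += 1
    return hits

# Two hypothetical log lines: one AI crawler, one ordinary browser.
sample = [
    '1.2.3.4 - - [01/Apr/2026] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '5.6.7.8 - - [01/Apr/2026] "GET /a HTTP/1.1" 200 99 "-" "Mozilla/5.0"',
]
print(count_ai_hits(sample))  # Counter({'GPTBot': 1})
```

Substring matching on the raw line is crude but good enough for a first look; a production version would parse the user-agent field properly.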

How to Block AI Crawlers in Robots.txt Today

To block specific AI training bots, add them as named User-agents with Disallow: / (or a more targeted path):

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /account/

Sitemap: https://yoursite.com/sitemap.xml

Order matters only within each User-agent group. A crawler obeys the single most specific group that matches it: each named bot above follows only its own rules, while the wildcard (*) group at the end applies to every crawler without a named entry. Groups don't combine, so a bot matched by name ignores the wildcard rules entirely.
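You can check this precedence behavior with Python's standard-library robots.txt parser. A minimal sketch, using a trimmed version of the rules above (the bot names and URLs are placeholders):

```python
from urllib import robotparser

# A trimmed robots.txt: one named AI bot plus a wildcard group.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# GPTBot matches its named group, so the wildcard group never applies to it.
print(rp.can_fetch("GPTBot", "https://yoursite.com/blog/post"))    # False
# An unlisted bot falls through to the wildcard group.
print(rp.can_fetch("RandomBot", "https://yoursite.com/blog/post")) # True
print(rp.can_fetch("RandomBot", "https://yoursite.com/admin/"))    # False
```

This is also a handy way to sanity-check a robots.txt file before deploying it.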


What Is llms.txt and Where Did It Come From?

llms.txt is a community proposal (not an official standard) suggesting a new file format that sites could use to provide AI-readable documentation. The idea: put an llms.txt file at yoursite.com/llms.txt with curated, concise summaries of what your site contains — structured specifically for how LLMs process information.

The format looks like a simple markdown file with links and brief descriptions of your important content. The pitch is that instead of AI crawlers scraping your full site and extracting whatever they want, you give them a cleaner, intentional summary.
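A hypothetical llms.txt following the general shape of the proposal — the project name, URLs, and descriptions here are invented for illustration:

```markdown
# Example Docs

> Example Docs is a hosted documentation tool. This file points AI
> systems at the pages that best summarize the product.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): install and first run
- [API reference](https://example.com/docs/api.md): endpoints and authentication

## Optional

- [Changelog](https://example.com/changelog.md): release history
```

The file lives at the site root (yoursite.com/llms.txt), with an H1 title, a short blockquote summary, and sections of annotated links.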

As of early 2026, llms.txt has no official adoption from major AI companies. OpenAI, Anthropic, and Google have not committed to reading or respecting it. It's an interesting proposal but not yet a real control mechanism. Some documentation-heavy sites (particularly developer tools and SaaS products) have adopted it proactively.

What Actually Works Right Now

Robots.txt is the only mechanism with confirmed, stated support from major AI labs. OpenAI, Anthropic, and Google have all published documentation confirming their crawlers respect robots.txt Disallow rules for their AI training operations.

Key caveats:

  - Compliance is voluntary. There is no enforcement mechanism beyond reputation, so a bad actor can simply ignore your rules.
  - Blocking only affects future crawling. Content already used for training stays in existing models; there is no removal mechanism.
  - Smaller training-data crawlers may not honor robots.txt at all, and new AI user-agents appear regularly, so the block list needs periodic updates.

If protecting your content from AI training is important (for example, if you run a paid content site or a creative-work repository), add the specific agent Disallow rules anyway. It's the best available tool even with these limitations.

The Other Side: Reasons to Allow AI Crawlers

Not everyone should block AI crawlers. There are real arguments for allowing them:

  - AI assistants and AI-powered search increasingly cite their sources, which can send readers to your site.
  - Being represented in AI answers keeps your product or documentation visible to users who never open a traditional search results page.
  - For free tools and open documentation, reach generally matters more than exclusivity.

The right call depends on what you're protecting. A news publisher with paywalled content should block. A free tool or software documentation site may benefit more from AI citation than it loses to training use.


Frequently Asked Questions

Does blocking AI crawlers in robots.txt remove my content from ChatGPT?

No. It prevents future crawling for training data. Content already included in training data before you added the block remains there. There's no removal mechanism equivalent to Google's deindex request.

Is llms.txt an official standard I should implement?

Not yet. It's a community proposal with no confirmed adoption from major AI companies. It's worth watching, but there's no urgency to implement it in 2026.

Which AI crawlers are most important to block?

GPTBot (OpenAI), Google-Extended (Google AI training), ClaudeBot (Anthropic), and CCBot (Common Crawl, which many AI datasets are built from). These cover the highest-volume AI training sources.

Does blocking Google-Extended affect my Google search rankings?

Google says Google-Extended only affects AI training (Gemini, etc.) and not Google Search crawling. The regular Googlebot is unaffected. But monitor your traffic after adding the block to be safe.

What user-agent does Anthropic's Claude crawler use?

ClaudeBot is the user-agent for Anthropic's web crawler. There's also Claude-SearchBot for search-related crawling. Add both if you want comprehensive blocking.
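Based on the user-agents named above, blocking both would look like the following sketch (verify the current agent names against Anthropic's own documentation before relying on them):

```
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Disallow: /
```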
