AI companies are scraping the open web to train their models, and your content is likely in the training data. Robots.txt is the standard way to tell AI crawlers to stay away. Here is every AI bot user-agent you should know about, with the exact rules to block them.
| Bot Name | Company | Purpose | User-Agent String |
|---|---|---|---|
| GPTBot | OpenAI | Training data for ChatGPT and GPT models | GPTBot |
| ChatGPT-User | OpenAI | ChatGPT browsing feature (retrieval, not training) | ChatGPT-User |
| CCBot | Common Crawl | Web archiving — datasets widely used for AI training | CCBot |
| Google-Extended | Google | Training data for Gemini AI models | Google-Extended |
| anthropic-ai | Anthropic | Training data for Claude models | anthropic-ai |
| ClaudeBot | Anthropic | Claude web browsing and retrieval | ClaudeBot |
| Bytespider | ByteDance | Training data for TikTok AI and ByteDance models | Bytespider |
| FacebookBot | Meta | Training data for Meta AI models | FacebookBot |
| PerplexityBot | Perplexity | AI search engine crawling and retrieval | PerplexityBot |
| Amazonbot | Amazon | Training data for Alexa and Amazon AI products | Amazonbot |
| cohere-ai | Cohere | Training data for Cohere language models | cohere-ai |
| Applebot-Extended | Apple | Training data for Apple Intelligence features | Applebot-Extended |
Add these rules to your robots.txt file. Each block targets one AI crawler:
    # Block AI training bots
    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: anthropic-ai
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Bytespider
    Disallow: /

    User-agent: FacebookBot
    Disallow: /

    User-agent: PerplexityBot
    Disallow: /

    User-agent: Amazonbot
    Disallow: /

    User-agent: cohere-ai
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /
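Before deploying, it is worth confirming the rules do what you intend. Python's standard-library `urllib.robotparser` can parse a robots.txt and answer "may this user-agent fetch this URL?" The snippet below is a minimal sketch using two of the blocks above; `example.com` is a placeholder for your own domain.

```python
from urllib.robotparser import RobotFileParser

# A slice of the robots.txt above; extend with the remaining blocks as needed.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A listed bot is disallowed everywhere; with no "User-agent: *" fallback,
# an unlisted bot is still allowed.
print(parser.can_fetch("GPTBot", "https://example.com/article"))       # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article")) # True
```

Note the second result: these rules only restrict the bots they name. If you also want a default policy for everyone else, add a `User-agent: *` block.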
This is the part most guides skip. Robots.txt is a voluntary protocol. There is no technical mechanism in robots.txt that physically prevents a bot from accessing your pages. It works like a "staff only" sign — polite visitors respect it, but there is no lock on the door.
Robots.txt is necessary but not sufficient. For stronger protection, consider server-side rate limiting, bot detection services, and monitoring your access logs for unusual crawling patterns.
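Server-side enforcement usually comes down to matching the request's User-Agent header against a blocklist and refusing the request. The sketch below shows the idea in plain Python: a case-insensitive substring check over the bot names from the table above. The function name and list are illustrative, not from any particular framework.

```python
# Known AI crawler signatures, taken from the table above.
AI_BOT_SIGNATURES = [
    "GPTBot", "ChatGPT-User", "CCBot", "Google-Extended", "anthropic-ai",
    "ClaudeBot", "Bytespider", "FacebookBot", "PerplexityBot",
    "Amazonbot", "cohere-ai", "Applebot-Extended",
]

def is_ai_bot(user_agent: str) -> bool:
    """Return True if the User-Agent matches a known AI crawler signature."""
    ua = user_agent.lower()
    return any(sig.lower() in ua for sig in AI_BOT_SIGNATURES)

# In real middleware you would return HTTP 403 when this is True,
# instead of relying on the bot to honor robots.txt.
print(is_ai_bot("Mozilla/5.0 (compatible; GPTBot/1.1)"))         # True
print(is_ai_bot("Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"))  # False
```

The same substring test works for scanning access logs: count requests whose User-Agent matches, and you have a quick picture of how much AI crawler traffic you actually receive.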
| Scenario | Recommendation | Why |
|---|---|---|
| You publish original content and want to protect copyright | Block all AI training bots | Prevents your content from entering training datasets |
| You want to appear in AI-generated answers | Allow retrieval bots (ChatGPT-User, PerplexityBot) | These bots fetch content to cite in answers — blocking them removes you from AI search |
| You want zero AI involvement | Block everything | Maximum protection, but you disappear from AI-powered search entirely |
| You run an e-commerce store | Block training bots, allow retrieval bots | Product descriptions in training data help no one; appearing in AI shopping answers helps you |
| You have a news or media site | Block training bots at minimum | Your journalism should not train competing AI summaries for free |
| You want to appear in Google AI Overviews | Keep Googlebot and Google-Extended allowed | Google AI Overviews pull from indexed content; blocking may remove you from these features |
There is an important distinction most people miss: training bots collect your content to build future models, while retrieval bots fetch pages in real time to answer a specific user query, typically with a citation back to the source. Many publishers therefore block training bots but allow retrieval bots. This protects intellectual property while maintaining visibility in AI-powered search results, and it is the most balanced approach for most sites.
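Under that policy, a balanced robots.txt looks like this (a sketch; adjust the bot list to your own judgment). An empty `Disallow:` directive means the named bot may crawl everything:

    # Block training crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: anthropic-ai
    Disallow: /

    # Allow retrieval crawlers (empty Disallow = full access)
    User-agent: ChatGPT-User
    Disallow:

    User-agent: PerplexityBot
    Disallow: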
To check your current setup, open yoursite.com/robots.txt in your browser and confirm the rules you expect are actually being served.