AI companies are scraping the open web to train their models, and your content is likely in the training data. Robots.txt is the standard way to tell AI crawlers to stay away. Here is every AI bot user-agent you should know about, with the exact rules to block them.
| Bot Name | Company | Purpose | User-Agent String |
|---|---|---|---|
| GPTBot | OpenAI | Training data for ChatGPT and GPT models | GPTBot |
| ChatGPT-User | OpenAI | ChatGPT browsing feature (retrieval, not training) | ChatGPT-User |
| CCBot | Common Crawl | Web archiving — datasets widely used for AI training | CCBot |
| Google-Extended | Google | Training data for Gemini AI models | Google-Extended |
| anthropic-ai | Anthropic | Training data for Claude models | anthropic-ai |
| ClaudeBot | Anthropic | Claude web browsing and retrieval | ClaudeBot |
| Bytespider | ByteDance | Training data for TikTok AI and ByteDance models | Bytespider |
| FacebookBot | Meta | Training data for Meta AI models | FacebookBot |
| PerplexityBot | Perplexity | AI search engine crawling and retrieval | PerplexityBot |
| Amazonbot | Amazon | Training data for Alexa and Amazon AI products | Amazonbot |
| cohere-ai | Cohere | Training data for Cohere language models | cohere-ai |
| Applebot-Extended | Apple | Training data for Apple Intelligence features | Applebot-Extended |
Add these rules to your robots.txt file. Each block targets one AI crawler:
    # Block AI training bots
    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: anthropic-ai
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Bytespider
    Disallow: /

    User-agent: FacebookBot
    Disallow: /

    User-agent: PerplexityBot
    Disallow: /

    User-agent: Amazonbot
    Disallow: /

    User-agent: cohere-ai
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /
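Before deploying, it is worth confirming the rules do what you intend. Python's standard-library `urllib.robotparser` can parse a robots.txt and answer "may this user-agent fetch this URL?" The snippet below is a minimal sketch using two of the blocks above; `example.com` is a placeholder for your own domain.

```python
from urllib.robotparser import RobotFileParser

# A slice of the robots.txt above; extend with the remaining blocks as needed.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A listed bot is disallowed everywhere; with no "User-agent: *" fallback,
# an unlisted bot is still allowed.
print(parser.can_fetch("GPTBot", "https://example.com/article"))       # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article")) # True
```

Note the second result: these rules only restrict the bots they name. If you also want a default policy for everyone else, add a `User-agent: *` block.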
This is the part most guides skip. Robots.txt is a voluntary protocol. There is no technical mechanism in robots.txt that physically prevents a bot from accessing your pages. It works like a "staff only" sign — polite visitors respect it, but there is no lock on the door.
Robots.txt is necessary but not sufficient. For stronger protection, consider server-side rate limiting, bot detection services, and monitoring your access logs for unusual crawling patterns.
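Server-side enforcement usually comes down to matching the request's User-Agent header against a blocklist and refusing the request. The sketch below shows the idea in plain Python: a case-insensitive substring check over the bot names from the table above. The function name and list are illustrative, not from any particular framework.

```python
# Known AI crawler signatures, taken from the table above.
AI_BOT_SIGNATURES = [
    "GPTBot", "ChatGPT-User", "CCBot", "Google-Extended", "anthropic-ai",
    "ClaudeBot", "Bytespider", "FacebookBot", "PerplexityBot",
    "Amazonbot", "cohere-ai", "Applebot-Extended",
]

def is_ai_bot(user_agent: str) -> bool:
    """Return True if the User-Agent matches a known AI crawler signature."""
    ua = user_agent.lower()
    return any(sig.lower() in ua for sig in AI_BOT_SIGNATURES)

# In real middleware you would return HTTP 403 when this is True,
# instead of relying on the bot to honor robots.txt.
print(is_ai_bot("Mozilla/5.0 (compatible; GPTBot/1.1)"))         # True
print(is_ai_bot("Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"))  # False
```

The same substring test works for scanning access logs: count requests whose User-Agent matches, and you have a quick picture of how much AI crawler traffic you actually receive.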
| Scenario | Recommendation | Why |
|---|---|---|
| You publish original content and want to protect copyright | Block all AI training bots | Prevents your content from entering training datasets |
| You want to appear in AI-generated answers | Allow retrieval bots (ChatGPT-User, PerplexityBot) | These bots fetch content to cite in answers — blocking them removes you from AI search |
| You want zero AI involvement | Block everything | Maximum protection, but you disappear from AI-powered search entirely |
| You run an e-commerce store | Block training bots, allow retrieval bots | Product descriptions in training data help no one; appearing in AI shopping answers helps you |
| You have a news or media site | Block training bots at minimum | Your journalism should not train competing AI summaries for free |
| You want to appear in Google AI Overviews | Keep Googlebot and Google-Extended allowed | Google AI Overviews pull from indexed content; blocking may remove you from these features |
There is an important distinction most people miss: training bots collect your content to build future models, while retrieval bots fetch pages in real time to answer a specific user query, typically with a citation back to the source. Many publishers therefore block training bots but allow retrieval bots. This protects intellectual property while maintaining visibility in AI-powered search results, and it is the most balanced approach for most sites.
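Under that policy, a balanced robots.txt looks like this (a sketch; adjust the bot list to your own judgment). An empty `Disallow:` directive means the named bot may crawl everything:

    # Block training crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: anthropic-ai
    Disallow: /

    # Allow retrieval crawlers (empty Disallow = full access)
    User-agent: ChatGPT-User
    Disallow:

    User-agent: PerplexityBot
    Disallow: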
To check your current setup, open yoursite.com/robots.txt in your browser and confirm the rules you expect are actually being served.