RAG (Retrieval-Augmented Generation) eats input tokens. Every query stuffs the LLM with retrieved context, and at scale, those input tokens become the dominant line item on your AI bill. Here is exactly how to budget a RAG pipeline before you build it — and how to keep cost under control after you ship.
A RAG query has four cost components: the system prompt, the retrieved chunks, the user's question, and the generated output.
The retrieved chunks dominate. If you retrieve 5 chunks at 800 tokens each, that's 4,000 input tokens per query, typically 80%+ of your total token cost.
| Config | System | Chunks | Query | Total input | Output | Total tokens |
|---|---|---|---|---|---|---|
| Lean (3 chunks × 400 tokens) | 300 | 1,200 | 150 | 1,650 | 300 | 1,950 |
| Standard (5 chunks × 800 tokens) | 400 | 4,000 | 150 | 4,550 | 500 | 5,050 |
| Heavy (8 chunks × 1,200 tokens) | 500 | 9,600 | 150 | 10,250 | 800 | 11,050 |
| Long-context (15 chunks × 2,000 tokens) | 500 | 30,000 | 150 | 30,650 | 1,000 | 31,650 |
The lean config costs roughly 1/16th of the long-context config (1,950 vs 31,650 tokens per query). For most production workloads, lean or standard is sufficient. Long-context is only worth it if your questions actually require synthesis across many documents.
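The arithmetic behind the table is simple enough to script. A minimal sketch (the 150-token user query is an assumption that matches the table's totals):

```python
def rag_token_budget(system, n_chunks, chunk_size, output, query=150):
    """Per-query token budget for a RAG config.

    system     -- system prompt tokens
    n_chunks   -- number of retrieved chunks
    chunk_size -- tokens per chunk
    output     -- expected output tokens
    query      -- user question tokens (assumed 150 here)
    """
    input_tokens = system + n_chunks * chunk_size + query
    return {"input": input_tokens, "output": output,
            "total": input_tokens + output}

# Standard config: 5 chunks x 800 tokens
standard = rag_token_budget(system=400, n_chunks=5, chunk_size=800, output=500)
print(standard)  # {'input': 4550, 'output': 500, 'total': 5050}
```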
Use the RAG preset in the calculator to see real costs across every model.
Open AI Cost Calculator →

For the standard config (4,550 input + 500 output tokens):
| Model | Per query | Per 1,000 queries |
|---|---|---|
| Gemini 2.0 Flash | $0.00066 | $0.66 |
| GPT-4.1 nano | $0.00066 | $0.66 |
| GPT-4o mini | $0.00098 | $0.98 |
| Gemini 2.5 Flash | $0.00098 | $0.98 |
| DeepSeek V3 | $0.00177 | $1.77 |
| Claude Haiku 3.5 | $0.00564 | $5.64 |
| Gemini 2.5 Pro | $0.01069 | $10.69 |
| GPT-4o | $0.01638 | $16.38 |
| Claude Sonnet 4 | $0.02115 | $21.15 |
| Claude Opus 4 | $0.10575 | $105.75 |
RAG is one of the workloads where the cheap tier really shines. At 10,000 queries per day on GPT-4o mini, you spend $9.80/day. The same workload on Claude Opus 4 costs $1,058/day — over 100x more.
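The per-query figures above are just input and output tokens multiplied by the per-million-token rates. A sketch, using GPT-4o mini's published rates at the time of writing (prices change, so treat them as illustrative):

```python
def cost_per_query(input_tokens, output_tokens, input_per_m, output_per_m):
    """Dollar cost of one query, given per-million-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1e6

# GPT-4o mini: $0.15/M input, $0.60/M output (illustrative rates)
q = cost_per_query(4550, 500, 0.15, 0.60)
print(round(q, 5))  # 0.00098
```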
1. Retrieve fewer chunks. Most RAG systems default to 5-10 chunks. Test with 3. Often quality is identical because the top 3 chunks contain the answer 90%+ of the time. Cutting from 8 to 3 chunks slashes input tokens by ~60%.
2. Use smaller chunks. Default chunk sizes are often 1000-1500 tokens. Try 400-600. Smaller chunks mean more precise retrieval and less noise per chunk. Quality often improves while cost drops.
3. Rerank and drop low-relevance chunks. Use a small reranker model (Cohere Rerank, BGE Reranker) to score retrieved chunks. Drop anything below a relevance threshold. The reranker call costs cents but eliminates 30-50% of the chunk tokens you would have passed to the LLM.
4. Cache the system prompt. Anthropic's prompt caching gives a 90% discount on the cached input prefix; OpenAI's automatic prompt caching gives 50%. A 500-token system prompt on GPT-4o costs about $0.00125 per query uncached and $0.000625 cached. Tiny per query, real at scale.
5. Use a cheap model for generation. The generation step in RAG is usually the easy part — the LLM is just summarizing what is already in the context. Cheap models do this well. Save the premium models for prompts where the LLM has to reason beyond the context.
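Steps 1-3 compose naturally: retrieve a wider candidate set, rerank it, and keep only the chunks above a relevance threshold. A sketch assuming the reranker returns one score per chunk (the scores, the 0.5 threshold, and the cap of 3 are placeholders, not a specific reranker's API):

```python
def filter_chunks(chunks, scores, threshold=0.5, max_chunks=3):
    """Keep the highest-scoring chunks above threshold, capped at max_chunks."""
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, score in ranked[:max_chunks] if score >= threshold]

chunks = ["chunk A", "chunk B", "chunk C", "chunk D", "chunk E"]
scores = [0.91, 0.34, 0.77, 0.12, 0.58]  # hypothetical reranker scores
print(filter_chunks(chunks, scores))  # ['chunk A', 'chunk C', 'chunk E']
```

Passing 3 filtered chunks to the LLM instead of all 5 retrieved ones cuts the chunk tokens before they ever hit the expensive model.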
Let's walk through a real example: a customer support RAG system answering 5,000 queries per day with the standard config (5 chunks × 800 tokens, 500-token output).
Total per month: 150,000 queries × 5,050 tokens = 757.5M tokens. Input: 682.5M, Output: 75M.
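The monthly math scales the per-query budget by query volume. A sketch, again using illustrative GPT-4o mini rates:

```python
def monthly_rag_bill(queries_per_day, input_tokens, output_tokens,
                     input_per_m, output_per_m, days=30):
    """Monthly dollar cost for a RAG workload at per-million-token prices."""
    queries = queries_per_day * days
    total_input = queries * input_tokens    # 682.5M tokens in this example
    total_output = queries * output_tokens  # 75M tokens in this example
    return (total_input * input_per_m + total_output * output_per_m) / 1e6

# 5,000 queries/day, standard config, GPT-4o mini at $0.15/M in, $0.60/M out
print(monthly_rag_bill(5000, 4550, 500, 0.15, 0.60))  # 147.375
```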
| Model | Monthly cost | Per query | Per active customer (10 queries/mo) |
|---|---|---|---|
| Gemini 2.0 Flash | $98.25 | $0.00066 | $0.0066 |
| GPT-4o mini | $147.38 | $0.00098 | $0.0098 |
| Claude Haiku 3.5 | $846.00 | $0.00564 | $0.056 |
| GPT-4o | $2,456.25 | $0.01638 | $0.164 |
| Claude Sonnet 4 | $3,172.50 | $0.02115 | $0.212 |
| Claude Opus 4 | $15,862.50 | $0.10575 | $1.058 |
The spread is dramatic. On Gemini Flash, the entire RAG bill is $98/month — easy to hide in a $99/month per-customer SaaS plan. On Claude Opus, it's $15,862/month — impossible to bury in any reasonable plan.
Run the numbers in the AI Cost Calculator using the RAG preset and compare the result against your budget.
Calculate your RAG bill across every model in one click.
Open AI Cost Calculator →