
How to Budget Tokens for a RAG Pipeline (With Real Numbers)

Last updated: April 2026 · 8 min read · AI Tools

RAG (Retrieval-Augmented Generation) eats input tokens. Every query stuffs the LLM with retrieved context, and at scale, those input tokens become the dominant line item on your AI bill. Here is exactly how to budget a RAG pipeline before you build it — and how to keep cost under control after you ship.

The RAG Cost Equation

A RAG query has four cost components:

  1. System prompt — fixed across queries, 200-500 tokens
  2. Retrieved chunks — variable, depends on chunk count × chunk size
  3. User question — variable but small, 50-200 tokens
  4. Generated answer — variable, 200-800 tokens

The retrieved chunks dominate. If you retrieve 5 chunks at 800 tokens each, that's 4,000 input tokens per query — typically 80%+ of your total token cost.
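The four components above add up to a simple per-query budget. Here is that arithmetic as a small sketch; the chunk counts and sizes match the article's "standard" config, and the ~150-token question is an assumption inferred from the tables below.

```python
# Per-query token budget for one RAG call.
# Defaults follow the article's "standard" config; the 150-token
# question is an assumed average, not a measurement.

def rag_query_tokens(system=400, chunks=5, chunk_size=800,
                     question=150, output=500):
    """Return (input_tokens, output_tokens, total) for one RAG query."""
    input_tokens = system + chunks * chunk_size + question
    return input_tokens, output, input_tokens + output

inp, out, total = rag_query_tokens()
print(inp, out, total)            # 4550 500 5050

# Retrieved chunks as a share of the input budget:
print(round(5 * 800 / inp, 2))    # 0.88 -> chunks are ~80%+ of input
```

Tweaking `chunks` and `chunk_size` here is how every cost lever in this article works.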

Token Budget by Configuration

| Config | System | Chunks | Total input | Output | Total tokens |
|---|---|---|---|---|---|
| Lean (3 chunks × 400 tokens) | 300 | 1,200 | 1,650 | 300 | 1,950 |
| Standard (5 chunks × 800 tokens) | 400 | 4,000 | 4,550 | 500 | 5,050 |
| Heavy (8 chunks × 1,200 tokens) | 500 | 9,600 | 10,250 | 800 | 11,050 |
| Long-context (15 chunks × 2,000 tokens) | 500 | 30,000 | 30,650 | 1,000 | 31,650 |

Total input includes a ~150-token user question.

The lean config uses roughly a sixteenth of the tokens of the long-context config (1,950 vs 31,650). For most production workloads, lean or standard is sufficient. Long-context is only worth it if your questions genuinely require synthesis across many documents.
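The table above reduces to a one-line formula per row. This sketch reproduces it; the 150-token question is an assumption inferred from the table's totals.

```python
# Reproduce the token-budget table. The ~150-token question is an
# assumption backed out from the published totals.
QUESTION = 150

configs = {  # name: (system, chunk_count, chunk_size, output)
    "lean":         (300, 3, 400, 300),
    "standard":     (400, 5, 800, 500),
    "heavy":        (500, 8, 1200, 800),
    "long-context": (500, 15, 2000, 1000),
}

totals = {}
for name, (system, k, size, output) in configs.items():
    input_tokens = system + k * size + QUESTION
    totals[name] = input_tokens + output

print(totals)  # lean 1950, standard 5050, heavy 11050, long-context 31650
print(round(totals["long-context"] / totals["lean"], 1))  # 16.2
```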

Use the RAG preset in the calculator to see real costs across every model.

Open AI Cost Calculator →

Per-Query Cost Across Models

For the standard config (4,550 input + 500 output tokens):

| Model | Per query | Per 1,000 queries |
|---|---|---|
| Gemini 2.0 Flash | $0.00066 | $0.66 |
| GPT-4.1 nano | $0.00066 | $0.66 |
| GPT-4o mini | $0.00098 | $0.98 |
| Gemini 2.5 Flash | $0.00098 | $0.98 |
| DeepSeek V3 | $0.00177 | $1.77 |
| Claude Haiku 3.5 | $0.00564 | $5.64 |
| Gemini 2.5 Pro | $0.01069 | $10.69 |
| GPT-4o | $0.01638 | $16.38 |
| Claude Sonnet 4 | $0.02115 | $21.15 |
| Claude Opus 4 | $0.10575 | $105.75 |

RAG is one of the workloads where the cheap tier really shines. At 10,000 queries per day on GPT-4o mini, you spend $9.80/day. The same workload on Claude Opus 4 costs $1,058/day — over 100x more.
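The per-query figures come straight from per-million-token pricing. A minimal sketch of that calculation, using two rates consistent with the table ($0.15/$0.60 per million for GPT-4o mini, $2.50/$10.00 for GPT-4o) — treat them as illustrative snapshots, since provider pricing changes:

```python
# Per-query cost for the standard config (4,550 input + 500 output).
# Per-million-token rates are snapshots for illustration; check
# current provider pricing before relying on them.
PRICES = {  # model: (input $/M, output $/M)
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o":      (2.50, 10.00),
}

def query_cost(model, input_tokens=4550, output_tokens=500):
    p_in, p_out = PRICES[model]
    return input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out

print(round(query_cost("gpt-4o-mini"), 5))       # 0.00098 per query
print(round(query_cost("gpt-4o") * 10_000, 2))   # 163.75 per 10k queries
```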

Five Ways to Reduce RAG Token Cost

1. Retrieve fewer chunks. Most RAG systems default to 5-10 chunks. Test with 3. Often quality is identical because the top 3 chunks contain the answer 90%+ of the time. Cutting from 8 to 3 chunks slashes input tokens by ~60%.

2. Use smaller chunks. Default chunk sizes are often 1000-1500 tokens. Try 400-600. Smaller chunks mean more precise retrieval and less noise per chunk. Quality often improves while cost drops.
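Tips 1 and 2 are both just multiplications on the budget formula. A quick check of the "8 chunks down to 3" claim, assuming the heavy config's 500-token system prompt and a ~150-token question:

```python
# Input-token savings from retrieving fewer chunks.
# system=500 and question=150 are assumptions matching the
# article's heavy-config table row.
def input_tokens(k, chunk_size, system=500, question=150):
    return system + k * chunk_size + question

before = input_tokens(8, 1200)      # 10250 -- the heavy config
after = input_tokens(3, 1200)       # 4250
print(round(1 - after / before, 2)) # 0.59 -> cutting 8 -> 3 saves ~60%
```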

3. Rerank and drop low-relevance chunks. Use a small reranker model (Cohere Rerank, BGE Reranker) to score retrieved chunks. Drop anything below a relevance threshold. The reranker call costs cents but eliminates 30-50% of the chunk tokens you would have passed to the LLM.
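The rerank-and-drop step looks roughly like this. `score` stands in for a real reranker call (Cohere Rerank, a BGE cross-encoder, etc.); the scores below are made up for illustration.

```python
# Sketch of reranker-based chunk pruning. `score` is a stand-in for
# a real reranker; the fake scores are purely illustrative.
THRESHOLD = 0.5

def prune_chunks(question, chunks, score, threshold=THRESHOLD):
    """Keep only chunks whose relevance clears the threshold,
    highest-scoring first."""
    scored = [(score(question, c), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for s, c in scored if s >= threshold]

# Made-up relevance scores for five retrieved chunks:
fake_scores = {"a": 0.91, "b": 0.34, "c": 0.72, "d": 0.18, "e": 0.55}
kept = prune_chunks("q", list(fake_scores), lambda q, c: fake_scores[c])
print(kept)  # ['a', 'c', 'e'] -- 2 of 5 chunks dropped before the LLM call
```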

4. Cache the system prompt. Anthropic's prompt caching gives a 90% discount on the cached input prefix; OpenAI's automatic prompt caching gives 50%. If your system prompt is 500 tokens on GPT-4o ($2.50/M input), caching drops that slice from about $0.00125 to $0.0006 per query. Tiny per query, real at scale.
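The caching savings are easy to size up front. A sketch using the discount rates from the article (90% Anthropic, 50% OpenAI auto-caching) and an assumed $2.50/M input rate:

```python
# Back-of-the-envelope savings from caching a 500-token system prompt.
# Discount rates follow the article; the $/M input rate is an
# assumed example.
def cached_prefix_cost(prefix_tokens, price_per_m, discount):
    """Return (full, cached) per-query cost of the cached prefix."""
    full = prefix_tokens / 1e6 * price_per_m
    return full, full * (1 - discount)

full, cached = cached_prefix_cost(500, 2.50, discount=0.50)  # 50% auto-cache
print(full, cached)               # 0.00125 -> 0.000625 per query
print((full - cached) * 150_000)  # ~$93.75/month at 150k queries
```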

5. Use a cheap model for generation. The generation step in RAG is usually the easy part — the LLM is just summarizing what is already in the context. Cheap models do this well. Save the premium models for prompts where the LLM has to reason beyond the context.

Real Workload Math

Let's walk through a real example: a customer support RAG system answering 5,000 queries per day with the standard config (5 chunks × 800 tokens, 500-token output).

Total per month: 150,000 queries × 5,050 tokens = 757.5M tokens. Input: 682.5M, Output: 75M.
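That monthly math, spelled out (the $0.15/$0.60 per-million rates are an assumed GPT-4o-mini-style snapshot, matching the table below):

```python
# Monthly token math for the support workload: 5,000 queries/day,
# standard config (4,550 input + 500 output per query).
QUERIES_PER_DAY = 5_000
DAYS = 30
INPUT_TOKENS, OUTPUT_TOKENS = 4_550, 500

queries = QUERIES_PER_DAY * DAYS
input_m = queries * INPUT_TOKENS / 1e6    # millions of input tokens
output_m = queries * OUTPUT_TOKENS / 1e6  # millions of output tokens
print(queries, input_m, output_m)  # 150000 682.5 75.0

# Monthly cost at assumed rates ($0.15/M in, $0.60/M out):
print(input_m * 0.15 + output_m * 0.60)  # 147.375 -> ~$147/month
```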

| Model | Monthly cost | Per query | Per active customer (10 queries/mo) |
|---|---|---|---|
| Gemini 2.0 Flash | $98.25 | $0.00066 | $0.0066 |
| GPT-4o mini | $147.38 | $0.00098 | $0.0098 |
| Claude Haiku 3.5 | $846.00 | $0.00564 | $0.056 |
| GPT-4o | $2,456.25 | $0.01638 | $0.164 |
| Claude Sonnet 4 | $3,172.50 | $0.02115 | $0.212 |
| Claude Opus 4 | $15,862.50 | $0.10575 | $1.058 |

The spread is dramatic. On Gemini Flash, the entire RAG bill is $98/month — less than a single customer's $99/month SaaS subscription. On Claude Opus, it's $15,862/month — impossible to bury in any reasonable plan.

The Default RAG Setup for 2026

  1. Embed documents with text-embedding-3-small or BGE Large
  2. Chunk size: 400-600 tokens
  3. Retrieve top 5 by vector similarity
  4. Rerank with Cohere Rerank, drop anything below 0.5
  5. Generate on GPT-4o mini or Gemini 2.5 Flash
  6. Cache the system prompt (90% input discount on Anthropic, 50% on OpenAI auto-caching)
  7. Escalate only the prompts where quality fails to GPT-4.1 or Claude Sonnet
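The checklist above can be captured as a single config sketch. The model names, threshold, and escalation rule come from the article; the dictionary shape and `pick_generator` helper are illustrative, not tied to any SDK.

```python
# The 2026 default RAG setup as one config sketch (illustrative,
# not a real SDK schema).
RAG_DEFAULTS = {
    "embedding_model": "text-embedding-3-small",  # or BGE Large
    "chunk_tokens": (400, 600),                   # target chunk-size range
    "top_k": 5,                                   # vector-similarity retrieval
    "rerank": {"model": "cohere-rerank", "min_score": 0.5},
    "generator": "gpt-4o-mini",                   # or Gemini 2.5 Flash
    "cache_system_prompt": True,                  # 90% / 50% input discount
    "escalate_to": "claude-sonnet",               # only on quality failures
}

def pick_generator(quality_ok: bool, cfg=RAG_DEFAULTS):
    """Route to the cheap model unless this prompt previously failed QA."""
    return cfg["generator"] if quality_ok else cfg["escalate_to"]

print(pick_generator(True))   # gpt-4o-mini
print(pick_generator(False))  # claude-sonnet
```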

Run the numbers in the AI Cost Calculator using the RAG preset. Compare what you get vs your budget.

Calculate your RAG bill across every model in one click.

Open AI Cost Calculator →