
How to Budget Tokens for a RAG Pipeline (With Real Numbers)

Last updated: April 2026 · 8 min read · AI Tools

RAG (Retrieval-Augmented Generation) eats input tokens. Every query stuffs the LLM with retrieved context, and at scale, those input tokens become the dominant line item on your AI bill. Here is exactly how to budget a RAG pipeline before you build it — and how to keep cost under control after you ship.

The RAG Cost Equation

A RAG query has four cost components:

  1. System prompt — fixed across queries, 200-500 tokens
  2. Retrieved chunks — variable, depends on chunk count × chunk size
  3. User question — variable but small, 50-200 tokens
  4. Generated answer — variable, 200-800 tokens

The retrieved chunks dominate. If you retrieve 5 chunks at 800 tokens each, that's 4,000 input tokens per query — typically 80%+ of your total token cost.
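The four components above add up to a simple per-query budget. Here is that arithmetic as a small sketch; the chunk counts and sizes match the article's "standard" config, and the ~150-token question is an assumption inferred from the tables below.

```python
# Per-query token budget for one RAG call.
# Defaults follow the article's "standard" config; the 150-token
# question is an assumed average, not a measurement.

def rag_query_tokens(system=400, chunks=5, chunk_size=800,
                     question=150, output=500):
    """Return (input_tokens, output_tokens, total) for one RAG query."""
    input_tokens = system + chunks * chunk_size + question
    return input_tokens, output, input_tokens + output

inp, out, total = rag_query_tokens()
print(inp, out, total)            # 4550 500 5050

# Retrieved chunks as a share of the input budget:
print(round(5 * 800 / inp, 2))    # 0.88 -> chunks are ~80%+ of input
```

Tweaking `chunks` and `chunk_size` here is how every cost lever in this article works.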

Token Budget by Configuration

| Config | System | Chunks | Total input | Output | Total tokens |
|---|---|---|---|---|---|
| Lean (3 chunks × 400 tokens) | 300 | 1,200 | 1,650 | 300 | 1,950 |
| Standard (5 chunks × 800 tokens) | 400 | 4,000 | 4,550 | 500 | 5,050 |
| Heavy (8 chunks × 1,200 tokens) | 500 | 9,600 | 10,250 | 800 | 11,050 |
| Long-context (15 chunks × 2,000 tokens) | 500 | 30,000 | 30,650 | 1,000 | 31,650 |

Total input includes a ~150-token user question.

The lean config uses roughly a sixteenth of the tokens of the long-context config (1,950 vs 31,650). For most production workloads, lean or standard is sufficient. Long-context is only worth it if your questions genuinely require synthesis across many documents.
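The table above reduces to a one-line formula per row. This sketch reproduces it; the 150-token question is an assumption inferred from the table's totals.

```python
# Reproduce the token-budget table. The ~150-token question is an
# assumption backed out from the published totals.
QUESTION = 150

configs = {  # name: (system, chunk_count, chunk_size, output)
    "lean":         (300, 3, 400, 300),
    "standard":     (400, 5, 800, 500),
    "heavy":        (500, 8, 1200, 800),
    "long-context": (500, 15, 2000, 1000),
}

totals = {}
for name, (system, k, size, output) in configs.items():
    input_tokens = system + k * size + QUESTION
    totals[name] = input_tokens + output

print(totals)  # lean 1950, standard 5050, heavy 11050, long-context 31650
print(round(totals["long-context"] / totals["lean"], 1))  # 16.2
```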

Use the RAG preset in the calculator to see real costs across every model.

Open AI Cost Calculator →

Per-Query Cost Across Models

For the standard config (4,550 input + 500 output tokens):

| Model | Per query | Per 1,000 queries |
|---|---|---|
| Gemini 2.0 Flash | $0.00066 | $0.66 |
| GPT-4.1 nano | $0.00066 | $0.66 |
| GPT-4o mini | $0.00098 | $0.98 |
| Gemini 2.5 Flash | $0.00098 | $0.98 |
| DeepSeek V3 | $0.00177 | $1.77 |
| Claude Haiku 3.5 | $0.00564 | $5.64 |
| Gemini 2.5 Pro | $0.01069 | $10.69 |
| GPT-4o | $0.01638 | $16.38 |
| Claude Sonnet 4 | $0.02115 | $21.15 |
| Claude Opus 4 | $0.10575 | $105.75 |

RAG is one of the workloads where the cheap tier really shines. At 10,000 queries per day on GPT-4o mini, you spend $9.80/day. The same workload on Claude Opus 4 costs $1,058/day — over 100x more.
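The per-query figures come straight from per-million-token pricing. A minimal sketch of that calculation, using two rates consistent with the table ($0.15/$0.60 per million for GPT-4o mini, $2.50/$10.00 for GPT-4o) — treat them as illustrative snapshots, since provider pricing changes:

```python
# Per-query cost for the standard config (4,550 input + 500 output).
# Per-million-token rates are snapshots for illustration; check
# current provider pricing before relying on them.
PRICES = {  # model: (input $/M, output $/M)
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o":      (2.50, 10.00),
}

def query_cost(model, input_tokens=4550, output_tokens=500):
    p_in, p_out = PRICES[model]
    return input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out

print(round(query_cost("gpt-4o-mini"), 5))       # 0.00098 per query
print(round(query_cost("gpt-4o") * 10_000, 2))   # 163.75 per 10k queries
```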

Five Ways to Reduce RAG Token Cost

1. Retrieve fewer chunks. Most RAG systems default to 5-10 chunks. Test with 3. Often quality is identical because the top 3 chunks contain the answer 90%+ of the time. Cutting from 8 to 3 chunks slashes input tokens by ~60%.

2. Use smaller chunks. Default chunk sizes are often 1000-1500 tokens. Try 400-600. Smaller chunks mean more precise retrieval and less noise per chunk. Quality often improves while cost drops.
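Tips 1 and 2 are both just multiplications on the budget formula. A quick check of the "8 chunks down to 3" claim, assuming the heavy config's 500-token system prompt and a ~150-token question:

```python
# Input-token savings from retrieving fewer chunks.
# system=500 and question=150 are assumptions matching the
# article's heavy-config table row.
def input_tokens(k, chunk_size, system=500, question=150):
    return system + k * chunk_size + question

before = input_tokens(8, 1200)      # 10250 -- the heavy config
after = input_tokens(3, 1200)       # 4250
print(round(1 - after / before, 2)) # 0.59 -> cutting 8 -> 3 saves ~60%
```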

3. Rerank and drop low-relevance chunks. Use a small reranker model (Cohere Rerank, BGE Reranker) to score retrieved chunks. Drop anything below a relevance threshold. The reranker call costs cents but eliminates 30-50% of the chunk tokens you would have passed to the LLM.
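The rerank-and-drop step looks roughly like this. `score` stands in for a real reranker call (Cohere Rerank, a BGE cross-encoder, etc.); the scores below are made up for illustration.

```python
# Sketch of reranker-based chunk pruning. `score` is a stand-in for
# a real reranker; the fake scores are purely illustrative.
THRESHOLD = 0.5

def prune_chunks(question, chunks, score, threshold=THRESHOLD):
    """Keep only chunks whose relevance clears the threshold,
    highest-scoring first."""
    scored = [(score(question, c), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for s, c in scored if s >= threshold]

# Made-up relevance scores for five retrieved chunks:
fake_scores = {"a": 0.91, "b": 0.34, "c": 0.72, "d": 0.18, "e": 0.55}
kept = prune_chunks("q", list(fake_scores), lambda q, c: fake_scores[c])
print(kept)  # ['a', 'c', 'e'] -- 2 of 5 chunks dropped before the LLM call
```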

4. Cache the system prompt. Anthropic's prompt caching gives a 90% discount on the cached input prefix; OpenAI's automatic prompt caching gives 50%. If your system prompt is 500 tokens on GPT-4o ($2.50/M input), caching drops that slice from about $0.00125 to $0.0006 per query. Tiny per query, real at scale.
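The caching savings are easy to size up front. A sketch using the discount rates from the article (90% Anthropic, 50% OpenAI auto-caching) and an assumed $2.50/M input rate:

```python
# Back-of-the-envelope savings from caching a 500-token system prompt.
# Discount rates follow the article; the $/M input rate is an
# assumed example.
def cached_prefix_cost(prefix_tokens, price_per_m, discount):
    """Return (full, cached) per-query cost of the cached prefix."""
    full = prefix_tokens / 1e6 * price_per_m
    return full, full * (1 - discount)

full, cached = cached_prefix_cost(500, 2.50, discount=0.50)  # 50% auto-cache
print(full, cached)               # 0.00125 -> 0.000625 per query
print((full - cached) * 150_000)  # ~$93.75/month at 150k queries
```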

5. Use a cheap model for generation. The generation step in RAG is usually the easy part — the LLM is just summarizing what is already in the context. Cheap models do this well. Save the premium models for prompts where the LLM has to reason beyond the context.

Real Workload Math

Let's walk through a real example: a customer support RAG system answering 5,000 queries per day with the standard config (5 chunks × 800 tokens, 500-token output).

Total per month: 150,000 queries × 5,050 tokens = 757.5M tokens. Input: 682.5M, Output: 75M.
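That monthly math, spelled out (the $0.15/$0.60 per-million rates are an assumed GPT-4o-mini-style snapshot, matching the table below):

```python
# Monthly token math for the support workload: 5,000 queries/day,
# standard config (4,550 input + 500 output per query).
QUERIES_PER_DAY = 5_000
DAYS = 30
INPUT_TOKENS, OUTPUT_TOKENS = 4_550, 500

queries = QUERIES_PER_DAY * DAYS
input_m = queries * INPUT_TOKENS / 1e6    # millions of input tokens
output_m = queries * OUTPUT_TOKENS / 1e6  # millions of output tokens
print(queries, input_m, output_m)  # 150000 682.5 75.0

# Monthly cost at assumed rates ($0.15/M in, $0.60/M out):
print(input_m * 0.15 + output_m * 0.60)  # 147.375 -> ~$147/month
```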

| Model | Monthly cost | Per query | Per active customer (10 queries/mo) |
|---|---|---|---|
| Gemini 2.0 Flash | $98.25 | $0.00066 | $0.0066 |
| GPT-4o mini | $147.38 | $0.00098 | $0.0098 |
| Claude Haiku 3.5 | $846.00 | $0.00564 | $0.056 |
| GPT-4o | $2,456.25 | $0.01638 | $0.164 |
| Claude Sonnet 4 | $3,172.50 | $0.02115 | $0.212 |
| Claude Opus 4 | $15,862.50 | $0.10575 | $1.058 |

The spread is dramatic. On Gemini Flash, the entire RAG bill is $98/month — less than a single customer's $99/month SaaS subscription. On Claude Opus, it's $15,862/month — impossible to bury in any reasonable plan.

The Default RAG Setup for 2026

  1. Embed documents with text-embedding-3-small or BGE Large
  2. Chunk size: 400-600 tokens
  3. Retrieve top 5 by vector similarity
  4. Rerank with Cohere Rerank, drop anything below 0.5
  5. Generate on GPT-4o mini or Gemini 2.5 Flash
  6. Cache the system prompt (90% input discount on Anthropic, 50% on OpenAI auto-caching)
  7. Escalate only the prompts where quality fails to GPT-4.1 or Claude Sonnet
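The checklist above can be captured as a single config sketch. The model names, threshold, and escalation rule come from the article; the dictionary shape and `pick_generator` helper are illustrative, not tied to any SDK.

```python
# The 2026 default RAG setup as one config sketch (illustrative,
# not a real SDK schema).
RAG_DEFAULTS = {
    "embedding_model": "text-embedding-3-small",  # or BGE Large
    "chunk_tokens": (400, 600),                   # target chunk-size range
    "top_k": 5,                                   # vector-similarity retrieval
    "rerank": {"model": "cohere-rerank", "min_score": 0.5},
    "generator": "gpt-4o-mini",                   # or Gemini 2.5 Flash
    "cache_system_prompt": True,                  # 90% / 50% input discount
    "escalate_to": "claude-sonnet",               # only on quality failures
}

def pick_generator(quality_ok: bool, cfg=RAG_DEFAULTS):
    """Route to the cheap model unless this prompt previously failed QA."""
    return cfg["generator"] if quality_ok else cfg["escalate_to"]

print(pick_generator(True))   # gpt-4o-mini
print(pick_generator(False))  # claude-sonnet
```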

Run the numbers in the AI Cost Calculator using the RAG preset. Compare what you get vs your budget.

Calculate your RAG bill across every model in one click.

Open AI Cost Calculator →