Most LLM prompts are 30-60% longer than they need to be. The extra tokens cost money, slow responses, and sometimes hurt quality by burying the important parts in noise. Here are eight tactics that consistently reduce token count without making the output worse.
The average system prompt has 30-50% filler. Words like "please," "kindly," "I would like you to," and verbose role descriptions add up.
Before (78 tokens):
You are a helpful and knowledgeable customer support agent for our software company. Your role is to assist users with their questions in a friendly and professional manner. Please respond clearly and try to be as helpful as possible.
After (24 tokens):
You are a customer support agent for SoftCo. Answer questions clearly and accurately.
Same instruction, 70% fewer tokens. The model doesn't need "please" or "kindly" — it's a model. Remove anything that doesn't change the output.
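Filler removal can even be automated as a preprocessing pass. The sketch below strips a small, hypothetical list of politeness phrases with regexes; the phrase list is illustrative and you would extend it for your own prompts:

```python
import re

# Illustrative filler phrases; extend this list for your own prompts.
FILLER = [
    r"\bplease\b,?\s*",
    r"\bkindly\b\s*",
    r"\bI would like you to\s*",
    r"\btry to be as helpful as possible\.?",
]

def strip_filler(prompt: str) -> str:
    """Remove politeness filler that doesn't change model behavior."""
    for pattern in FILLER:
        prompt = re.sub(pattern, "", prompt, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by removals.
    return re.sub(r"\s{2,}", " ", prompt).strip()
```

Run it over your system prompts and diff the token counts; anything the pass removes without changing outputs was pure cost.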
Most chatbots include the full conversation history with every message. This is the single biggest waste of tokens in production AI systems.
For most chats, only the last 5-10 turns matter for context. Older turns can be condensed into a rolling summary or dropped entirely.
A typical 30-turn conversation can shrink from 15,000 tokens of history to 3,000 tokens of recent history + a 200-token summary. That's 80% reduction.
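A minimal sketch of that windowing, assuming a standard list of role/content message dicts with alternating user and assistant turns (the summary itself would be produced elsewhere, e.g. by a cheap summarization call):

```python
def trim_history(messages, keep_turns=5, summary=None):
    """Keep only the last `keep_turns` user/assistant turn pairs.
    Older turns are assumed to be condensed into `summary` elsewhere."""
    recent = messages[-keep_turns * 2:]  # each turn = user msg + assistant msg
    trimmed = []
    if summary:
        trimmed.append({"role": "system",
                        "content": f"Summary of earlier conversation: {summary}"})
    trimmed.extend(recent)
    return trimmed
```

For a 30-turn conversation this sends 10 recent messages plus one short summary message instead of all 60.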
Lists and structured data use fewer tokens than prose explanations.
Before (45 tokens):
The user wants to book a flight from New York to Los Angeles on Friday March 15th at around 10am, preferably with one stop or non-stop, and they have a budget of about $400.
After (24 tokens):
Booking: NYC → LAX, Fri Mar 15 10am, 1 stop max, budget $400
Same information, ~50% fewer tokens. Models parse structured data well.
Few-shot prompting (giving the model examples) is powerful but expensive. Each example you include costs tokens. Test how many examples you actually need.
Common pattern: prompts include 5-10 examples when 2-3 would work. Removing 5 examples can save 500-2,000 tokens per call. Across thousands of calls, that's real money.
Test: run your prompt with 5 examples, then 3, then 1. If quality stays the same, drop the extras.
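That ablation loop is easy to script. In this sketch, `call_model` and `score` are hypothetical placeholders for your API client and your eval metric; only the loop structure is the point:

```python
def ablate_examples(base_prompt, examples, eval_set, call_model, score):
    """Run the same eval with fewer few-shot examples each time.
    `call_model(prompt, query)` and `score(output, query)` are placeholders
    for your actual API call and quality metric."""
    results = {}
    for n in (len(examples), 3, 1, 0):
        prompt = base_prompt + "\n\n".join(examples[:n])
        outputs = [call_model(prompt, q) for q in eval_set]
        results[n] = sum(score(o, q) for o, q in zip(outputs, eval_set)) / len(eval_set)
    return results
```

If the score at 3 examples matches the score at 5, the last two examples were pure token cost.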
Retrieved context for RAG is often 60-80% of input tokens. Three common ways to cut it: retrieve fewer chunks (lower top-k), use smaller chunks, and filter or rerank out low-relevance chunks before they reach the prompt.
Combined, these can cut RAG context from 8,000 tokens to 2,500 tokens per query — usually with no quality loss.
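One simple way to enforce the cut is a hard token budget on retrieved context. This sketch greedily packs the highest-scoring chunks into a budget; the `(score, text, token_count)` tuples are an assumed retriever output shape, with token counts coming from your tokenizer in practice:

```python
def select_chunks(chunks, token_budget=2500):
    """Greedily pack the highest-scoring retrieved chunks into a token budget.
    `chunks` is a list of (score, text, token_count) tuples from retrieval."""
    selected, used = [], 0
    for score, text, tokens in sorted(chunks, key=lambda c: -c[0]):
        if used + tokens <= token_budget:
            selected.append(text)
            used += tokens
    return selected, used
```

Whatever your retriever returns, the budget guarantees RAG context can never silently balloon past its share of the prompt.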
Output tokens are usually 3-5x more expensive than input tokens. If you don't need a long response, cap it.
For most chatbot responses, max_tokens of 300-500 is plenty. For Q&A, 100-200 is often enough. For summarization, set a target word count and cap accordingly.
Without a cap, models will sometimes write essays in response to simple questions. The cap prevents this and makes cost predictable.
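Concretely, the cap is one parameter on the request. This is a sketch shaped like an OpenAI-style chat completions payload; the model name and messages are illustrative:

```python
# Request parameters shaped like a typical chat-completions call.
# The max_tokens cap keeps simple answers short and makes cost predictable.
request = {
    "model": "gpt-4o",  # example model name
    "messages": [
        {"role": "system", "content": "You are a customer support agent for SoftCo."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
    "max_tokens": 300,  # plenty for a chatbot answer; raise only if responses truncate
}
```

If you see responses cut off mid-sentence, the cap is too low for that task; raise it in steps rather than removing it.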
If your prompt includes long background context (company description, product documentation, prior conversation), most of it is repetitive across queries. Replace it with a short, hand-written summary that keeps only what the model actually needs.
For static content, the summary can be hand-tuned once and reused millions of times. Token savings compound with every API call.
If part of your prompt never changes (system message, persona, fixed context), use prompt caching. Both Anthropic and OpenAI offer it.
This isn't strictly "fewer tokens" — you still send them — but the cached portion is much cheaper. For chatbots with a fixed system prompt, this is the single largest cost reduction available.
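As a sketch, here is what marking a fixed system prompt as cacheable looks like following Anthropic's prompt-caching format, where a `cache_control` field on a system content block marks the prefix for reuse (the model name and text are placeholders):

```python
# System block marked cacheable per Anthropic's prompt-caching format;
# the fixed persona/context is billed at the cheaper cached rate on reuse.
payload = {
    "model": "claude-3-5-sonnet-20241022",  # example model name
    "max_tokens": 400,
    "system": [
        {
            "type": "text",
            "text": "You are a customer support agent for SoftCo. <long fixed docs here>",
            "cache_control": {"type": "ephemeral"},  # marks this prefix for caching
        }
    ],
    "messages": [{"role": "user", "content": "How do I reset my password?"}],
}
```

OpenAI's caching, by contrast, is automatic for sufficiently long repeated prefixes, so the main design rule there is to keep the stable parts of your prompt at the front.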
Combining these tactics on a typical chatbot:
| Component | Before | After | Reduction |
|---|---|---|---|
| System prompt | 800 | 300 | -63% |
| Chat history (10 turns) | 5,000 | 1,500 | -70% |
| RAG context | 6,000 | 2,500 | -58% |
| User message | 200 | 200 | 0% |
| Total input | 12,000 | 4,500 | -63% |
| Output (max_tokens cap) | 800 | 400 | -50% |
| Per-request total | 12,800 | 4,900 | -62% |
62% reduction per request. At 10,000 requests per day, that 7,900-token-per-request saving removes roughly 79M tokens of daily traffic; multiply by your model's per-token price to see the dollar impact, which on a frontier model is typically hundreds of dollars per day. Hours of optimization work offset weeks of compute spending.
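The arithmetic from the table is worth keeping as a reusable calculator. Pricing changes often, so this sketch takes the per-million-token price as a parameter rather than hardcoding any model's rate (the blended single price is a simplification; input and output tokens are priced differently in practice):

```python
def daily_token_savings(before_tokens, after_tokens, requests_per_day):
    """Tokens removed from daily traffic by the optimizations above."""
    return (before_tokens - after_tokens) * requests_per_day

def daily_dollar_savings(token_savings, price_per_million):
    """Convert token savings to dollars using a blended per-million-token price.
    Look up your model's current pricing; it changes frequently."""
    return token_savings * price_per_million / 1e6

# Figures from the table: 12,800 tokens/request before, 4,900 after.
saved = daily_token_savings(12_800, 4_900, 10_000)  # 79,000,000 tokens/day
```

For example, at an illustrative blended price of $5 per million tokens, `daily_dollar_savings(saved, 5.0)` comes to $395/day.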
Measure your prompt before and after optimization.