Per-token pricing is misleading. A model that's 5x cheaper per token can cost more end-to-end if it generates 10x more output to answer the same question. Here are the four hidden multipliers that turn "cheap" LLMs into expensive ones — and how to measure your real cost per task instead of cost per token.
Cheap models tend to write more. They hedge ("It depends on a number of factors..."), they explain (when you didn't ask for explanation), and they pad. Premium models are usually more concise out of the box.
Real example: ask GPT-4o mini and GPT-4.1 the same question — "What is the capital of Australia?"
Per-token, GPT-4o mini is 13x cheaper. But for this question it used ~4.2x more output tokens, so end-to-end GPT-4o mini wins by only ~3x, not 13x. Across hundreds of similar questions the gap stays in that range. For tasks where verbosity is more pronounced (multi-part questions, technical content), the verbosity multiplier can flip the math entirely.
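That arithmetic generalizes. Here's a minimal sketch with illustrative prices ($7.80 per million output tokens for the premium model, 13x less for the cheap one; both are stand-ins, not real price-sheet values) that lets you vary the verbosity multiplier and see where the math flips:

```python
def output_cost(output_tokens, price_per_m):
    """Dollar cost of generated output, given a per-million-token price."""
    return output_tokens * price_per_m / 1_000_000

def cost_ratio(price_gap=13.0, verbosity_gap=4.2, premium_tokens=10):
    """premium-cost / cheap-cost for one task.

    price_gap:     how many times cheaper the cheap model is per token
    verbosity_gap: how many times more output the cheap model emits
    """
    premium = output_cost(premium_tokens, 7.80)  # illustrative price
    cheap = output_cost(premium_tokens * verbosity_gap, 7.80 / price_gap)
    return premium / cheap

print(round(cost_ratio(), 1))                  # 3.1: cheap wins by ~3x, not 13x
print(round(cost_ratio(verbosity_gap=20), 2))  # 0.65: verbose enough, cheap loses
```

Once the verbosity multiplier exceeds the per-token price gap, the "cheap" model is the expensive one.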
Cheap models fail more often. Each failure means a retry, often with a clarifying prompt that's longer than the original. If a cheap model has a 15% retry rate and a premium model has a 3% retry rate, the cheap model incurs ~12% more calls, and since retry prompts often run 1.5x the length of the original, its effective cost per successful task is meaningfully higher than the headline price.
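The retry arithmetic folds into a single expected-cost formula. A minimal sketch, assuming a failed call gets exactly one retry with a 1.5x-longer prompt and that the retry succeeds (both simplifications; measure your own rates):

```python
def cost_per_success(base_cost, retry_rate, retry_multiplier=1.5):
    """Expected cost per successful task.

    Assumes a failure costs one extra call whose prompt is
    retry_multiplier times the original, and that the retry succeeds.
    """
    return base_cost * (1 + retry_rate * retry_multiplier)

# Normalizing the cheap model's headline cost to 1.0:
cheap_effective = cost_per_success(1.0, retry_rate=0.15)    # 1.225
premium_effective = cost_per_success(3.0, retry_rate=0.03)  # 3.135
```

At a 15% retry rate, the cheap model's real cost per successful task runs ~22.5% above its headline price; the premium model's penalty is only ~4.5%.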
Compare real cost per task across models — not just per-token.
Open AI Cost Calculator →

Some teams use cheap models for the heavy lifting and then run a premium model as a "cleanup" pass: the premium model verifies, fixes, or rewrites the cheap output. This pattern can work, but it doubles your cost per task if you're not careful.
If your "cheap" workflow is actually cheap generation followed by a premium review of every output, you're paying for both. End-to-end, this can cost more than just running the premium model in the first place: used directly, the premium model gets the output right ~80% of the time, and you skip the cheap model entirely.
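A quick expected-cost comparison makes the trap concrete. The relative per-call costs below are illustrative (cheap = 1, premium = 3), and the model assumes a single retry handles the premium model's ~20% failures, which slightly understates its true expected retries:

```python
# Illustrative relative per-call costs, not real prices.
CHEAP, PREMIUM = 1.0, 3.0

# Pattern A: cheap generation + premium cleanup pass on every output.
cleanup_pipeline = CHEAP + PREMIUM      # 4.0 per task

# Pattern B: premium model directly; right ~80% of the time,
# so ~20% of tasks cost one extra premium call.
premium_direct = PREMIUM * (1 + 0.20)   # 3.6 expected per task
```

Under these assumptions the "cheap plus cleanup" pipeline costs more per task than simply calling the premium model.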
Cheap models sometimes can't handle a complex task in one shot, so teams break it into a chain of smaller calls. That might mean 4 API calls and 4,500 tokens to do what a premium model would do in one call with 2,500 tokens. If the premium model is 3x more expensive per token but needs only ~55% as many tokens and 25% as many calls, the headline 3x price gap shrinks to ~1.65x end-to-end, and once you add per-call overhead and the retry rates above, the premium model can come out ahead.
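Plugging those illustrative numbers into a quick comparison shows how much the headline gap narrows. Raw token cost alone still favors the chain; the retry and per-call overhead terms are what decide it:

```python
# Illustrative numbers from the text: a 4-call cheap chain totaling 4,500
# tokens vs one premium call of 2,500 tokens at 3x the per-token price.
CHEAP_PRICE, PREMIUM_PRICE = 1.0, 3.0        # relative per-token prices

chain_cost = 4_500 * CHEAP_PRICE             # 4 cheap calls
single_cost = 2_500 * PREMIUM_PRICE          # 1 premium call
print(single_cost / chain_cost)              # ~1.67, far below the 3x headline gap

# Fold in the retry rates from earlier (15% vs 3%, 1.5x-longer retry prompts):
chain_eff = chain_cost * (1 + 0.15 * 1.5)    # 5512.5
single_eff = single_cost * (1 + 0.03 * 1.5)  # 7837.5
print(single_eff / chain_eff)                # ~1.42: the gap keeps narrowing
```

Any fixed per-call overhead (repeated system prompts, shared context re-sent on each of the 4 calls) lands on the chain's side of the ledger and can close the remaining gap.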
Running the numbers usually surprises people. The "cheap" model is often only 1.5-3x cheaper end-to-end, not 10x. And for some tasks, the premium model is actually cheaper.
Cheap models are reliably cheaper end-to-end for short, simple tasks where verbosity and retries stay low. Premium models can be cheaper end-to-end for complex, multi-step, or verbosity-prone work: multi-part questions, technical content, anything that triggers retries or task decomposition.
Pick your 5 most common task types. Run 50 examples of each through GPT-4o mini and GPT-4.1 (or your cheap and premium options). Compare end-to-end cost per successful task. The answer for your specific workload will be: cheap wins on some tasks, premium wins on others. Route accordingly.
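The harness for that comparison can be very small. A sketch, where `call_model` and `is_success` are placeholders you supply (a client wrapper returning `(output, dollar_cost)` and a task-specific pass/fail check):

```python
def cost_per_successful_task(tasks, call_model, is_success, max_retries=1):
    """Total spend divided by successes, retrying failed tasks once by default."""
    total_cost, successes = 0.0, 0
    for task in tasks:
        for _ in range(max_retries + 1):
            output, cost = call_model(task)
            total_cost += cost
            if is_success(task, output):
                successes += 1
                break
    return total_cost / successes if successes else float("inf")

# Usage sketch: run the same 50 tasks through each candidate model.
# results = {m: cost_per_successful_task(tasks, clients[m], checkers[m])
#            for m in ("gpt-4o-mini", "gpt-4.1")}
```

Because it divides by successes rather than attempts, this metric automatically charges each model for its own retries and failures.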
Use the AI Cost Calculator to see baseline per-token costs, then add your retry rate and verbosity multiplier to estimate real cost per task.
Compare real per-task cost across every model.
Open AI Cost Calculator →