Paste the same English paragraph into GPT, Claude, and Gemini and you get three different token counts. The variance is small (usually 5-10%) but it confuses people who expect tokenization to be standardized. Here is why each major LLM tokenizes differently — and when the difference matters.
Take this sentence: "The quick brown fox jumps over the lazy dog near the riverbank."
| Tokenizer | Token count |
|---|---|
| GPT-4o (o200k_base) | 13 |
| GPT-3.5 (cl100k_base) | 13 |
| Claude (Anthropic tokenizer) | 13 |
| Gemini (SentencePiece) | 14 |
| Llama (BPE) | 13 |
For this sentence, the difference is 1 token. Across an entire article, the total difference can be 5-15%. Across millions of API calls, that adds up.
1. Byte Pair Encoding (BPE) — used by GPT, DeepSeek, Llama. The tokenizer learns common subword pieces from training data. Frequent character pairs get merged into single tokens. Vocabulary size is typically 50K-200K tokens. Pros: efficient for English, handles unknown words by breaking them down. Cons: tokenization quality varies for non-English languages.
2. SentencePiece — used by Gemini, T5, and other Google models. Treats text as a raw character stream and learns subword units directly, with no language-specific pre-tokenization; whitespace is encoded as part of tokens. Pros: handles any language, including those without word separators (Chinese, Japanese, Thai). Cons: per-language efficiency differs somewhat from BPE.
3. Custom variants — used by Claude. Anthropic uses its own BPE-style tokenizer, trained on Anthropic's own data mix. Pros: tuned for Anthropic's typical use cases. Cons: the closed implementation makes exact reproduction outside Anthropic's tools difficult.
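The merge loop at the heart of BPE can be sketched in a few lines. This is a toy illustration of the mechanism only, not any vendor's actual tokenizer (real implementations work on bytes, carry large learned merge tables, and do pre-tokenization):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def toy_bpe(text, num_merges):
    """Start from single characters, greedily merge the most frequent pair."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens

print(toy_bpe("low lower lowest", 2))
```

After two merges, the shared stem "low" has collapsed into a single token in all three words — exactly the "frequent pairs become single tokens" behavior described above.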
See approximate token counts that work across all major models.

Open Token Counter →

A tokenizer's vocabulary is a learned dictionary of common pieces. Bigger vocabulary = each piece is more specific = fewer tokens per word for common content.
| Tokenizer | Vocabulary size |
|---|---|
| GPT-3.5 (cl100k_base) | 100,256 |
| GPT-4o (o200k_base) | 199,997 |
| Claude (recent versions) | ~150,000 |
| Gemini (SentencePiece) | ~256,000 |
| Llama 3 | 128,256 |
GPT-4o doubled the vocabulary size from GPT-3.5, which is why GPT-4o uses fewer tokens than GPT-3.5 for the same text. Larger vocabularies tend to be more token-efficient but increase model size and memory requirements.
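The vocabulary-size effect is easy to see with a toy greedy longest-match tokenizer (real tokenizers apply learned merges rather than longest-match, but the trade-off is the same): a vocabulary that contains the whole word emits one token where a smaller vocabulary needs three.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match: take the longest vocabulary entry that
    prefixes the remaining text, falling back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        match = next(
            (text[i:i + n] for n in range(len(text) - i, 0, -1)
             if text[i:i + n] in vocab),
            text[i],  # single-character fallback for out-of-vocab input
        )
        tokens.append(match)
        i += len(match)
    return tokens

small_vocab = {"token", "izer", "s"}
large_vocab = small_vocab | {"tokenizer", "tokenizers"}

print(greedy_tokenize("tokenizers", small_vocab))  # ['token', 'izer', 's']
print(greedy_tokenize("tokenizers", large_vocab))  # ['tokenizers']
```

Scaled up from this toy example to ~200K entries, that is why o200k_base spends fewer tokens than cl100k_base on the same text.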
The variance between tokenizers is small for normal English prose but grows for:
Code. Different tokenizers handle code differently: Python identifiers, JavaScript symbols, and SQL keywords all split along different boundaries. A 1,000-line Python file might be roughly 4,500 tokens on GPT-4o and 4,300 on Claude.
Non-English languages. Tokenizers trained on English-heavy data often use 2-3x more tokens for non-English text than English text. Gemini tends to be more efficient here because of broader multilingual training.
Special characters and emojis. Each tokenizer handles emojis, math symbols, and Unicode differently. An emoji might be 1 token on Gemini and 3 tokens on GPT-4o.
Rare names and jargon. Words not in the vocabulary get split into subwords. Rare medical terms, scientific notation, and unusual proper nouns can use 4-7 tokens each.
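The special-character case above largely comes down to UTF-8: byte-level tokenizers start from raw bytes, and anything outside the learned vocabulary falls back toward one token per byte. A quick check of byte lengths shows why an emoji can cost several tokens:

```python
# Byte-level tokenizers start from UTF-8 bytes; characters outside the
# learned vocabulary fall back toward one token per byte, so multi-byte
# characters cost more unless merges cover them.
samples = {
    "a": "ASCII letter",
    "é": "accented Latin letter",
    "中": "CJK character",
    "🦜": "emoji",
}
for ch, label in samples.items():
    n = len(ch.encode("utf-8"))
    print(f"{label} ({ch!r}): {n} UTF-8 byte(s)")
```

An ASCII letter is 1 byte, but an emoji is 4 — so on a tokenizer whose training data rarely contained that emoji, it can surface as 3-4 tokens, while a tokenizer that learned it as one piece charges a single token.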
Here's what 1,000 words of different content looks like across tokenizers:
| Content | GPT-4o | Claude | Gemini | Llama |
|---|---|---|---|---|
| Plain English news article | 1,250 | 1,290 | 1,275 | 1,300 |
| Python code (1,000 LOC) | 4,500 | 4,300 | 4,100 | 4,400 |
| Spanish article | 1,400 | 1,450 | 1,350 | 1,420 |
| Chinese article | 2,100 | 2,300 | 1,800 | 2,200 |
| JSON output | 1,100 | 1,150 | 1,080 | 1,120 |
| Math equations (LaTeX) | 1,800 | 1,900 | 2,100 | 1,950 |
For English news, the variance is ~4%. For Chinese, it's ~28% — Gemini wins by a lot. For math, GPT wins. The "best" tokenizer depends on what you're sending.
For estimation: Pick any major tokenizer, multiply by 1.05-1.15 to cover the worst case across models. For most workloads this is accurate enough for budgeting.
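A minimal budgeting helper along these lines (the ~4 characters-per-token ratio and the 1.15 safety factor are rule-of-thumb assumptions for English prose, not any model's real tokenizer):

```python
import math

def estimate_tokens(text, chars_per_token=4.0, safety=1.15):
    """Rough, tokenizer-agnostic token estimate for budgeting.

    chars_per_token ~4 is a common rule of thumb for English prose;
    safety (1.05-1.15) covers the worst-case spread across tokenizers.
    Use the official tokenizer when exact counts matter.
    """
    return math.ceil(len(text) / chars_per_token * safety)

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # 13
```

Tune `chars_per_token` down (toward 2) for CJK text or dense code, where the per-token character yield is much lower.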
For exact billing: Use the official tokenizer for your specific model. tiktoken for GPT, count_tokens API for Claude, count_tokens method for Gemini.
For multilingual workloads: Gemini's broader multilingual training usually means fewer tokens (and lower cost) for non-English content. If you process a lot of Asian languages, this can swing the cost decision.
For code workloads: Test the same code samples on each tokenizer you're considering. Differences of 10-20% are normal.
If you process 1 million queries per month at roughly 500 tokens each, a 10% variance between tokenizers is about 50 million tokens you didn't budget for. At $0.10-$1.00 per million input tokens, that's a $5-50/month difference. Small for individuals, real at scale.
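That budgeting arithmetic as a sketch (the per-query token count and per-million-token prices are hypothetical placeholders, not vendor rates):

```python
def monthly_variance_cost(queries, avg_tokens, variance, price_per_million_tokens):
    """Extra monthly cost from a token-count variance you did not budget for."""
    extra_tokens = queries * avg_tokens * variance
    return extra_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical workload: 1M queries/month, ~500 tokens per query, 10% variance
print(monthly_variance_cost(1_000_000, 500, 0.10, 0.10))  # 5.0  (at $0.10/M tokens)
print(monthly_variance_cost(1_000_000, 500, 0.10, 1.00))  # 50.0 (at $1.00/M tokens)
```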
For most teams, the tokenizer differences are noise — pick the model that wins on quality and price, accept the small variance. Use the Token Counter for approximations and the official tokenizer when exact counts matter.