Paste the same English paragraph into GPT, Claude, and Gemini and you get three different token counts. The variance is small (usually 5-10%) but it confuses people who expect tokenization to be standardized. Here is why each major LLM tokenizes differently — and when the difference matters.
Take this sentence: "The quick brown fox jumps over the lazy dog near the riverbank."
| Tokenizer | Token count |
|---|---|
| GPT-4o (o200k_base) | 13 |
| GPT-3.5 (cl100k_base) | 13 |
| Claude (Anthropic tokenizer) | 13 |
| Gemini (SentencePiece) | 14 |
| Llama (BPE) | 13 |
For this sentence, the difference is 1 token. Across an entire article, the total difference can be 5-15%. Across millions of API calls, that adds up.
1. Byte Pair Encoding (BPE) — used by GPT, DeepSeek, Llama. The tokenizer learns common subword pieces from training data. Frequent character pairs get merged into single tokens. Vocabulary size is typically 50K-200K tokens. Pros: efficient for English, handles unknown words by breaking them down. Cons: tokenization quality varies for non-English languages.
2. SentencePiece — used by Gemini, T5, and other Google models. Treats text as a raw character stream and learns subword units directly, with no language-specific pre-tokenization; whitespace is encoded as part of tokens. Pros: handles any language, including those without word separators (Chinese, Japanese, Thai). Cons: per-language efficiency differs somewhat from BPE.
3. Custom variants — used by Claude. Anthropic uses its own BPE-style tokenizer, trained on Anthropic's own data mix. Pros: tuned for Anthropic's typical use cases. Cons: the closed implementation makes exact reproduction outside Anthropic's tools difficult.
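The merge loop at the heart of BPE can be sketched in a few lines. This is a toy illustration of the mechanism only, not any vendor's actual tokenizer (real implementations work on bytes, carry large learned merge tables, and do pre-tokenization):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def toy_bpe(text, num_merges):
    """Start from single characters, greedily merge the most frequent pair."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens

print(toy_bpe("low lower lowest", 2))
```

After two merges, the shared stem "low" has collapsed into a single token in all three words — exactly the "frequent pairs become single tokens" behavior described above.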
See approximate token counts that work across all major models.

Open Token Counter →

A tokenizer's vocabulary is a learned dictionary of common pieces. Bigger vocabulary = each piece is more specific = fewer tokens per word for common content.
| Tokenizer | Vocabulary size |
|---|---|
| GPT-3.5 (cl100k_base) | 100,256 |
| GPT-4o (o200k_base) | 199,997 |
| Claude (recent versions) | ~150,000 |
| Gemini (SentencePiece) | ~256,000 |
| Llama 3 | 128,256 |
GPT-4o doubled the vocabulary size from GPT-3.5, which is why GPT-4o uses fewer tokens than GPT-3.5 for the same text. Larger vocabularies tend to be more token-efficient but increase model size and memory requirements.
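The vocabulary-size effect is easy to see with a toy greedy longest-match tokenizer (real tokenizers apply learned merges rather than longest-match, but the trade-off is the same): a vocabulary that contains the whole word emits one token where a smaller vocabulary needs three.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match: take the longest vocabulary entry that
    prefixes the remaining text, falling back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        match = next(
            (text[i:i + n] for n in range(len(text) - i, 0, -1)
             if text[i:i + n] in vocab),
            text[i],  # single-character fallback for out-of-vocab input
        )
        tokens.append(match)
        i += len(match)
    return tokens

small_vocab = {"token", "izer", "s"}
large_vocab = small_vocab | {"tokenizer", "tokenizers"}

print(greedy_tokenize("tokenizers", small_vocab))  # ['token', 'izer', 's']
print(greedy_tokenize("tokenizers", large_vocab))  # ['tokenizers']
```

Scaled up from this toy example to ~200K entries, that is why o200k_base spends fewer tokens than cl100k_base on the same text.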
The variance between tokenizers is small for normal English prose but grows for:
Code. Different tokenizers handle code differently: Python identifiers, JavaScript symbols, and SQL keywords all split along different boundaries. A 1,000-line Python file might be roughly 4,500 tokens on GPT-4o and 4,300 on Claude.
Non-English languages. Tokenizers trained on English-heavy data often use 2-3x more tokens for non-English text than English text. Gemini tends to be more efficient here because of broader multilingual training.
Special characters and emojis. Each tokenizer handles emojis, math symbols, and Unicode differently. An emoji might be 1 token on Gemini and 3 tokens on GPT-4o.
Rare names and jargon. Words not in the vocabulary get split into subwords. Rare medical terms, scientific notation, and unusual proper nouns can use 4-7 tokens each.
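The special-character case above largely comes down to UTF-8: byte-level tokenizers start from raw bytes, and anything outside the learned vocabulary falls back toward one token per byte. A quick check of byte lengths shows why an emoji can cost several tokens:

```python
# Byte-level tokenizers start from UTF-8 bytes; characters outside the
# learned vocabulary fall back toward one token per byte, so multi-byte
# characters cost more unless merges cover them.
samples = {
    "a": "ASCII letter",
    "é": "accented Latin letter",
    "中": "CJK character",
    "🦜": "emoji",
}
for ch, label in samples.items():
    n = len(ch.encode("utf-8"))
    print(f"{label} ({ch!r}): {n} UTF-8 byte(s)")
```

An ASCII letter is 1 byte, but an emoji is 4 — so on a tokenizer whose training data rarely contained that emoji, it can surface as 3-4 tokens, while a tokenizer that learned it as one piece charges a single token.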
Here's what 1,000 words of different content looks like across tokenizers:
| Content | GPT-4o | Claude | Gemini | Llama |
|---|---|---|---|---|
| Plain English news article | 1,250 | 1,290 | 1,275 | 1,300 |
| Python code (1,000 LOC) | 4,500 | 4,300 | 4,100 | 4,400 |
| Spanish article | 1,400 | 1,450 | 1,350 | 1,420 |
| Chinese article | 2,100 | 2,300 | 1,800 | 2,200 |
| JSON output | 1,100 | 1,150 | 1,080 | 1,120 |
| Math equations (LaTeX) | 1,800 | 1,900 | 2,100 | 1,950 |
For English news, the variance is ~4%. For Chinese, it's ~28% — Gemini wins by a lot. For math, GPT wins. The "best" tokenizer depends on what you're sending.
For estimation: Pick any major tokenizer, multiply by 1.05-1.15 to cover the worst case across models. For most workloads this is accurate enough for budgeting.
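A minimal budgeting helper along these lines (the ~4 characters-per-token ratio and the 1.15 safety factor are rule-of-thumb assumptions for English prose, not any model's real tokenizer):

```python
import math

def estimate_tokens(text, chars_per_token=4.0, safety=1.15):
    """Rough, tokenizer-agnostic token estimate for budgeting.

    chars_per_token ~4 is a common rule of thumb for English prose;
    safety (1.05-1.15) covers the worst-case spread across tokenizers.
    Use the official tokenizer when exact counts matter.
    """
    return math.ceil(len(text) / chars_per_token * safety)

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # 13
```

Tune `chars_per_token` down (toward 2) for CJK text or dense code, where the per-token character yield is much lower.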
For exact billing: Use the official tokenizer for your specific model. tiktoken for GPT, count_tokens API for Claude, count_tokens method for Gemini.
For multilingual workloads: Gemini's broader multilingual training usually means fewer tokens (and lower cost) for non-English content. If you process a lot of Asian languages, this can swing the cost decision.
For code workloads: Test the same code samples on each tokenizer you're considering. Differences of 10-20% are normal.
If you process 1 million queries per month at roughly 500 tokens each, a 10% variance between tokenizers is about 50 million tokens you didn't budget for. At $0.10-$1.00 per million input tokens, that's a $5-50/month difference. Small for individuals, real at scale.
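That budgeting arithmetic as a sketch (the per-query token count and per-million-token prices are hypothetical placeholders, not vendor rates):

```python
def monthly_variance_cost(queries, avg_tokens, variance, price_per_million_tokens):
    """Extra monthly cost from a token-count variance you did not budget for."""
    extra_tokens = queries * avg_tokens * variance
    return extra_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical workload: 1M queries/month, ~500 tokens per query, 10% variance
print(monthly_variance_cost(1_000_000, 500, 0.10, 0.10))  # 5.0  (at $0.10/M tokens)
print(monthly_variance_cost(1_000_000, 500, 0.10, 1.00))  # 50.0 (at $1.00/M tokens)
```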
For most teams, the tokenizer differences are noise — pick the model that wins on quality and price, accept the small variance. Use the Token Counter for approximations and the official tokenizer when exact counts matter.