Under the Hood of AI: How Tokenization Works in Large Language Models

What Exactly Is a Token?

A token is the smallest chunk of text an LLM actually sees. And here's what trips up most developers the first time: a token is not a word. Depending on the tokenizer, a token could be a full word ("hello"), a subword fragment ("un" + "forgett" + "able"), a single character, or even a raw byte sequence.

The reason LLMs don't just split on whitespace is both mathematical and practical. A word-level tokenizer would need a vocabulary containing every possible word in every language — an impossibly large lookup table. Conversely, a character-level tokenizer would produce extremely long sequences, making the self-attention mechanism in Transformer architectures prohibitively expensive (attention scales quadratically with sequence length). Subword tokenization hits the sweet spot: a manageable vocabulary size (typically 30,000 to 100,000 tokens) that can represent any text with reasonably short sequences.
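To make the sequence-length argument concrete, here is a minimal sketch that compares how many query-key pairs full self-attention would score for the same sentence under character-level and word-level splitting. The sentence and the helper name `attention_pairs` are illustrative assumptions, and real attention cost also depends on model width and implementation details.

```python
def attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by full self-attention over one sequence (n * n)."""
    return seq_len * seq_len


# Illustrative sentence (an assumption, not taken from any benchmark).
text = "Tokenization decides how long the input sequence is."

char_len = len(text)          # character-level: one token per character
word_len = len(text.split())  # word-level: one token per whitespace-separated word

# How many times more attention pairs the character-level sequence needs.
ratio = attention_pairs(char_len) / attention_pairs(word_len)

print(f"characters: {char_len}, words: {word_len}, attention cost ratio: {ratio:.1f}x")
```

Subword tokenizers land between these two extremes: sequences only somewhat longer than word-level, with a vocabulary that stays bounded.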

For English text, the rough heuristic is about 1 token per 4 characters or 1.3 tokens per word. But this ratio varies wildly: code tends to use 2–3x more tokens per character than prose, CJK (Chinese, Japanese, Korean) characters often consume 2–3 tokens each, and highly technical vocabulary gets split into more subword pieces than common words.
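As a rough sketch, the two English heuristics above can be turned into a quick estimator. The function name `estimate_tokens` and the choice to take the larger of the two estimates are assumptions made here for illustration; only the real tokenizer for your target model gives exact counts.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough English-only estimate: ~4 characters per token or ~1.3 tokens per word.

    Taking the larger of the two estimates is a conservative choice made for
    this sketch, not an established rule.
    """
    by_chars = len(text) / 4
    by_words = len(text.split()) * 1.3
    return math.ceil(max(by_chars, by_words))


print(estimate_tokens("hello world"))
```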

Byte-Pair Encoding: The Algorithm Behind GPT

The tokenizer behind every GPT model is Byte-Pair Encoding (BPE), originally a data compression trick published by Philip Gage in 1994 and repurposed for NLP by Sennrich et al. in 2016. GPT-2, GPT-3, and GPT-4 all run on byte-level variants of it.

The algorithm is dead simple, which is part of why it works so well:

1. Initialize the vocabulary with all individual bytes (or characters) present in the training corpus. For a byte-level BPE tokenizer, this starts with 256 base tokens.
2. Count all adjacent pairs of tokens in the corpus. For example, if the pair "t" + "h" appears 1 million times and "h" + "e" appears 800,000 times, "t" + "h" is the most frequent pair.
3. Merge the most frequent pair into a new single token. "t" + "h" becomes "th", and this new token is added to the vocabulary.
4. Repeat steps 2–3 until the vocabulary reaches the target size. GPT-4's tokenizer (cl100k_base) has approximately 100,000 tokens.
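The loop above can be sketched in a few lines of Python. This is a toy, character-level version that assumes the corpus is already split into words and stops after a fixed number of merges. Production tokenizers work on bytes, use pre-tokenization rules, and are heavily optimized. The names `train_bpe` and `merge_pair` are illustrative.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words (toy sketch)."""
    # Step 1: start from single characters; each distinct word maps to its count.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Step 3: merge the most frequent pair (ties go to the first pair seen).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words  # Step 4: repeat with the updated corpus
    return merges


corpus = ["the"] * 5 + ["then"] * 3 + ["he"] * 2
print(train_bpe(corpus, num_merges=3))
```

On this tiny corpus the pair "h" + "e" is the most frequent, so it is merged first, and "t" + "he" follows. The learned merge list, in order, is what a BPE tokenizer later replays to encode new text.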

What falls out of this process is surprisingly elegant: common words like "the", "and", and "function" naturally become single tokens through repeated merging. Rare words get split into familiar subword pieces. "unforgettable" might tokenize as ["un", "forgett", "able"] because each of those chunks showed up enough in the training data to earn its own slot in the vocabulary. The exact split depends on the tokenizer's learned merges.
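Here is a minimal sketch of how learned merges turn a word into tokens at encoding time. The merge table is a made-up, two-rule example rather than any real model's vocabulary, and `bpe_encode` is a hypothetical helper that simply replays the merges in the order they were learned.

```python
def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split `word` into characters, then replay each merge rule in learned order."""
    symbols = list(word)
    for a, b in merges:
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)  # this adjacent pair matches the rule
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table: "t"+"h" was learned first, then "th"+"e".
toy_merges = [("t", "h"), ("th", "e")]

print(bpe_encode("the", toy_merges))   # a common word collapses to one token
print(bpe_encode("then", toy_merges))  # a longer word reuses the "the" piece
print(bpe_encode("xyz", toy_merges))   # unseen text falls back to single characters
```

The last call shows why subword tokenizers never hit an unknown word: anything without a matching merge simply stays as base symbols.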

Try it yourself: Paste text into our Token Counter tool and watch how the token count changes as you add or remove words. Notice how common English words barely add tokens, while code and special characters inflate the count dramatically.

SentencePiece, WordPiece, and Other Tokenizers

While BPE dominates the GPT ecosystem, other tokenization algorithms serve different models and use cases:

• WordPiece (used by Google's BERT and early Transformer models) works similarly to BPE but selects merges based on a likelihood criterion rather than raw frequency. It chooses the merge that most increases the likelihood of the training corpus, which favors pairs whose parts rarely appear apart and can produce slightly different vocabularies than BPE.
• SentencePiece (used by Google's T5, Meta's LLaMA, and Mistral) is a language-agnostic tokenizer library that treats the input as a raw stream of Unicode characters, with whitespace encoded as an ordinary symbol, rather than as pre-tokenized words. It can train either BPE or Unigram vocabularies. This makes it especially effective for multilingual models because it doesn't assume whitespace-separated words, which is critical for languages like Japanese, Chinese, and Thai.
• Unigram (one of the algorithms SentencePiece implements) starts with a large vocabulary and iteratively prunes tokens that contribute least to the corpus likelihood, which is the opposite direction from BPE's bottom-up merging.
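To see how the merge criterion changes the outcome, the sketch below picks the next merge for the same toy counts under two rules: raw pair frequency (BPE) and the score commonly used to describe WordPiece, pair frequency divided by the product of the two symbols' frequencies. That formula is an assumption based on public descriptions of the algorithm, and the word counts and function names are invented for illustration.

```python
from collections import Counter


def count_pairs_and_symbols(words: dict) -> tuple[Counter, Counter]:
    """Count adjacent pairs and individual symbols, weighted by word frequency."""
    pairs, symbols = Counter(), Counter()
    for syms, freq in words.items():
        for s in syms:
            symbols[s] += freq
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs, symbols


def best_bpe_pair(words: dict) -> tuple[str, str]:
    """BPE rule: merge the pair that occurs most often."""
    pairs, _ = count_pairs_and_symbols(words)
    return max(pairs, key=pairs.get)


def best_wordpiece_pair(words: dict) -> tuple[str, str]:
    """WordPiece-style rule (assumed form): freq(pair) / (freq(a) * freq(b))."""
    pairs, symbols = count_pairs_and_symbols(words)
    return max(pairs, key=lambda p: pairs[p] / (symbols[p[0]] * symbols[p[1]]))


# Made-up counts: "t" is common everywhere, while "q" and "u" only occur together.
toy_words = {("t", "h"): 10, ("t", "o"): 8, ("q", "u"): 3}

print(best_bpe_pair(toy_words))        # most frequent pair
print(best_wordpiece_pair(toy_words))  # pair whose parts rarely appear apart
```

With these counts BPE merges "t" + "h" because it is the most frequent pair, while the WordPiece-style score prefers "q" + "u" because those symbols almost never occur without each other.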

Anthropic's Claude uses its own proprietary tokenizer, but the principles are similar. The key takeaway is that different models tokenize the same text differently — "Hello, world!" might be 4 tokens on GPT-4 but 3 tokens on Claude. This is why token count estimates are always approximations unless you use the exact tokenizer for the target model.

Why Tokenization Matters for Developers

Understanding tokenization isn't just academic curiosity — it has direct, measurable impact on your AI applications:

Cost Management

LLM APIs charge per token, not per word or character. GPT-4 Turbo costs $0.01 per 1,000 input tokens and $0.03 per 1,000 output tokens. If your system prompt is 500 tokens and you make 10,000 API calls per day, that system prompt alone costs $50/day. Trimming it to 300 tokens saves $20/day — $600/month — without changing functionality.
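The arithmetic behind those figures is easy to check. This sketch uses the prices and call volume quoted above. The helper name `daily_prompt_cost` and the 30-day month are assumptions.

```python
def daily_prompt_cost(prompt_tokens: int, calls_per_day: int, usd_per_1k_tokens: float) -> float:
    """Daily cost of sending a fixed prompt with every call, in US dollars."""
    return prompt_tokens * calls_per_day / 1000 * usd_per_1k_tokens


INPUT_PRICE = 0.01  # USD per 1,000 input tokens (GPT-4 Turbo figure quoted above)

before = daily_prompt_cost(500, 10_000, INPUT_PRICE)
after = daily_prompt_cost(300, 10_000, INPUT_PRICE)
daily_saving = before - after

print(f"before: ${before:.2f}/day, after: ${after:.2f}/day")
print(f"saving: ${daily_saving:.2f}/day, about ${daily_saving * 30:.2f} per 30-day month")
```

Swapping in your own prompt size, call volume, and current price per 1,000 tokens gives a quick estimate of what a prompt trim is worth.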

Context Window Limits

Every model has a maximum context window: GPT-4 Turbo supports 128K tokens, Claude 3 Opus supports 200K, GPT-3.5 Turbo is limited to 16K. Your input tokens plus output tokens must fit within this window. In RAG (Retrieval-Augmented Generation) applications, you need to carefully budget how many document chunks you can inject alongside your system prompt and user query.
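A context budget for RAG can be sketched as simple subtraction: start from the window size, reserve room for the system prompt, the user query, and the expected output, then see how many chunks fit in what remains. All numbers and the function name `max_chunks` below are illustrative assumptions.

```python
def max_chunks(
    context_window: int,
    system_tokens: int,
    query_tokens: int,
    reserved_output_tokens: int,
    tokens_per_chunk: int,
) -> int:
    """How many equally sized document chunks fit in the remaining context budget."""
    available = context_window - system_tokens - query_tokens - reserved_output_tokens
    if available <= 0:
        return 0  # the fixed parts already fill (or overflow) the window
    return available // tokens_per_chunk


# Example: a 16K window, 500-token system prompt, 200-token query,
# 1,000 tokens reserved for the answer, and 400-token chunks.
print(max_chunks(16_000, 500, 200, 1_000, 400))
```

Reserving output tokens up front matters because the response shares the same window as the input.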

Code vs. Prose Efficiency

Code is significantly more token-dense than natural language. A JSON object with 10 key-value pairs might consume 60+ tokens due to quotation marks, colons, commas, and braces. When sending structured data to an LLM, consider whether YAML or even plain-text key-value pairs would use fewer tokens while conveying the same information.
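As a rough illustration, the sketch below serializes the same record three ways and compares character counts. Character count is only a proxy for token count, since exact numbers depend on the target model's tokenizer, and the record itself is made up.

```python
import json

# Made-up record used only for this comparison.
record = {"id": 42, "name": "Ada", "role": "admin", "active": True}

pretty = json.dumps(record, indent=2)                       # indented JSON
compact = json.dumps(record, separators=(",", ":"))         # JSON without extra whitespace
plain = "\n".join(f"{key}: {value}" for key, value in record.items())  # key: value lines

for label, text in [("pretty JSON", pretty), ("compact JSON", compact), ("key: value", plain)]:
    print(f"{label:<13} {len(text):>3} characters")
```

The quotation marks, braces, and indentation are where JSON spends its extra characters. Measure with the real tokenizer before committing to a format, because the model also needs to parse the data reliably.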

Multilingual Considerations

Non-Latin scripts are typically less efficiently tokenized because the BPE training corpus is disproportionately English. A Chinese sentence of 20 characters might consume 40–60 tokens, while an English sentence of similar meaning might use only 15–20 tokens. This has real cost implications for multilingual applications.
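Part of the gap is visible before any merges are learned. A byte-level BPE tokenizer starts from UTF-8 bytes, and common CJK characters take three bytes each, so each character begins as three base tokens unless the training data contained enough of that script to merge them. The sample sentences below are illustrative choices, not a benchmark.

```python
# Illustrative samples (assumptions): a short greeting in English and in Chinese.
samples = {
    "English": "Hello, how are you?",
    "Chinese": "你好，你好吗？",
}

for name, text in samples.items():
    n_chars = len(text)                  # Unicode code points
    n_bytes = len(text.encode("utf-8"))  # base tokens for a byte-level tokenizer
    print(f"{name}: {n_chars} characters, {n_bytes} bytes, {n_bytes / n_chars:.1f} bytes per character")
```

English text is one byte per character here, while every character in the Chinese sample takes three, which is the starting point that merges then have to compress.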

Practical Tokenization Tips

Armed with an understanding of how tokenization works, here are actionable strategies to optimize your AI workflows:

• Pre-flight your prompts: Always estimate token count before sending expensive API calls. Use a tool like our Token Counter to check that your input + expected output fits within the context window.
• Be concise in system prompts: Your system prompt is sent with every single request. Even small reductions compound into significant cost savings at scale. Aim for the minimum viable instructions.
• Use structured output modes: Many APIs now support JSON mode or function calling, which constrains the model's output format and often produces shorter, more predictable responses.
• Implement conversation summarization: For chatbots, periodically summarize older messages instead of sending the entire conversation history. This keeps the context window fresh without losing important context.
• Choose the right model: Not every task needs GPT-4. For classification, extraction, and simple generation tasks, GPT-3.5 Turbo or Claude Haiku are 10–60x cheaper and often produce comparable results.
• Batch similar requests: If you're processing multiple items, batch them into a single prompt where possible. The shared system prompt overhead is amortized across all items.
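The conversation summarization tip can be sketched as a budget split: keep the most recent messages that fit within a token budget and hand everything older to a summarizer. The summarization call itself is out of scope here, `rough_tokens` uses the 4-characters-per-token heuristic as a stand-in for a real tokenizer, and all names are hypothetical.

```python
import math


def rough_tokens(text: str) -> int:
    """Stand-in token counter (assumption): about 4 characters per token."""
    return math.ceil(len(text) / 4)


def split_history(messages: list[str], budget: int) -> tuple[list[str], list[str]]:
    """Return (older, kept): `kept` holds the newest messages that fit in `budget`."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = rough_tokens(message)
        if used + cost > budget:
            break  # this message and everything older gets summarized
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    older = messages[: len(messages) - len(kept)]
    return older, kept


history = ["a" * 40, "b" * 40, "c" * 40]  # three messages of about 10 tokens each
older, kept = split_history(history, budget=25)
print(f"summarize {len(older)} older message(s), send {len(kept)} verbatim")
```

In a real chatbot the `older` messages would be replaced by a short summary before the next request, so the prompt stays within budget without dropping earlier context entirely.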

The Future of Tokenization

Tokenization is an active area of research. Current limitations — inefficient encoding of non-Latin scripts, sensitivity to typos and casing, and the fixed vocabulary problem — are driving innovation in several directions:

• Byte-level models like ByT5 skip tokenization entirely and process raw bytes, eliminating vocabulary limitations at the cost of longer sequences.
• Dynamic tokenization adapts the vocabulary based on the input domain, using a general-purpose vocabulary for prose but switching to a code-optimized vocabulary for programming tasks.
• Tokenizer-free architectures like MegaByte propose processing fixed-size byte patches, potentially offering the best of both worlds: efficient training and universal language support.

Until any of this ships in production models, BPE and its cousins are what you're actually working with. Knowing how they behave gives you an edge when designing prompts, budgeting API costs, and debugging context window issues — and our Token Counter lets you put that knowledge to work instantly.
