
What Is Tokenization?

How AI models break text into chunks called tokens — and why it matters for cost and limits.

What Is a Token?

Language models don't process text character by character or word by word — they process tokens, which are subword chunks produced by an algorithm called a tokenizer. Common English words like "the", "and", "run" are typically single tokens. Longer or rarer words get split: "tokenization" might be two tokens ("token" + "ization"), and "uncharacteristically" might be four or more. Numbers, punctuation, whitespace, and code all tokenize differently from natural language prose.
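To make the splitting concrete, here is a minimal sketch that breaks words into the longest pieces found in a tiny vocabulary. The vocabulary and the function name greedy_split are invented for illustration, and longest-match splitting is a simplification: real tokenizers use learned vocabularies with tens of thousands of entries and may split the same words differently.

```python
# Toy vocabulary, invented for this example only.
TOY_VOCAB = {"the", "and", "run", "token", "ization",
             "un", "character", "istic", "ally"}


def greedy_split(word, vocab=TOY_VOCAB):
    """Split a word into the longest vocabulary pieces, left to right.

    Falls back to single characters where no vocabulary entry matches.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until one matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary starts here: emit one character.
            pieces.append(word[i])
            i += 1
    return pieces


for w in ["the", "tokenization", "uncharacteristically"]:
    print(w, "->", greedy_split(w))
```

With this toy vocabulary, "the" stays whole, "tokenization" splits into two pieces, and "uncharacteristically" splits into four. A word with no matching pieces at all degrades to one piece per character, which is the worst case for token count.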

Subword Tokenization

Modern models use Byte-Pair Encoding (BPE) or similar algorithms that learn token boundaries from large text corpora. Frequent character sequences become single tokens; rare sequences get split into smaller units. As a result, common English words need few tokens, while technical jargon, non-English languages, and unusual strings need more tokens per character. A 1,000-character Python snippet might tokenize to 300 tokens; a 1,000-character Russian text might tokenize to 600+ tokens, because Cyrillic text is less frequent in the data the tokenizer was trained on and so fewer Cyrillic sequences were merged into single tokens.
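The idea that frequent sequences become single tokens can be sketched with a toy version of BPE training. This is a simplified, character-level illustration with hypothetical function names (merge_pair, learn_bpe_merges, apply_merges). Production tokenizers typically work on bytes, pre-split text with extra rules, and learn far more merges.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word starts as a tuple of characters, with its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the most frequent pair fused.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_merges(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Tiny made-up corpus: "the" is frequent, "zebra" is rare.
training_words = ["the"] * 10 + ["then"] * 3 + ["zebra"]
merges = learn_bpe_merges(training_words, num_merges=2)
print(apply_merges("the", merges))    # frequent word: few tokens
print(apply_merges("zebra", merges))  # rare word: many tokens
```

After two merges the frequent word collapses into one token, while the rare word is still one token per character. That is the same effect, at miniature scale, that makes rare strings more expensive.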

Why Unusual Words Cost More

Because rare or unusual words are broken into more subword pieces, they consume more tokens than simple words of similar length. This has cost implications: API usage is typically billed per token, so processing a document heavy in technical terminology, proper nouns, or non-English content costs proportionally more than processing plain English prose of the same length. Code with long variable names, camelCase identifiers, and lots of punctuation (brackets, semicolons) also tends to be token-dense.
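The cost effect is simple arithmetic, sketched below. The token densities reuse this article's illustrative figures (roughly 300 tokens per 1,000 characters of Python and 600 per 1,000 characters of Russian). The price is a made-up placeholder, not any provider's actual rate, and estimate_cost_usd is a hypothetical helper.

```python
# Hypothetical price, chosen only to make the arithmetic concrete.
PRICE_PER_MILLION_TOKENS_USD = 3.00


def estimate_cost_usd(num_chars, tokens_per_1000_chars,
                      price_per_million=PRICE_PER_MILLION_TOKENS_USD):
    """Estimate input cost from character count and token density."""
    tokens = num_chars * tokens_per_1000_chars / 1000
    return tokens * price_per_million / 1_000_000


# Same amount of text (one million characters), different token density.
python_cost = estimate_cost_usd(1_000_000, tokens_per_1000_chars=300)
russian_cost = estimate_cost_usd(1_000_000, tokens_per_1000_chars=600)

print(f"Python snippet: ${python_cost:.2f}")
print(f"Russian text:   ${russian_cost:.2f}")
print(f"Ratio: {russian_cost / python_cost:.1f}x")
```

Doubling the tokens per character doubles the bill for the same number of characters, whatever the actual price is.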

Token Counting Tools

Before sending large inputs to an API, it's worth checking token counts. OpenAI provides the tiktoken Python library for GPT models. Anthropic provides a token-counting endpoint, exposed in its Python SDK as client.messages.count_tokens(). Some providers also offer web-based token counters. This matters for two reasons: staying under context limits and forecasting API costs accurately before a production workload runs.
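A pre-flight check against a context limit might look like the sketch below. The default counter is only a rough stand-in based on the rule of thumb that 1,000 tokens is about 750 English words. In practice you would pass in a real counter built on your provider's tokenizer. The names rough_token_estimate and fits_context are hypothetical, and the limits shown are placeholders.

```python
import math

# Rule of thumb for English prose; other content can differ a lot.
WORDS_PER_1000_TOKENS = 750


def rough_token_estimate(text):
    """Estimate tokens from the word count (a stand-in for a real tokenizer)."""
    words = len(text.split())
    return math.ceil(words * 1000 / WORDS_PER_1000_TOKENS)


def fits_context(text, context_limit, reserved_for_output=0,
                 count_tokens=rough_token_estimate):
    """Return (fits, token_count) for a prompt against a context limit.

    `count_tokens` is any callable mapping text to a token count, so an
    exact counter can replace the rough default.
    """
    n = count_tokens(text)
    return n + reserved_for_output <= context_limit, n


prompt = "word " * 750  # 750 words of filler text
ok, n = fits_context(prompt, context_limit=8000, reserved_for_output=1000)
print(f"Estimated tokens: {n}, fits: {ok}")
```

Reserving room for the model's reply matters because input and output generally share the same context window. For non-English text or code, the word-based estimate is likely to undercount, which is another reason to prefer an exact counter.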

Example

'Hello world' = 2 tokens
'Uncharacteristically' = 6 tokens
1,000 tokens ≈ 750 words
