How LLMs Actually Work

Tokens, transformers, training, and inference — technically honest without being overwhelming.

The Architecture Every Major Model Shares

GPT-4, Claude, Gemini, Llama — they're all built on the same foundational architecture: the transformer, introduced by Google in the 2017 paper "Attention Is All You Need." Understanding what a transformer does, at a conceptual level, explains most of the quirks and capabilities you observe when working with these models.

Attention: The Core Mechanism

The key innovation in transformers is self-attention. When the model processes a sequence of tokens, each token doesn't just look at itself — it looks at every other token in the context and weighs how relevant each one is. The word "bank" in "river bank" and "bank account" should produce very different internal representations. Attention makes that happen.

Multi-head attention runs this process in parallel across multiple "heads," each learning to capture different kinds of relationships — syntax, coreference, semantic roles. The outputs get concatenated and projected back down. Stack 96 of these transformer blocks on top of each other (as in GPT-4 class models), add feed-forward layers, layer normalization, and residual connections, and you have the basic architecture.

What matters practically: the model has no explicit memory between turns except what's in its context window. Everything it "knows" about your conversation lives in the tokens you've passed it.

Tokenization: Not Characters, Not Words

Models don't read text the way you do. They read tokens — chunks of text that are typically 3-4 characters on average for English. The most common tokenization scheme is Byte Pair Encoding (BPE): start with individual bytes, then iteratively merge the most frequent pairs until you have a fixed vocabulary (GPT-4 uses ~100k tokens, Claude uses a similar range).

Why this matters for you as a developer:

Cost is billed per token, not per character or word. "unbelievable" might be 3 tokens; "API" is 1.
Context windows are measured in tokens. A 200k context window is roughly 150k words of English text, but shrinks for code (more tokens per "word") and expands for repetitive text.
Numbers and rare words tokenize inefficiently. 12345 might be 3 tokens; a word like "Schwarzenegger" might be 4.

You can explore tokenization directly:

```python import tiktoken # OpenAI's tokenizer library, also useful for Claude estimates

enc = tiktoken.get_encoding("cl100k_base") # GPT-4 encoding tokens = enc.encode("Hello, how many tokens is this sentence?") print(f"Token count: {len(tokens)}") print(f"Tokens: {tokens}") # Token count: 9 # Tokens: [9906, 11, 1268, 1690, 11460, 374, 420, 11914, 30] ```

Anthropic's API also has a token counting endpoint — use it before sending large payloads to avoid surprises.

Pre-Training: Where the Knowledge Comes From

Pre-training is the phase where the model learns from a massive corpus — web crawls, books, code, academic papers. The training objective is deceptively simple: predict the next token. Do that billions of times across trillions of tokens, and gradient descent carves internal representations that encode grammar, facts, reasoning patterns, and code syntax into the model's weights.

The result is a base model — capable, but not aligned for conversation. Ask a raw base model a question and it might complete your prompt in strange ways, because it's trained to predict plausible continuations, not to answer helpfully.

Fine-Tuning and RLHF: Making Models Useful

Supervised fine-tuning (SFT) takes the base model and trains it further on curated examples of good conversations — question/answer pairs, instruction-following examples. This teaches the model the format and style of being helpful.

RLHF (Reinforcement Learning from Human Feedback) goes further. Human raters compare pairs of model outputs and rank which is better. A separate reward model is trained on those preferences. Then the main LLM is fine-tuned using RL (typically PPO) to maximize the reward model's score.

What RLHF actually does in practice: - Makes models follow instructions more reliably - Reduces harmful and offensive outputs - Shapes tone and personality (this is why Claude feels different from GPT-4o) - Can reduce raw capability if the reward model is miscalibrated (the "alignment tax")

Anthropic's Constitutional AI approach adds another layer: the model critiques its own outputs against a set of principles before rating them, reducing reliance on human annotation at scale.

Inference: Token by Token

When you send a prompt, the model doesn't produce the full response at once. It generates one token at a time, each time running the full forward pass through all transformer layers, then sampling from the output distribution.

Temperature controls how peaked that distribution is. At temperature 0, you always pick the highest-probability token. At temperature 1, you sample proportionally from the distribution. At temperature 2, you flatten it — more randomness, more surprising outputs.

The generated token is appended to the context, and the process repeats. This is called autoregressive generation — each token is conditioned on everything that came before it, including its own previous outputs.

This is also why models can "hallucinate" — there's no fact-checking step. The model is selecting statistically plausible continuations of text, and sometimes the most statistically plausible thing to say next is wrong.

What This Means for Your Work

Understanding the architecture clarifies several practical points:

Context is stateless per call. The model has no persistent memory unless you implement it.
Longer prompts cost more and can hurt performance. Attention is O(n²) in sequence length — not just a cost issue but a quality one past certain lengths.
Token position matters. Information early or late in a long context is attended to more reliably than information buried in the middle.
Models don't reason step-by-step by default. Chain-of-thought prompting works because it forces the model to produce intermediate tokens that guide subsequent generation.

The transformer architecture is not magic — it's a very large function that maps token sequences to probability distributions over the next token. Knowing that makes you a better API consumer.

Have a follow-up question about this topic?

Ask AI

Key Model Parameters