Tokens, transformers, training, and inference — technically honest without being overwhelming.
GPT-4, Claude, Gemini, Llama — they're all built on the same foundational architecture: the transformer, introduced by Google in the 2017 paper "Attention Is All You Need." Understanding what a transformer does, at a conceptual level, explains most of the quirks and capabilities you observe when working with these models.
The key innovation in transformers is self-attention. When the model processes a sequence of tokens, each token doesn't just look at itself — it looks at every other token in the context and weighs how relevant each one is. The word "bank" in "river bank" and "bank account" should produce very different internal representations. Attention makes that happen.
Multi-head attention runs this process in parallel across multiple "heads," each learning to capture different kinds of relationships — syntax, coreference, semantic roles. The outputs get concatenated and projected back down. Stack 96 of these transformer blocks on top of each other (as in GPT-4 class models), add feed-forward layers, layer normalization, and residual connections, and you have the basic architecture.
What matters practically: the model has no explicit memory between turns except what's in its context window. Everything it "knows" about your conversation lives in the tokens you've passed it.
Models don't read text the way you do. They read tokens — chunks of text that are typically 3-4 characters on average for English. The most common tokenization scheme is Byte Pair Encoding (BPE): start with individual bytes, then iteratively merge the most frequent pairs until you have a fixed vocabulary (GPT-4 uses ~100k tokens, Claude uses a similar range).
Why this matters for you as a developer:
12345 might be 3 tokens; a word like "Schwarzenegger" might be 4.You can explore tokenization directly:
```python import tiktoken # OpenAI's tokenizer library, also useful for Claude estimates
enc = tiktoken.get_encoding("cl100k_base") # GPT-4 encoding tokens = enc.encode("Hello, how many tokens is this sentence?") print(f"Token count: {len(tokens)}") print(f"Tokens: {tokens}") # Token count: 9 # Tokens: [9906, 11, 1268, 1690, 11460, 374, 420, 11914, 30] ```
Anthropic's API also has a token counting endpoint — use it before sending large payloads to avoid surprises.
Pre-training is the phase where the model learns from a massive corpus — web crawls, books, code, academic papers. The training objective is deceptively simple: predict the next token. Do that billions of times across trillions of tokens, and gradient descent carves internal representations that encode grammar, facts, reasoning patterns, and code syntax into the model's weights.
The result is a base model — capable, but not aligned for conversation. Ask a raw base model a question and it might complete your prompt in strange ways, because it's trained to predict plausible continuations, not to answer helpfully.
Supervised fine-tuning (SFT) takes the base model and trains it further on curated examples of good conversations — question/answer pairs, instruction-following examples. This teaches the model the format and style of being helpful.
RLHF (Reinforcement Learning from Human Feedback) goes further. Human raters compare pairs of model outputs and rank which is better. A separate reward model is trained on those preferences. Then the main LLM is fine-tuned using RL (typically PPO) to maximize the reward model's score.
What RLHF actually does in practice: - Makes models follow instructions more reliably - Reduces harmful and offensive outputs - Shapes tone and personality (this is why Claude feels different from GPT-4o) - Can reduce raw capability if the reward model is miscalibrated (the "alignment tax")
Anthropic's Constitutional AI approach adds another layer: the model critiques its own outputs against a set of principles before rating them, reducing reliance on human annotation at scale.
When you send a prompt, the model doesn't produce the full response at once. It generates one token at a time, each time running the full forward pass through all transformer layers, then sampling from the output distribution.
Temperature controls how peaked that distribution is. At temperature 0, you always pick the highest-probability token. At temperature 1, you sample proportionally from the distribution. At temperature 2, you flatten it — more randomness, more surprising outputs.
The generated token is appended to the context, and the process repeats. This is called autoregressive generation — each token is conditioned on everything that came before it, including its own previous outputs.
This is also why models can "hallucinate" — there's no fact-checking step. The model is selecting statistically plausible continuations of text, and sometimes the most statistically plausible thing to say next is wrong.
Understanding the architecture clarifies several practical points:
The transformer architecture is not magic — it's a very large function that maps token sequences to probability distributions over the next token. Knowing that makes you a better API consumer.
Have a follow-up question about this topic?
Ask AI