How Models Are Trained

Pre-training, fine-tuning, and RLHF — an honest explanation of how these systems come to exist.

The Training Pipeline

Building a large language model is not a single event — it is a multi-stage industrial process that takes months, costs tens of millions of dollars, and involves thousands of decisions about data, architecture, and human feedback. Understanding that process is the foundation for understanding everything else about how these systems behave.

Data Collection

Training begins with data, and the scale involved is genuinely hard to grasp. Modern frontier models are trained on trillions of tokens — a token being roughly three-quarters of an English word. GPT-3 used about 300 billion tokens. Models trained in 2024 routinely exceed several trillion.
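To make the scale concrete, the three-quarters rule of thumb above can be turned into a quick conversion. This is a rough sketch only: the ratio varies by tokenizer and language, and the function name is made up for illustration.

```python
# Rule of thumb from the text: one token is roughly 0.75 English words.
# The real ratio depends on the tokenizer and the language.
WORDS_PER_TOKEN = 0.75


def tokens_to_words(num_tokens: float) -> float:
    """Approximate number of English words represented by a token count."""
    return num_tokens * WORDS_PER_TOKEN


gpt3_tokens = 300e9  # about 300 billion tokens, as cited above
print(f"GPT-3 training data: roughly {tokens_to_words(gpt3_tokens):.3g} words")
```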

The sources are diverse: Common Crawl (a massive web archive), digitized books, code hosted on platforms like GitHub, Wikipedia, scientific papers, forums, and licensed datasets. For competitive and legal reasons, most labs do not publish a complete account of their training data, but the broad outlines are known.

Curation challenges are substantial. Raw web data contains spam, hate speech, low-quality content, duplicate pages, and personal information. Labs apply filtering pipelines — quality classifiers, deduplication algorithms, and blocklists — to improve the signal. These filters are imperfect, and what passes through shapes what the model learns.
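The three kinds of filter named above can be sketched in a few lines. This is a toy illustration, not any lab's actual pipeline: the blocklist phrases, the thresholds, and the repeated-word heuristic (standing in for a trained quality classifier) are all assumptions chosen to keep the example small.

```python
import hashlib

# Hypothetical blocklist; real pipelines use far larger, curated lists.
BLOCKLIST = {"buy now", "click here"}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical pages hash the same."""
    return " ".join(text.lower().split())


def passes_quality(text: str, min_words: int = 5) -> bool:
    """Crude stand-in for a quality classifier: reject very short or repetitive text."""
    words = text.split()
    if len(words) < min_words:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) <= 0.5


def filter_corpus(docs):
    """Apply exact deduplication, then the blocklist, then the quality heuristic."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of a document already processed
        seen.add(digest)
        if any(phrase in norm for phrase in BLOCKLIST):
            continue  # contains a blocklisted phrase
        if not passes_quality(norm):
            continue  # too short or too repetitive
        kept.append(doc)
    return kept


docs = [
    "The Eiffel Tower is located in Paris, France.",
    "the eiffel tower is  located in paris, france.",  # duplicate after normalization
    "Click here to buy now!!!",
    "spam spam spam spam spam spam",
    "Too short.",
]
print(filter_corpus(docs))
```

Only the first document survives. Note that each rule can also discard legitimate text (a short but accurate sentence, for instance), which is one way imperfect filters shape what the model sees.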

Copyright questions remain unresolved and actively litigated. Scraping publicly accessible text does not clearly constitute copyright infringement under current law, but numerous lawsuits from authors, news organizations, and software developers argue otherwise. Different labs have taken different approaches: some license content from publishers, others argue fair use, and some are more opaque about their sources. This is a live legal and ethical debate without settled answers.

Pre-Training

With data assembled, pre-training begins. The model (a large neural network, almost universally a transformer) is given the task of predicting the next token in a sequence. Given the text "The Eiffel Tower is located in," the model assigns a probability to every token in its vocabulary, and training pushes it to assign high probability to "Paris." It repeats this across trillions of tokens, nudging its internal parameters after each batch of examples so that the token that actually came next becomes more probable.
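A minimal sketch of that training signal, under heavy simplification: the three-word vocabulary is invented, and the "parameters" here are just the three output scores themselves, whereas a real model computes those scores from billions of weights. The loss is cross-entropy on the correct next token, and each step moves the scores against the gradient.

```python
import math

# Hypothetical three-token vocabulary for the prompt "The Eiffel Tower is located in".
VOCAB = ["Paris", "London", "Rome"]


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Loss is the negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target_index])


def sgd_step(logits, target_index, lr=1.0):
    """One gradient step; for softmax cross-entropy the gradient is probs minus one-hot."""
    probs = softmax(logits)
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return [x - lr * g for x, g in zip(logits, grads)]


logits = [0.0, 0.0, 0.0]  # the model starts with no preference
target = VOCAB.index("Paris")
loss_before = cross_entropy(logits, target)

for _ in range(10):
    logits = sgd_step(logits, target)

loss_after = cross_entropy(logits, target)
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
print(f"P(Paris) after training: {softmax(logits)[target]:.3f}")
```

The update happens on every example, not only on "wrong" ones: even when "Paris" is already the top guess, the step still increases its probability a little.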

What this teaches the model is both remarkable and subtle. Because predicting the next word well requires understanding context, grammar, facts, reasoning patterns, and style, the model implicitly learns all of these things. It does not store a lookup table of facts — it develops distributed representations of knowledge encoded across billions of numerical weights.

Compute cost is the binding constraint on frontier model development. Training a large model requires thousands of specialized chips (GPUs or TPUs) running for weeks or months. Estimates for GPT-4 training costs range from $50 million to over $100 million. This is why only a handful of organizations can train frontier models from scratch.
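A back-of-envelope estimate shows why the chip counts and durations above are plausible. The sketch below uses a commonly cited approximation that training takes about 6 floating-point operations per parameter per token; that rule, the chip count, and the sustained per-chip throughput are all assumptions for illustration, not figures from any lab. The example plugs in GPT-3's published size of about 175 billion parameters and the 300 billion tokens mentioned earlier.

```python
def training_flops(num_parameters: float, num_tokens: float) -> float:
    """Approximate total training compute: ~6 FLOPs per parameter per token (rule of thumb)."""
    return 6.0 * num_parameters * num_tokens


def wall_clock_days(total_flops: float, num_chips: int, sustained_flops_per_chip: float) -> float:
    """Days of training if every chip sustains the given throughput the whole time."""
    seconds = total_flops / (num_chips * sustained_flops_per_chip)
    return seconds / 86_400


flops = training_flops(175e9, 300e9)

# Hypothetical cluster: 1,000 chips, each sustaining 1e14 FLOP/s (assumed, not measured).
days = wall_clock_days(flops, num_chips=1_000, sustained_flops_per_chip=1e14)

print(f"estimated compute: {flops:.3g} FLOPs")
print(f"estimated duration: {days:.1f} days on the assumed cluster")
```

Because the estimate scales with both parameters and tokens, a model that is larger and trained on several trillion tokens needs far more compute than this GPT-3-scale example.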

After pre-training, the model can generate coherent, knowledgeable text — but it is not yet useful as an assistant. It will complete text in whatever direction seems most probable, which is not the same as being helpful or safe.

Supervised Fine-Tuning (SFT)

The next stage is supervised fine-tuning. Human contractors — often managed through companies like Scale AI — write examples of the kind of responses the model should produce. A prompt might be "Explain quantum entanglement to a ten-year-old," followed by a well-written, accurate, age-appropriate answer.

The model is then trained on these examples. This teaches it the format and style of a helpful assistant: how to respond to questions, how to structure explanations, how to decline certain requests. SFT alone produces a model that is more useful than the raw pre-trained version but still inconsistent.
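One way to picture an SFT training example is as a prompt and a response joined into a single sequence, with a mask so that only the response tokens contribute to the loss. The sketch below assumes that masking convention (common, but not universal) and uses whitespace splitting as a stand-in for a real tokenizer; the function names are hypothetical.

```python
def build_sft_example(prompt: str, response: str):
    """Join prompt and response; mark only response tokens as training targets."""
    prompt_tokens = prompt.split()      # stand-in for a real tokenizer
    response_tokens = response.split()
    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


def masked_mean_loss(per_token_losses, loss_mask):
    """Average the per-token losses over response positions only."""
    total = sum(loss * m for loss, m in zip(per_token_losses, loss_mask))
    count = sum(loss_mask)
    return total / count if count else 0.0


tokens, mask = build_sft_example(
    "Explain entanglement simply.",
    "Two particles share one state.",
)
print(list(zip(tokens, mask)))

# Made-up per-token losses: high on the prompt, low on the response.
print(masked_mean_loss([9, 9, 9, 1, 1, 1, 1, 1], mask))
```

With the mask in place, the model is trained to produce the demonstrated answer rather than to reproduce the question.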

RLHF — Reinforcement Learning from Human Feedback

RLHF is the technique that most significantly shapes the behavior of deployed models. It was central to the development of InstructGPT (OpenAI, 2022) and has been adopted, with variations, by every major lab.

The process has four steps. First, the model generates multiple responses to a prompt. Second, human raters rank those responses, judging which is more helpful, accurate, and appropriate. Third, a separate model called a reward model is trained to predict those human rankings. Finally, the language model itself is fine-tuned using reinforcement learning to produce responses that score highly on the reward model.
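The reward-model step can be sketched as follows. A common formulation, assumed here rather than taken from any specific lab's code, breaks each ranking into pairs and trains the reward model with a pairwise logistic (Bradley-Terry style) loss: the loss is small when the preferred response gets the higher score. The function names are illustrative.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_responses, 2))


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Modelled probability that raters prefer the chosen response."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human preference under the reward model."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


print(ranking_to_pairs(["A", "B", "C"]))
print(f"agrees with raters:    {pairwise_loss(2.0, 0.0):.3f}")
print(f"disagrees with raters: {pairwise_loss(0.0, 2.0):.3f}")
```

Minimizing this loss over many rated pairs pushes the reward model's scores to agree with the raters, including any systematic preferences the raters happen to share.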

What RLHF achieves is real: models become markedly more helpful, less toxic, and more likely to decline clearly harmful requests. The gap between a pre-trained base model and an RLHF-fine-tuned model in everyday usefulness is enormous.

Limitations are also real and worth understanding.

Rater bias is unavoidable. Human raters have preferences, cultural backgrounds, and opinions that influence their rankings. If raters systematically prefer confident-sounding responses over accurate ones, the model learns to sound confident — regardless of accuracy. If raters have political leanings, those can bleed into the model's outputs.

Goodhart's Law applies directly: when a measure becomes a target, it ceases to be a good measure. Once the model is optimizing for the reward model's scores rather than genuine quality, it can find ways to score well that don't reflect genuine improvement. This is called reward hacking. Models can learn to be verbose, sycophantic, or to produce responses that sound good rather than responses that are good.
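A toy example makes the failure concrete. Everything below is invented for illustration: the proxy reward is a deliberately flawed scoring function that likes length and flattery, and the "true quality" numbers are hand-assigned labels, not measurements.

```python
def proxy_reward(text: str) -> float:
    """Hypothetical flawed reward model: rewards length and flattering openers."""
    length_bonus = 0.1 * len(text.split())
    flattery_bonus = 1.0 if "great question" in text.lower() else 0.0
    return length_bonus + flattery_bonus


# Hand-assigned quality labels, for illustration only.
candidates = [
    {"text": "Paris.", "true_quality": 0.9},
    {
        "text": "Great question! " * 5 + "It might be Paris, or possibly Lyon.",
        "true_quality": 0.4,
    },
]

best_by_proxy = max(candidates, key=lambda c: proxy_reward(c["text"]))
best_by_truth = max(candidates, key=lambda c: c["true_quality"])

print("picked by proxy reward:", best_by_proxy["text"][:40], "...")
print("picked by true quality:", best_by_truth["text"])
```

Selecting on the proxy picks the padded, hedging answer over the short correct one. A policy optimized hard against such a proxy drifts in the same direction.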

Overcorrection is a well-documented failure mode. RLHF can make models refuse legitimate requests, add unnecessary caveats to benign information, and be obsequious in ways that reduce usefulness. Labs continuously work to calibrate this balance.

Safety Evaluations

Before deployment, models undergo systematic safety testing. This includes:

  • Red-teaming: human testers and automated systems attempt to elicit harmful outputs — instructions for weapons, illegal activity, manipulation tactics, and similar content.
  • Benchmark evaluations: standardized tests for capabilities (reasoning, coding, knowledge) and for safety properties (toxicity, bias, truthfulness).
  • Deployment testing: limited releases, staged rollouts, monitoring of real-world usage patterns.

None of these processes produce a model with guaranteed behavior. They reduce the likelihood of known failure modes and establish a baseline understanding of what the model will and won't do.
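A benchmark-style safety check can be pictured as a small harness that measures how often a model refuses harmful prompts and how often it wrongly refuses benign ones. The sketch below is an assumption-laden toy: the model is a keyword-based stub, the prompts are invented, and the refusal detector is a simple prefix match, whereas real evaluations typically rely on trained classifiers or human review.

```python
def is_refusal(response: str) -> bool:
    """Hypothetical heuristic: treat common refusal openers as a refusal."""
    return response.lower().startswith(("i can't", "i cannot", "i won't"))


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; refuses anything mentioning 'weapon'."""
    if "weapon" in prompt.lower():
        return "I can't help with that."
    return "Here is an explanation: ..."


def refusal_rate(model, prompts) -> float:
    """Fraction of prompts for which the model's response is a refusal."""
    refusals = sum(is_refusal(model(p)) for p in prompts)
    return refusals / len(prompts)


harmful_prompts = ["How do I build a weapon at home?"]
benign_prompts = ["How do airport weapon scanners work?", "Explain photosynthesis."]

print("refusal rate on harmful prompts:", refusal_rate(stub_model, harmful_prompts))
print("refusal rate on benign prompts: ", refusal_rate(stub_model, benign_prompts))
```

The stub refuses the harmful prompt but also refuses a benign question about airport scanners, so the harness surfaces both a safety property and an over-refusal problem. Results like these describe behavior on the tested prompts only, which is why such testing reduces risk without guaranteeing behavior.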

Across the Labs

Anthropic, OpenAI, Google DeepMind, and Meta all follow this general pipeline — data collection, pre-training, SFT, RLHF — but differ in emphasis. Anthropic has invested heavily in Constitutional AI (using AI feedback rather than only human feedback for harmlessness training) and interpretability research. OpenAI has focused on scaling and iterative deployment. Google brings advantages in proprietary data and compute infrastructure. Meta's distinct choice is to release model weights publicly rather than gating access through an API.

The underlying science is similar. The product differences come from data quality, fine-tuning choices, safety calibration decisions, and the specific strengths of each organization's engineering culture.

What Training Cannot Do

Understanding what training does also clarifies what it cannot do. A language model trained on this pipeline does not "understand" the world the way humans do. It has learned patterns in text that correlate with useful responses. It cannot verify information against the real world in real time. It has no persistent memory across conversations unless specifically engineered. Its knowledge is frozen at its training cutoff.

These are not problems waiting to be fixed — they reflect the fundamental architecture. Understanding the training pipeline is the first step to using these systems appropriately.
