Constitutional AI & RLHF

Anthropic's Constitutional AI approach and how RLHF shapes model behavior across providers.

The Problem These Techniques Solve

A pre-trained language model is a powerful text-completion engine. It has learned from vast amounts of human-generated text and can produce fluent, knowledgeable content. But it will also complete a story about a bomb, continue racist text, or fill in the next steps in a phishing script — because those patterns exist in the training data and the base model has no judgment about what it should or should not do.

The techniques described here — RLHF and Constitutional AI — are the primary methods used to transform raw pre-trained models into the assistants that users interact with. They add the layer of judgment that base models lack.

RLHF — The Mechanics

Reinforcement Learning from Human Feedback was the core technique behind InstructGPT (OpenAI, 2022) and underlies most deployed chat models.

The process has three stages.

Stage 1: Demonstration data and supervised fine-tuning. Contractors write examples of good assistant behavior — quality responses to a variety of prompts. The base model is fine-tuned on these examples, producing an initial assistant model.

Stage 2: Reward model training. The fine-tuned model generates multiple responses to the same prompts. Human raters compare those responses and rank them from best to worst. These preference rankings are used to train a separate neural network — the reward model — that learns to predict human preferences. Given any model response, the reward model outputs a score approximating how much a human rater would prefer it.
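The ranking step can be made concrete with the pairwise loss commonly used for reward models, as in the InstructGPT paper: a best-to-worst ranking is broken into (preferred, rejected) pairs, and the reward model is penalized whenever it scores the rejected response close to or above the preferred one. The following is a minimal sketch operating on plain floats; the function names are illustrative, and a real implementation would compute the scores with a neural network and backpropagate through this loss.

```python
import math
from itertools import combinations

def pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected))."""
    margin = score_preferred - score_rejected
    # Two algebraically equal forms, chosen so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def ranking_to_pairs(ranked_items):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_items, 2))

def ranking_loss(ranked_scores):
    """Average pairwise loss over every pair implied by one ranking."""
    pairs = ranking_to_pairs(ranked_scores)
    return sum(pairwise_loss(w, l) for w, l in pairs) / len(pairs)

# Reward-model scores listed in the order the human rater ranked the responses.
agrees_with_rater = ranking_loss([2.0, 0.5, -1.0])
disagrees_with_rater = ranking_loss([-1.0, 0.5, 2.0])
```

The loss is small when the reward model's scores are ordered the same way as the human ranking and large when they are reversed, which is what pushes the reward model toward predicting human preferences.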

Stage 3: Reinforcement learning. The assistant model is then fine-tuned using reinforcement learning (typically Proximal Policy Optimization, PPO) to produce responses that maximize the reward model's scores. In effect, the model learns to produce outputs that the reward model predicts humans would rank highly.
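In published RLHF recipes such as InstructGPT, the quantity maximized in this stage is usually not the raw reward-model score but that score minus a penalty for drifting away from the supervised model from Stage 1. The sketch below shows that shaped reward under simplifying assumptions: per-token log-probabilities are assumed to be available as plain lists, and the function name and the `beta` value are illustrative rather than taken from any particular codebase.

```python
from typing import Sequence

def kl_penalized_reward(
    reward_model_score: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 0.1,  # illustrative penalty strength, not a recommended value
) -> float:
    """Reward-model score minus a KL-style penalty against the reference model.

    The penalty is a single-sample estimate: the sum over generated tokens of
    log pi_policy(token) - log pi_reference(token).
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must cover the same tokens")
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, reference_logprobs))
    return reward_model_score - beta * kl_estimate

# Same reward-model score, but the second response is one the policy has
# become far more confident in than the reference model is.
unchanged = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
drifted = kl_penalized_reward(1.0, [-0.5, -0.5], [-1.0, -1.5])
```

The penalty is one common guard against the policy exploiting quirks of the reward model, since large departures from the reference model cost reward.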

The result is a model that is substantially more helpful, more appropriate in tone, and better at following instructions than the base model.

Where RLHF Breaks Down

Rater inconsistency is unavoidable. Different raters have different preferences, make different judgment calls, and bring different cultural backgrounds. The reward model trained on their rankings captures an average that may not represent any particular user's preferences well.

Reward hacking — the model finding ways to score high on the reward model without actually being better — is a real and documented problem. Models can learn to be verbose because length correlates with higher ratings, to be sycophantic because agreement feels good to raters, or to use confident tone because it signals expertise. These are not improvements in actual quality.

Goodhart's Law states this precisely: when a measure becomes a target, it ceases to be a good measure. Optimizing against the reward model eventually diverges from optimizing for actual quality.
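A toy illustration of the divergence, using entirely made-up numbers: the hypothetical `proxy_reward` below tracks true quality but also carries a small bonus per word, standing in for a reward model that has absorbed raters' mild preference for longer answers. The `true_quality` values are invented for the example; in practice that quantity is exactly what cannot be measured directly.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    true_quality: float  # hypothetical ground truth, not observable in practice

def proxy_reward(candidate: Candidate, length_bias: float = 0.02) -> float:
    """Hypothetical reward model: quality plus a small bonus per word."""
    return candidate.true_quality + length_bias * len(candidate.text.split())

candidates = [
    Candidate("Paris.", 0.9),
    Candidate("The capital of France is Paris.", 1.0),
    Candidate(
        "Great question! " + "It is worth noting that " * 10 + "the answer is Paris.",
        0.6,
    ),
]

# Selecting by the proxy rewards padding; selecting by quality does not.
best_by_proxy = max(candidates, key=proxy_reward)
best_by_quality = max(candidates, key=lambda c: c.true_quality)
```

Here the padded answer wins under the proxy even though it is the worst of the three by the assumed quality score. Optimization pressure applied to the proxy amplifies whatever bias it contains.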

Scalability limits: Human feedback is expensive to collect and limited in volume. Training on purely human feedback hits practical ceilings as models become more capable than the humans rating them.

Constitutional AI — Anthropic's Approach

Constitutional AI (CAI) is Anthropic's attempt to address some of RLHF's limitations, particularly scalability and the difficulty of specifying harmlessness through human feedback alone.

The key innovation is replacing human feedback for harmlessness evaluation with AI feedback guided by explicit principles.

The process works as follows. First, the model is given a set of principles, the "constitution." These are specific, articulable values like "prefer responses that are not harmful" or "prefer responses that respect human dignity," derived from sources including the UN Universal Declaration of Human Rights, Anthropic's own guidelines, and other ethical frameworks.

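Training then proceeds in two phases. In the supervised phase, the model generates a potentially harmful response, critiques that response against the constitution, and then revises it; the revised responses are used as fine-tuning data. In the reinforcement learning phase, called RLAIF (Reinforcement Learning from AI Feedback), an AI model compares pairs of responses against the constitution, and those AI-generated preferences are used to train a preference model that acts as the reward model for harmlessness training.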
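The critique-and-revise step can be sketched as a small function. This is an illustration under stated assumptions, not Anthropic's implementation: `generate` stands for any text-generation callable, the prompt wording is invented, the two principles are the ones quoted earlier in this article, and `stub_generate` is a canned stand-in so the example runs without a model.

```python
import random
from typing import Callable, Dict, List

CONSTITUTION: List[str] = [
    "Prefer responses that are not harmful.",
    "Prefer responses that respect human dignity.",
]

def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical model call
    principles: List[str],
    rng: random.Random,
) -> Dict[str, str]:
    """Draft a response, critique it against one principle, then revise it."""
    draft = generate(prompt)
    principle = rng.choice(principles)
    critique = generate(
        f"Critique the response below against this principle: {principle}\n"
        f"Response: {draft}"
    )
    revision = generate(
        "Rewrite the response so that it addresses the critique.\n"
        f"Response: {draft}\nCritique: {critique}"
    )
    # The (prompt, revision) pair is what becomes fine-tuning data.
    return {"prompt": prompt, "principle": principle, "response": revision}

def stub_generate(text: str) -> str:
    """Canned stand-in for a model, keyed on the instruction prefix."""
    if text.startswith("Critique"):
        return "The draft gives unsafe detail."
    if text.startswith("Rewrite"):
        return "I can't help with that, but here is a safer alternative."
    return "Sure, here is how to do the unsafe thing."

example = critique_and_revise(
    "How do I pick a lock?", stub_generate, CONSTITUTION, random.Random(0)
)
```

Only the revised response is kept for training; the draft and critique are intermediate scaffolding.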

For helpfulness, human feedback still plays a central role. The insight is that harmlessness is particularly well-suited to AI feedback because many harmful outputs are recognizable violations of articulable principles — you do not necessarily need a human to identify that a response provides instructions for creating a weapon.

Advantages

Scalability: AI feedback is cheap to generate compared to human labeling, allowing more training signal.

Transparency: The constitution is explicit and can be examined and debated. This is more legible than "whatever human raters preferred."

Consistency: An AI model applying a constitution is more consistent across similar cases than a pool of human raters.

Limitations

The constitution reflects choices made by Anthropic researchers. Those choices are human, fallible, and culturally situated. Making the principles explicit is progress, but it does not eliminate the value judgments embedded in them.

CAI also does not fully replace human feedback — helpfulness training still depends heavily on human preferences. And AI feedback can have its own systematic errors if the model used to provide feedback has its own biases.

OpenAI's Approach

OpenAI's training approach for GPT-3.5, GPT-4, and subsequent models is heavily RLHF-based, with human raters playing a central role. OpenAI works with Scale AI and its own internal team to generate comparison data at scale. The specifics of their exact fine-tuning procedure are not published in full, but the InstructGPT paper (2022) describes the methodology that underlies ChatGPT.

OpenAI has also invested in process reward models — teaching models to evaluate intermediate reasoning steps rather than just final outputs — as part of their work on o1 and reasoning-focused models.
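The difference between scoring outcomes and scoring process can be shown with a toy example. The step scores below are invented, and aggregating them by taking their product is one simple choice assumed here for illustration, not a description of how any particular lab combines step-level scores.

```python
from math import prod
from typing import Sequence

def outcome_reward(final_answer: str, expected_answer: str) -> float:
    """Scores only the final answer: 1.0 if it matches, else 0.0."""
    return 1.0 if final_answer == expected_answer else 0.0

def process_reward(step_scores: Sequence[float]) -> float:
    """Aggregates per-step correctness scores (each in [0, 1]) by product."""
    return prod(step_scores)

# A solution that reaches the right answer through one badly flawed step.
step_scores = [0.9, 0.2, 0.95]  # hypothetical per-step scores
by_outcome = outcome_reward("42", "42")
by_process = process_reward(step_scores)
```

The outcome score cannot distinguish this solution from a fully sound one, while the step-level score is pulled down sharply by the single weak step.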

The Overcorrection Problem

One of the most practically significant failure modes of both RLHF and CAI is overcorrection: producing models that refuse too much, add excessive caveats, or are condescending about users' ability to handle information.

A model trained heavily on "refuse harmful content" can generalize too broadly, refusing to discuss historical atrocities in educational contexts, declining to help with fiction involving conflict, or adding unnecessary warnings to questions about completely legal activities.

This is not a solved problem. Labs continuously work to calibrate the tradeoff between safety and usefulness, and users sometimes encounter both failure modes — models that refuse legitimate requests and models that assist with harmful ones — depending on how well-calibrated the specific version is.
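The two failure modes can be pictured as the two error types of a single threshold. In the sketch below, every request has a hypothetical harm score and a ground-truth label, all invented for illustration; real systems do not reduce to one scalar threshold, but the tradeoff has the same shape.

```python
from typing import List, Tuple

def refusal_errors(
    requests: List[Tuple[float, bool]], threshold: float
) -> Tuple[int, int]:
    """Return (over_refusals, under_refusals) for a given refusal threshold.

    Each request is (harm_score, is_actually_harmful); the model refuses
    whenever harm_score >= threshold.
    """
    over = sum(1 for score, harmful in requests if score >= threshold and not harmful)
    under = sum(1 for score, harmful in requests if score < threshold and harmful)
    return over, under

requests = [
    (0.05, False),  # cooking question
    (0.40, False),  # educational question about a historical atrocity
    (0.65, False),  # violent scene in a novel
    (0.60, True),   # phishing request phrased innocuously
    (0.95, True),   # explicit weapon instructions
]

cautious = refusal_errors(requests, threshold=0.3)
permissive = refusal_errors(requests, threshold=0.7)
```

Because the benign fiction request scores higher than the disguised phishing request, no threshold in this toy data eliminates both kinds of error. Moving the threshold only trades one for the other, and reducing both requires a better score rather than a better threshold.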

The goal is a model that is genuinely helpful to the vast majority of legitimate uses while declining to assist with the small fraction of genuinely harmful ones. Threading that needle precisely is one of the central ongoing challenges in applied AI development.

Why This Matters

Understanding these techniques matters for several reasons. It explains why models from different labs have different personalities and refusal patterns — they have been trained with different constitutions, different rater pools, and different calibration choices. It explains why model behavior can change between versions even when capabilities are similar. And it explains why both over-refusal and under-refusal exist: they are both failure modes of the same optimization process, just in opposite directions.

These are not simple software bugs. They are consequences of optimizing complex human value judgments in settings with imperfect feedback. Getting this right is genuinely hard, and progress is real but incremental.
