Evaluating AI Outputs: The Basics

Evaluation is the discipline that separates teams shipping reliable AI products from teams shipping AI products that only work in demos. Most teams under-invest in it.

Why Evaluation Is Hard

For traditional software, correctness is largely binary: a test passes or it fails. For language model outputs, correctness is rarely binary. A customer support response can be accurate and helpful and still fail to match your brand tone. A code suggestion can be syntactically correct and logically wrong.

Compounding this, models sound confident whether or not they're right. Fluent, assured phrasing in an output is not a reliable signal of accuracy.

Three Evaluation Approaches

Human evaluation is the gold standard. Trained reviewers assess outputs against defined criteria and surface nuanced failures that automated methods miss. The downside: it is slow and doesn't scale to thousands of test cases.

Automated metrics like BLEU and ROUGE measure n-gram overlap between model output and a reference. BLEU was designed for machine translation and ROUGE for summarization; for open-ended generation, both correlate poorly with actual quality. A response can have a low ROUGE score and be excellent, for example when it paraphrases the reference in different words. Use them only when a reference output exists and similarity to it is meaningful.
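
To make the paraphrase problem concrete, here is a minimal sketch of a unigram-overlap F1 score. It is a simplified stand-in in the spirit of ROUGE-1, not the official ROUGE implementation, and the function names and example sentences are illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and keep simple word tokens; real metrics use richer tokenizers.
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1: harmonic mean of unigram precision and recall."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "You can reset your password from the account settings page."
close_copy = "You can reset your password from the settings page."
paraphrase = "Head to Settings, then Account, and choose Change Password."

print(f"close copy: {unigram_f1(close_copy, reference):.2f}")
print(f"paraphrase: {unigram_f1(paraphrase, reference):.2f}")
```

The paraphrase may be just as useful to a user, yet it scores far lower than the near copy because it shares few words with the reference.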

LLM-as-judge has become the dominant practical approach. Send a model's output, plus context and a rubric, to a stronger model (such as GPT-4o or Claude Sonnet) and ask it to score the response. It scales, runs automatically, and captures semantic quality better than n-gram metrics. Limitation: judge models tend to prefer longer responses and responses written in their own style. Calibrate by comparing judge scores to human ratings on a sample.
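
A rough sketch of the judge loop and the calibration step might look like the following. Everything here is an assumption for illustration: the rubric wording, the `SCORE:` reply format, and the `call_model` callable, which stands in for whatever API client you use. No real provider API is shown.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric; replace with your own definition of "good".
RUBRIC = """Score the response from 1 (poor) to 5 (excellent) on:
- accuracy: is every factual claim correct?
- helpfulness: does it resolve the user's request?
Do not reward length for its own sake."""


def build_judge_prompt(question: str, response: str, rubric: str = RUBRIC) -> str:
    return (
        f"{rubric}\n\n"
        f"Question:\n{question}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        "Reply with a line of the form 'SCORE: <1-5>' and a one-sentence reason."
    )


def parse_score(judge_reply: str) -> Optional[int]:
    # Return None when the judge did not follow the requested format.
    match = re.search(r"SCORE:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def judge(question: str, response: str, call_model: Callable[[str], str]) -> Optional[int]:
    return parse_score(call_model(build_judge_prompt(question, response)))


def agreement_rate(judge_scores, human_scores, tolerance: int = 1) -> float:
    """Share of examples where the judge is within `tolerance` points of the human."""
    pairs = [(j, h) for j, h in zip(judge_scores, human_scores) if j is not None]
    if not pairs:
        return 0.0
    return sum(abs(j - h) <= tolerance for j, h in pairs) / len(pairs)


def fake_judge_model(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return "SCORE: 4\nAccurate and mostly complete."


print(judge("How do I reset my password?", "Go to Settings.", fake_judge_model))
print(agreement_rate([4, 2, 5], [5, 4, 5]))
```

If agreement with human ratings is low on your sample, revise the rubric or the judge prompt before trusting the judge at scale.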

Define "Good" Before You Build

Before writing a single line of model-calling code, write down:

- What does a correct response look like? What does failure look like?
- Are there hard constraints (things the model must never do or must always include)?
- How do you weight different failure modes?

This definition becomes your rubric. Without it, evaluation is subjective and results aren't comparable across iterations.
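
One way to make such a rubric comparable across iterations is to write it down as data. The sketch below is one possible shape, not a standard: the `Rubric` and `Criterion` classes, the weights, and the support-desk checks are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Criterion:
    name: str
    weight: float                   # relative importance of this criterion
    check: Callable[[str], float]   # returns a score in [0, 1]


@dataclass
class Rubric:
    # Hard constraints: any violation fails the response outright.
    hard_constraints: list[Callable[[str], bool]] = field(default_factory=list)
    criteria: list[Criterion] = field(default_factory=list)

    def score(self, response: str) -> float:
        if not all(ok(response) for ok in self.hard_constraints):
            return 0.0
        total_weight = sum(c.weight for c in self.criteria)
        if total_weight == 0:
            return 1.0  # constraints passed and nothing else to grade
        weighted = sum(c.weight * c.check(response) for c in self.criteria)
        return weighted / total_weight


# Hypothetical rubric for a support reply.
support_rubric = Rubric(
    hard_constraints=[lambda r: "guarantee" not in r.lower()],
    criteria=[
        Criterion("mentions_refund_window", 2.0, lambda r: 1.0 if "30 days" in r else 0.0),
        Criterion("concise", 1.0, lambda r: 1.0 if len(r.split()) <= 80 else 0.0),
    ],
)

print(support_rubric.score("Refunds are available within 30 days."))
print(support_rubric.score("We guarantee a refund within 30 days."))
```

Keeping hard constraints separate from weighted criteria means a forbidden phrase cannot be averaged away by strong scores elsewhere.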

The Evaluation Dataset

Your eval set should be held-out — examples the model has not been trained or prompted on. For fine-tuned models, reserve 10–20% before training.

The eval set should cover: typical use cases, edge cases (unusual but valid inputs), and known failure modes.
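
As a small illustration, the sketch below reserves a held-out slice with a fixed seed and then checks that the slice covers all three kinds of example. The `category` field, the category names, and the 15% default are assumptions chosen for the example.

```python
import random

REQUIRED_CATEGORIES = {"typical", "edge_case", "known_failure"}


def split_holdout(examples, holdout_fraction=0.15, seed=0):
    """Return (train, held_out) using a seeded shuffle so the split is reproducible."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def missing_categories(eval_set):
    """Categories from REQUIRED_CATEGORIES that the eval set does not cover."""
    present = {example["category"] for example in eval_set}
    return REQUIRED_CATEGORIES - present


examples = [
    {"id": i, "category": ["typical", "edge_case", "known_failure"][i % 3]}
    for i in range(20)
]
train, held_out = split_holdout(examples)
print(len(train), len(held_out))
print(missing_categories(held_out))
```

A purely random split can miss a category by chance, which is why the coverage check runs on the held-out slice rather than on the full pool. If it reports gaps, add examples by hand or split within each category.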

Online vs. Offline Evaluation

Offline evaluation runs against a fixed test set before deployment. Fast, reproducible, catches known failures.
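
A bare-bones offline harness could look like this. The test-case fields and the `generate` and `passes` callables are placeholders for your own model call and rubric check.

```python
def run_offline_eval(test_cases, generate, passes):
    """Run every fixed test case and report the pass rate plus failing outputs."""
    failures = []
    for case in test_cases:
        output = generate(case["input"])
        if not passes(output, case):
            failures.append({"id": case["id"], "output": output})
    pass_rate = 1 - len(failures) / len(test_cases) if test_cases else 0.0
    return {"pass_rate": pass_rate, "failures": failures}


# Toy stand-ins: the "model" upper-cases its input, and a case passes on exact match.
test_cases = [
    {"id": "t1", "input": "hello", "expected": "HELLO"},
    {"id": "t2", "input": "bye", "expected": "BYE!"},
]
report = run_offline_eval(
    test_cases,
    generate=str.upper,
    passes=lambda output, case: output == case["expected"],
)
print(report)
```

Because the test set is fixed, two runs with different prompts or models produce directly comparable pass rates.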

Online evaluation monitors production traffic — sampling live outputs and scoring them. Catches failures that emerge from real user behavior you didn't anticipate.
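
One simple way to pick which live outputs to score is to hash a request identifier, so the same request is always in or out of the sample. This is a sketch under assumptions: `request_id` and the 5% rate are illustrative, and real systems may sample by user, session, or feature instead.

```python
import hashlib


def should_sample(request_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of requests for scoring."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


sampled = [rid for rid in (f"req-{i}" for i in range(10_000)) if should_sample(rid, 0.05)]
print(len(sampled))
```

Sampled outputs can then go through the same rubric or judge used offline, which keeps online and offline scores on one scale.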

A mature practice uses both: offline to validate before shipping, online to catch what offline missed.
