Building an Eval Suite

Knowing that evaluation matters is different from having a working eval suite. Here's how to build one practically.

Collecting Test Cases

Start with 50 to 200 examples. Fewer than 50 gives noisy signal; more than 200 is unnecessary to start. Three categories:

Typical inputs — The common 80% of what users will actually send. Validate the model handles its core job correctly.

Edge cases — Unusual but valid inputs: very long messages, ambiguous requests, inputs in different languages, inputs touching boundary conditions.

Known failure modes — If you've already shipped, use examples from actual failures or user complaints. Regressions against known failures are the most important tests to run.
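One way to keep the three categories visible is to tag each test case when you store it. The sketch below assumes a JSONL layout (one JSON object per line) with hypothetical field names `id`, `category`, `input`, and `expected`; none of these names come from a particular tool, so adapt them to whatever your eval runner expects.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalCase:
    id: str
    category: str  # assumed labels: "typical", "edge", "known_failure"
    input: str
    expected: str


def parse_cases(jsonl_text: str) -> list[EvalCase]:
    """Parse one JSON object per non-empty line into EvalCase records."""
    cases = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if line:
            cases.append(EvalCase(**json.loads(line)))
    return cases


# Illustrative cases only; real suites would hold 50 to 200 of these.
SAMPLE = """
{"id": "t1", "category": "typical", "input": "Where is my order?", "expected": "order status"}
{"id": "e1", "category": "edge", "input": "Wo ist meine Bestellung?", "expected": "order status"}
{"id": "f1", "category": "known_failure", "input": "Cancel it. Actually, don't.", "expected": "no cancellation"}
"""

cases = parse_cases(SAMPLE)
counts = Counter(case.category for case in cases)
print(dict(counts))
```

Counting by category makes it easy to notice when a suite has drifted toward only typical inputs.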

Defining Pass/Fail Criteria

For each test case, define what correct looks like:

- Exact match: Output must contain a specific string. Good for structured outputs like JSON fields.
- Rubric-based scoring: A judge model evaluates on a 1–5 scale across dimensions (accuracy, tone, completeness).
- Binary classifier: A model or function answers yes/no to a specific question about the output.
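The criteria above can each be reduced to a function that returns pass or fail. This is a minimal sketch with hypothetical function names: the string check is implemented as a substring test, the JSON check is one example of a yes/no function, and the rubric check assumes a judge model has already returned 1–5 scores per dimension (the judge call itself is not shown). The pass threshold of 4 is an assumption, not a recommendation from the text.

```python
import json


def contains_match(output: str, expected: str) -> bool:
    """Pass if the expected string appears in the output (case-sensitive)."""
    return expected in output


def json_field_equals(output: str, field: str, expected) -> bool:
    """Pass if the output parses as a JSON object and output[field] == expected."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get(field) == expected


def rubric_passes(scores: dict[str, int], threshold: int = 4) -> bool:
    """Pass if there is at least one dimension and all meet the threshold.

    `scores` maps a dimension name (e.g. "accuracy") to a 1-5 score that
    would come from a judge model; here it is passed in directly.
    """
    return bool(scores) and all(score >= threshold for score in scores.values())


output = '{"status": "shipped", "eta_days": 2}'
print(contains_match(output, '"status"'))
print(json_field_equals(output, "status", "shipped"))
print(rubric_passes({"accuracy": 5, "tone": 3, "completeness": 4}))
```

Keeping every criterion as a plain function means the same checks can run in a script, a notebook, or CI without changes.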

Write criteria before you run your first eval — criteria written after seeing outputs are unconsciously biased.

Automating the Eval Script

Basic structure:

1. Load test cases from a file
2. Send each input to the model
3. Score each output against criteria
4. Aggregate scores, output a summary
5. Store results with a timestamp and model/prompt version
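The five steps can be sketched in a few dozen lines. Everything here is illustrative: `load_cases`, `run_eval`, `save_report`, and `fake_model` are hypothetical names, the scoring is a simple substring check, and the model call is a stand-in function so the example runs without network access. In a real script you would replace `fake_model` with a call to your provider's client.

```python
import json
from datetime import datetime, timezone
from typing import Callable


def load_cases(path: str) -> list[dict]:
    """Step 1: read one JSON test case per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def run_eval(cases: list[dict], call_model: Callable[[str], str],
             prompt_version: str, model_name: str) -> dict:
    """Steps 2-4: call the model, score each output, aggregate."""
    results = []
    for case in cases:
        output = call_model(case["input"])      # step 2: send input
        passed = case["expected"] in output     # step 3: substring scoring
        results.append({"id": case["id"], "passed": passed, "output": output})
    pass_count = sum(r["passed"] for r in results)
    return {                                    # step 4: summary
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model_name,
        "prompt_version": prompt_version,
        "pass_rate": pass_count / len(results) if results else 0.0,
        "results": results,
    }


def save_report(report: dict, path: str) -> None:
    """Step 5: persist results with timestamp and version info."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2)


def fake_model(text: str) -> str:
    """Stand-in for a real model call (assumption for this sketch)."""
    return text.upper()


demo_cases = [
    {"id": "t1", "input": "hello", "expected": "HELLO"},
    {"id": "t2", "input": "bye", "expected": "goodbye"},
]
report = run_eval(demo_cases, fake_model, prompt_version="v1", model_name="fake-model")
print(report["pass_rate"])
```

Passing the model call in as a function keeps the script testable and makes it easy to compare several models with the same cases.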

Keep the script in your repository and treat it like a test suite — run it on every significant prompt change.

Tooling

PromptFoo (open source) — Fastest way to start. Define test cases in YAML, specify multiple models to compare, run with one command. Supports LLM-as-judge and generates comparison reports.

Braintrust — Hosted platform with a UI for reviewing results, tracking scores over time, and collaborating with non-technical stakeholders.

LangSmith — Integrates tightly with LangChain. If you're already using LangChain, it's the natural fit.

The Eval Feedback Loop

Improve prompt or model → run evals → compare scores to baseline → decide whether to ship.

Version your prompts and store eval results alongside each version. A score of 87% doesn't mean anything unless you know last week's version scored 84%.
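A headline score hides which cases changed, so it can help to compare runs case by case. The sketch below assumes each stored run can be reduced to a mapping from test case id to pass/fail; the function name and the returned keys are hypothetical.

```python
def compare_to_baseline(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare two runs keyed by test case id.

    Only ids present in both runs are checked for regressions and fixes.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    regressions = [cid for cid in shared if baseline[cid] and not candidate[cid]]
    fixes = [cid for cid in shared if not baseline[cid] and candidate[cid]]

    def pass_rate(run: dict[str, bool]) -> float:
        return sum(run.values()) / len(run) if run else 0.0

    return {
        "baseline_rate": pass_rate(baseline),
        "candidate_rate": pass_rate(candidate),
        "regressions": regressions,  # passed before, fail now
        "fixes": fixes,              # failed before, pass now
    }


last_week = {"t1": True, "t2": False, "f1": True}
this_week = {"t1": True, "t2": True, "f1": False}
diff = compare_to_baseline(last_week, this_week)
print(diff["regressions"], diff["fixes"])
```

In this example both runs have the same pass rate, yet `f1` regressed. Listing regressions explicitly is what lets you decide whether a change is safe to ship.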

Cost of Running Evals

With a fast, cheap model — Claude Haiku 4.5 or GPT-4o mini — a 200-case eval using LLM-as-judge typically costs under $1. There is no excuse for skipping evaluation on cost grounds.
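To check the estimate for your own suite, multiply token counts by your provider's per-token prices. The sketch below is back-of-the-envelope arithmetic under stated assumptions: two model calls per case (one to generate, one to judge), and token counts and prices that are placeholders rather than quoted rates. Substitute the current prices from your provider's pricing page.

```python
def estimate_eval_cost(num_cases: int,
                       input_tokens_per_call: int,
                       output_tokens_per_call: int,
                       usd_per_million_input: float,
                       usd_per_million_output: float,
                       calls_per_case: int = 2) -> float:
    """Estimate total USD cost of one eval run.

    calls_per_case defaults to 2: one generation call plus one judge call.
    """
    total_calls = num_cases * calls_per_case
    input_cost = total_calls * input_tokens_per_call / 1_000_000 * usd_per_million_input
    output_cost = total_calls * output_tokens_per_call / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


# Placeholder numbers for illustration only, not real price quotes.
cost = estimate_eval_cost(
    num_cases=200,
    input_tokens_per_call=500,
    output_tokens_per_call=200,
    usd_per_million_input=1.0,
    usd_per_million_output=5.0,
)
print(f"${cost:.2f}")  # prints "$0.60"
```

With these placeholder inputs the run uses 200,000 input tokens and 80,000 output tokens, which lands under the $1 figure above. Longer prompts or a more expensive judge model would raise the total proportionally.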
