Knowing that evaluation matters is different from having a working eval suite. Here's how to build one practically.
Start with 50 to 200 examples. Fewer than 50 gives a noisy signal; more than 200 is unnecessary at the start. Draw them from three categories:
Typical inputs — The common 80% of what users will actually send. Validate the model handles its core job correctly.
Edge cases — Unusual but valid inputs: very long messages, ambiguous requests, inputs in different languages, inputs touching boundary conditions.
Known failure modes — If you've already shipped, use examples from actual failures or user complaints. Regressions against known failures are the most important tests to run.
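One way to keep the three categories visible is to tag each test case with its category in the file itself. The sketch below assumes a JSONL layout with `id`, `category`, and `input` fields; the field names, category labels, and sample inputs are invented for illustration and are not a required format.

```python
import json
from collections import Counter
from dataclasses import dataclass

# Illustrative labels for the three groups described above (an assumption).
CATEGORIES = ("typical", "edge_case", "known_failure")


@dataclass
class EvalCase:
    id: str
    category: str
    input: str


def parse_cases(jsonl_text: str) -> list[EvalCase]:
    """Parse one JSON object per line and reject unknown categories."""
    cases = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if record["category"] not in CATEGORIES:
            raise ValueError(f"unknown category: {record['category']!r}")
        cases.append(EvalCase(record["id"], record["category"], record["input"]))
    return cases


# Made-up sample data, one case per category.
SAMPLE_JSONL = """
{"id": "t-001", "category": "typical", "input": "Where is my order?"}
{"id": "e-001", "category": "edge_case", "input": "¿Dónde está mi pedido?"}
{"id": "f-001", "category": "known_failure", "input": "Cancel it. Actually, don't."}
"""

cases = parse_cases(SAMPLE_JSONL)
counts = Counter(case.category for case in cases)
print(dict(counts))  # how many cases each category has
```

Printing the per-category counts makes it easy to spot a test set that has drifted toward only typical inputs.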
For each test case, define what correct looks like:
- Exact match: Output must contain a specific string. Good for structured outputs like JSON fields.
- Rubric-based scoring: A judge model evaluates on a 1–5 scale across dimensions (accuracy, tone, completeness).
- Binary classifier: A model or function answers yes/no to a specific question about the output.
Write criteria before you run your first eval — criteria written after seeing outputs are unconsciously biased.
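The three kinds of criteria can be sketched as small scoring functions. This is a minimal illustration, not a standard API: the function names are hypothetical, the binary check uses "is the output valid JSON?" as an example question, and the rubric function only averages ratings that a judge model is assumed to have already returned.

```python
import json


def exact_match(output: str, required: str) -> bool:
    """Pass if the output contains the required string."""
    return required in output


def is_valid_json(output: str) -> bool:
    """Binary check: answers yes/no to 'does the output parse as JSON?'"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def rubric_score(ratings: dict[str, int]) -> float:
    """Average 1-5 ratings across dimensions; ratings come from a judge model."""
    if not ratings:
        raise ValueError("at least one rubric dimension is required")
    for dimension, rating in ratings.items():
        if not 1 <= rating <= 5:
            raise ValueError(f"{dimension} rating {rating} is outside 1-5")
    return sum(ratings.values()) / len(ratings)


output = '{"status": "ok"}'
print(exact_match(output, '"status": "ok"'))
print(is_valid_json(output))
print(rubric_score({"accuracy": 4, "tone": 5, "completeness": 3}))
```

Keeping each criterion as a plain function means the expected behavior is fixed in code before any model output is seen.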
Basic structure:
1. Load test cases from a file
2. Send each input to the model
3. Score each output against criteria
4. Aggregate scores, output a summary
5. Store results with a timestamp and model/prompt version
Keep the script in your repository and treat it like a test suite — run it on every significant prompt change.
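The five steps might look like the sketch below. Several things here are assumptions: `call_model` is a placeholder to be replaced with a real API call, test cases are assumed to be JSONL with `id`, `input`, and `must_contain` fields, and scoring is limited to a substring check to keep the example short.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def call_model(prompt: str) -> str:
    """Hypothetical placeholder: swap in your real model API call."""
    return f"echo: {prompt}"


def load_cases(path: str) -> list[dict]:
    # 1. Load test cases from a JSONL file (one JSON object per line).
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def run_eval(cases, model=call_model, prompt_version="v1", model_name="stub"):
    results = []
    for case in cases:
        output = model(case["input"])            # 2. Send each input to the model.
        passed = case["must_contain"] in output  # 3. Score against the criterion.
        results.append({"id": case["id"], "passed": passed})
    # 4. Aggregate into a single pass rate (0.0 if there are no cases).
    pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_version": prompt_version,
        "model": model_name,
        "pass_rate": pass_rate,
        "results": results,
    }


def store_run(run: dict, path: str) -> None:
    # 5. Append the run, with timestamp and versions, to a JSONL history file.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(run) + "\n")
```

A typical invocation would be `store_run(run_eval(load_cases("cases.jsonl"), prompt_version="v2"), "runs.jsonl")`, where both file names are examples.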
PromptFoo (open source) — Fastest way to start. Define test cases in YAML, specify multiple models to compare, run with one command. Supports LLM-as-judge and generates comparison reports.
Braintrust — Hosted platform with a UI for reviewing results, tracking scores over time, and collaborating with non-technical stakeholders.
LangSmith — Integrates tightly with LangChain. If you're already using LangChain, it is the natural fit.
Improve prompt or model → run evals → compare scores to baseline → decide whether to ship.
Version your prompts and store eval results alongside each version. A score of 87% doesn't mean anything unless you know last week's version scored 84%.
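The baseline comparison can be a few lines of code. The sketch below reuses the 84% and 87% figures from the text; the `tolerance` threshold of two points and the verdict labels are assumptions chosen for illustration, and a sensible threshold depends on how noisy your test set is.

```python
def compare_to_baseline(current: float, baseline: float, tolerance: float = 0.02) -> dict:
    """Classify a score change; differences within `tolerance` count as noise."""
    delta = current - baseline
    if delta > tolerance:
        verdict = "improved"
    elif delta < -tolerance:
        verdict = "regressed"
    else:
        verdict = "no clear change"
    return {"delta": round(delta, 4), "verdict": verdict}


# Stored results, one entry per prompt version (oldest first).
history = [
    {"prompt_version": "v1", "pass_rate": 0.84},
    {"prompt_version": "v2", "pass_rate": 0.87},
]

baseline, current = history[-2], history[-1]
report = compare_to_baseline(current["pass_rate"], baseline["pass_rate"])
print(current["prompt_version"], report)
```

The ship decision itself stays with a person; the function only makes the comparison explicit and repeatable.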
With a fast, cheap model — Claude Haiku 4.5 or GPT-4o mini — a 200-case eval using LLM-as-judge typically costs under $1. There is no excuse for skipping evaluation on cost grounds.
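To check the cost claim against your own workload, a back-of-the-envelope estimate is enough. Every number in the sketch below is a placeholder assumption, not a published price: substitute your provider's current per-token rates and your own token counts. It assumes two model calls per case, one to generate the output and one for the judge.

```python
def estimate_eval_cost(
    n_cases: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
    calls_per_case: int = 2,  # assumption: one generation call plus one judge call
) -> float:
    """Estimate total USD cost of one eval run from token counts and rates."""
    calls = n_cases * calls_per_case
    input_cost = calls * input_tokens_per_call / 1_000_000 * usd_per_million_input
    output_cost = calls * output_tokens_per_call / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


# Placeholder rates and token counts, chosen only to illustrate the arithmetic.
cost = estimate_eval_cost(
    n_cases=200,
    input_tokens_per_call=800,
    output_tokens_per_call=200,
    usd_per_million_input=1.0,
    usd_per_million_output=5.0,
)
print(f"${cost:.2f}")  # prints $0.72 for these placeholder inputs
```

With these placeholder inputs the run uses 320,000 input tokens and 80,000 output tokens, so the estimate is $0.32 + $0.40 = $0.72.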