Model Benchmarks: What They Mean

How to read benchmark scores honestly — what MMLU, HumanEval, and others actually measure.

Benchmarks Are Useful. They're Also Gamed.

When a new model drops and the blog post shows it beating GPT-4 on MMLU and HumanEval, that tells you something. It doesn't tell you whether it's better for your specific use case. Understanding what each benchmark actually measures — and what it doesn't — is essential for making real model selection decisions.

MMLU — Massive Multitask Language Understanding

What it tests: 57 academic subjects from elementary math to professional law and medicine. Multiple choice questions drawn from exam materials. Measures breadth of knowledge across domains.

What it actually tells you: Whether the model has absorbed factual information across a wide range of domains. A high MMLU score means the model has seen a lot of textbook content and can answer multiple-choice questions about it.

What it doesn't tell you: Whether the model can reason, write, code, follow instructions, or do anything practical. MMLU is a knowledge retrieval proxy, not a general capability benchmark. It also has a contamination problem — the test questions are public and have been on the internet for years. Models trained on large web crawls have very likely seen many of these questions.

Current frontier scores: Top models (GPT-4o, Claude, Gemini Ultra) are all above 85%. The benchmark is now near-saturated at the frontier: a gap of a point or two between top models is too small to separate real capability differences from evaluation variance.

HumanEval and SWE-bench — Coding Ability

HumanEval: 164 Python programming problems, each given as a function signature and docstring. The model writes the function body; automated tests check correctness. Introduced by OpenAI, now very widely reported.
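To make the mechanics concrete, here is a minimal sketch of a HumanEval-style check. The problem, the completion, and the `passes` helper are all hypothetical stand-ins, not the official harness. A real harness runs model output in a sandbox with timeouts, because executing untrusted code with `exec` is unsafe.

```python
# Hypothetical HumanEval-style problem: a prompt (signature + docstring),
# a model-written function body, and unit tests the model never sees.
PROMPT = '''def add(a, b):
    """Return the sum of a and b."""
'''

COMPLETION = "    return a + b\n"  # stands in for the model's output

TEST = '''def check(candidate):
    assert candidate(2, 3) == 5
    assert candidate(-1, 1) == 0
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion defines a function that passes the tests."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)  # define the candidate function
        exec(test, namespace)                 # define check()
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Any syntax error, runtime error, or failed assertion counts as a fail.
        return False


print(passes(PROMPT, COMPLETION, TEST, "add"))
```

The score is all-or-nothing per problem: a body that is almost right but fails one assertion counts the same as one that does not parse.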

What it tells you: Basic Python function-level coding ability. It's a real signal.
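How strong that signal is depends partly on the benchmark's size. As a rough sketch, assuming each of the 164 problems is an independent pass/fail trial (a simplification) and using a hypothetical pass rate of 90%, the sampling noise on a single score can be estimated like this:

```python
import math


def score_standard_error(accuracy: float, n_problems: int) -> float:
    """Binomial standard error of a pass rate measured on n_problems items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


# 0.90 is an illustrative pass rate, not a reported result.
se = score_standard_error(0.90, 164)
print(f"standard error: {se:.3f}")
print(f"approx. 95% interval: +/- {1.96 * se:.3f}")
```

Under these assumptions the standard error is a little over two percentage points, so a one- or two-point gap between two models on a benchmark this small is hard to distinguish from chance.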

What it doesn't tell you: Whether the model can work in a real codebase with multiple files, understand existing code patterns, write idiomatic code in your stack, or debug complex issues.

SWE-bench: A harder, more realistic benchmark. Real GitHub issues from popular Python repos. The model must write code that makes the failing tests pass. SWE-bench Verified is the more carefully curated version.

SWE-bench is a much better signal for "can this model actually help me code" than HumanEval. At the time of writing, frontier models score roughly 40-50% on SWE-bench Verified, which means they fail at least half of these real-world issue-fixing tasks even in controlled conditions.

MATH and AIME — Mathematical Reasoning

MATH: 12,500 competition math problems across difficulty levels. Requires multi-step reasoning, not just recall.

AIME: American Invitational Mathematics Examination problems. Harder, requires genuine mathematical insight.

These are better benchmarks for reasoning than MMLU because they require multi-step derivation. Recalled facts alone won't get a model to a high score, although the problems are public, so contamination is still possible. High scores here correlate with general reasoning ability more than most other benchmarks.

GPQA — Graduate-Level Science

GPQA (Graduate-Level Google-Proof Q&A): Questions that even PhD experts in the field find challenging, designed specifically so that they can't be answered by web search. Tests deep scientific reasoning.

This benchmark was introduced partly as a response to MMLU saturation. It's harder to game through training data contamination because the questions require genuine reasoning about hard scientific problems, not just recall.

MT-Bench and Chatbot Arena — Conversational Quality

MT-Bench: 80 multi-turn questions across writing, reasoning, math, and coding. GPT-4 judges responses and scores them. Useful for conversational ability, but GPT-4 as judge introduces its own biases (tends to prefer GPT-style outputs).

Chatbot Arena (LMSYS): Users chat with two anonymous models and vote for which gave a better response. Results aggregate into an Elo-style ranking. This is the most ecologically valid benchmark — it reflects real human preference on real tasks.
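A minimal sketch of how pairwise votes can turn into an Elo-style ranking is below. The model names, starting rating, and K-factor are illustrative assumptions, and the leaderboard's actual statistical method may differ from this textbook update rule.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def record_vote(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift rating points from loser to winner; upsets move more points."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical models and votes, each vote given as (winner, loser).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:
    record_vote(ratings, winner, loser)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Because each vote only moves ratings by a small amount, a stable ranking needs many votes per model pair, which is one reason newly added models can sit on the leaderboard with wide uncertainty.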

Why Chatbot Arena matters: It's the hardest to game. You can't specifically train on the test questions because the questions are real, varied, live user queries. The ranking reflects what people actually prefer using.

Arena rankings tend to track real-world quality more closely than static benchmark scores do. When you're unsure which model to use, check the Arena leaderboard at lmarena.ai.

The Contamination Problem

Most benchmark datasets are public. They've been scraped and indexed by the web crawlers that feed training data. When a model is trained on a web crawl that includes benchmark questions and answers, it learns to answer those specific questions — a form of test set contamination.
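One common way to look for this kind of leakage is to check whether long word sequences from a benchmark item appear verbatim in a training document. The sketch below is a simplified illustration under that assumption. The function names, the example strings, and the choice of n are hypothetical, and real contamination audits are considerably more involved.

```python
def ngrams(text: str, n: int) -> set:
    """All n-word sequences in the text, lowercased and whitespace-tokenized."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur verbatim in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical benchmark question and two hypothetical crawled documents.
question = "What is the capital of France and when was it founded"
leaked_doc = "quiz dump: what is the capital of france and when was it founded answer paris"
clean_doc = "Paris is a large city in Europe with a long history"

print(overlap_fraction(question, leaked_doc, n=5))
print(overlap_fraction(question, clean_doc, n=5))
```

A check like this only catches verbatim copies. Paraphrased or translated versions of a question slip through, which is part of why contamination is hard to rule out.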

This is why benchmark improvements sometimes don't translate to real-world improvements. A jump from 85% to 88% on MMLU might mean the new training run included more benchmark-adjacent data, not that the model got meaningfully smarter.

Benchmarks designed to resist contamination (GPQA, SWE-bench, Chatbot Arena) are more trustworthy signals.

How to Evaluate a Model for Your Actual Task

The honest take: no published benchmark will tell you which model is best for your specific use case. You need to evaluate on your actual task.

A practical evaluation process:

  1. Collect 50-100 representative examples of your task: real inputs that reflect the distribution you'll see in production.
  2. Write an evaluation rubric: what makes a response good or bad for your use case? This might be correctness (for extraction tasks), helpfulness (for chat), or format compliance.
  3. Run all candidate models on the same examples with the same prompts.
  4. Score the outputs, ideally with human review, or with a model-as-judge approach if scale requires it.
  5. Look at failures, not just averages. A model that's 90% correct but catastrophically wrong in a specific pattern might be worse than one that's 85% correct uniformly.
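The steps above can be sketched as a small harness. Everything here is a hypothetical placeholder: the examples, the exact-match rubric, and the two stub functions standing in for real model API calls. In practice you would swap in your own data, scoring rule, and model clients.

```python
import re
from collections import defaultdict
from typing import Callable

# Step 1: representative examples (hypothetical extraction task).
EXAMPLES = [
    {"input": "Order #123 shipped", "expected": "123", "category": "order"},
    {"input": "Refund for order #456", "expected": "456", "category": "order"},
    {"input": "Invoice INV-789 overdue", "expected": "789", "category": "invoice"},
    {"input": "Invoice INV-42 paid", "expected": "42", "category": "invoice"},
]


def stub_model_a(text: str) -> str:
    """Stand-in for a model that only handles '#<digits>' identifiers."""
    match = re.search(r"#(\d+)", text)
    return match.group(1) if match else ""


def stub_model_b(text: str) -> str:
    """Stand-in for a model that returns the first run of digits."""
    match = re.search(r"\d+", text)
    return match.group(0) if match else ""


def evaluate(model: Callable[[str], str], examples: list) -> dict:
    """Steps 3-5: run the model, score by exact match, group results by category."""
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, seen]
    failures = []
    for ex in examples:
        output = model(ex["input"])
        correct = output.strip() == ex["expected"]  # Step 2: the rubric
        totals[ex["category"]][0] += int(correct)
        totals[ex["category"]][1] += 1
        if not correct:
            failures.append({"input": ex["input"], "output": output})
    n_correct = sum(c for c, _ in totals.values())
    return {
        "accuracy": n_correct / len(examples),
        "by_category": {cat: c / n for cat, (c, n) in totals.items()},
        "failures": failures,
    }


for name, model in [("model_a", stub_model_a), ("model_b", stub_model_b)]:
    report = evaluate(model, EXAMPLES)
    print(name, report["accuracy"], report["by_category"])
```

Breaking accuracy down by category is what surfaces the pattern from step 5: the first stub fails every invoice example, a concentrated failure that an overall average would blur.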

This process takes a day or two. It's worth it for any non-trivial production deployment. Getting from "Claude scores higher on MMLU" to "Claude is better for our customer support use case" takes real evaluation.
