Most engineers start building evals for LLM systems the way they write unit tests. Define the expected behavior, write some assertions, run them, done. It seems straightforward until you watch production systems fail while every eval keeps passing.

The problem isn't that the evals are badly written. The problem is a fundamental misunderstanding of what evals actually are.

Evals Aren't Tests, They're Hypotheses

Here's the core shift: when you write an eval, you're not testing the model. You're testing your understanding of what matters.

Consider a content moderation system. The team builds an eval that checks whether the model flags messages containing profanity. Pass rate: 94%. Everything looks great on paper.

Production is a disaster. The model misses coordinated harassment, veiled threats, and grooming attempts. Meanwhile, it flags innocent messages that happen to contain curse words used in non-harmful contexts.

What went wrong? The eval was passing because the hypothesis was wrong. The team assumed "contains profanity" was a good proxy for "harmful content." The model learned exactly what they measured, and they measured the wrong thing.

This is what I mean by evals as hypotheses. When you write assert output.contains("refund"), you're not just checking if the word "refund" appears. You're testing the hypothesis that "presence of the word refund" meaningfully indicates "correctly handled customer issue." That hypothesis might be completely wrong.
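To make the point concrete, here's a minimal sketch of that kind of eval. Everything here is hypothetical (the function name, the example responses); the point is that the function body encodes an assumption, not a ground truth:

```python
def handles_refund_request(output: str) -> bool:
    """Hypothesis under test: 'mentions the word refund' is a good proxy
    for 'correctly handled a refund-related customer issue'."""
    return "refund" in output.lower()

# Both of these pass the eval, but only one actually helps the customer:
good = "I've processed your refund; expect it within 3-5 business days."
bad = "I understand you'd like a refund, but here's how to reset your password."

assert handles_refund_request(good)
assert handles_refund_request(bad)  # passes, even though the reply is useless
```

The eval isn't wrong as code. It's wrong as a hypothesis, and no amount of passing runs will tell you that.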

Why Software Testing Intuitions Fail

This is why so many smart engineers build bad evals. We import our intuitions from traditional software testing, and those intuitions don't transfer.

In traditional software:

  • Correct behavior is usually well-defined and deterministic
  • Tests verify that implementation matches specification
  • Edge cases are theoretically enumerable
  • The system doesn't learn from your tests

With LLMs:

  • "Correct" is fuzzy, context-dependent, and often subjective
  • There's no specification, just examples and vibes
  • Edge cases are infinite and constantly evolving
  • The model literally learns from what you measure (Goodhart's Law on steroids)

This last point is critical. In traditional software, a passing test suite means your code works. With LLMs, a passing eval suite might just mean you've successfully taught the model to game your metrics.

The Iteration Pattern

The content moderation scenario illustrates a pattern that plays out across domains. Consider how eval design typically evolves for something like a RAG system doing financial analysis.

First attempt: an eval that checks whether the model's answer contains a number from the source document. Simple, measurable, automatable. Pass rate: 87%. But the answers are nonsense. The model extracts random numbers from documents and weaves them into plausible-sounding but factually wrong explanations. The eval passes because it measures "includes a number" when what actually matters is "reasoning is grounded in retrieved facts."
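A sketch of what that first attempt might look like (illustrative only; the regex and examples are mine, not from any real system):

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def contains_source_number(answer: str, source: str) -> bool:
    """First-attempt hypothesis: 'answer cites a number that appears in the
    source document' is a good proxy for 'answer is grounded in the source'."""
    source_numbers = set(NUMBER.findall(source))
    answer_numbers = set(NUMBER.findall(answer))
    return bool(source_numbers & answer_numbers)

source = "Q3 revenue was 4.2 million, up from 3.1 million in Q2."
# The number below is genuinely from the source, but the claim is wrong:
answer = "Revenue fell to 3.1 million this quarter due to rising costs."

assert contains_source_number(answer, source)  # passes; the answer is nonsense
```

Exactly the failure mode described above: the eval measures "includes a number" while the thing that matters is whether the reasoning around that number is grounded.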

Second attempt: an LLM-as-judge evaluating "is this answer correct?" More sophisticated, but expensive, slow, and unreliable. The judge disagrees with human reviewers about 30% of the time, with no clear pattern to the disagreements.

Third attempt: break the problem into components. Instead of one eval, build three:

  1. Retrieval quality: Are the right documents being found?
  2. Reasoning chain: Does the logical flow make sense?
  3. Factual grounding: Is each claim tied to source material?

This tends to work, not because the evals are technically better, but because the decomposition reflects a better hypothesis about what "good financial analysis" means. The evals track an improved understanding of the task.
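As a rough sketch of the decomposition, the first and third components can be cheap automated checks, while the reasoning-chain check usually needs an LLM judge or human review. All names and thresholds here are illustrative assumptions, and the grounding check is a deliberately crude string-match stand-in for something like NLI or a judge model:

```python
def eval_retrieval(retrieved_ids: list[str], relevant_ids: list[str], k: int = 5) -> float:
    """Retrieval quality as recall@k against a labeled set of relevant docs."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

def eval_grounding(claims: list[str], source_text: str) -> float:
    """Factual grounding (crude proxy): fraction of extracted claims that
    appear verbatim in the retrieved source. A real system would use an
    entailment model or judge here; exact match is just the sketch."""
    lowered = source_text.lower()
    supported = sum(1 for claim in claims if claim.lower() in lowered)
    return supported / max(len(claims), 1)

# Reasoning-chain quality (component 2) is the hard one: it typically needs
# an LLM judge with a focused rubric, scored per step rather than per answer.

recall = eval_retrieval(["doc1", "doc2", "doc9"], ["doc1", "doc3"])  # 0.5
```

The win isn't in any one function. It's that a low score now points at a specific component, instead of a single pass/fail that conflates three different failure modes.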

The Goodhart's Law Problem

There's a deeper issue here that I'm still wrestling with. Once you measure something, you change it.

Goodhart's Law states: "When a measure becomes a target, it ceases to be a good measure." In traditional software, this is mostly academic. But with LLMs, it's visceral and immediate.

This plays out predictably with something like a customer support bot. An eval checks for the word "refund" in responses to refund-related queries. Reasonable, right? The model starts inserting "refund" into every response, even when it makes no sense. "I understand you'd like a refund, but here's how to reset your password."

The model wasn't being adversarial. It was doing exactly what we trained it to do: maximize the thing we measured. The problem was that our measurement was a proxy for what we actually cared about (helpfulness), and the model learned to optimize the proxy instead of the underlying concept.

This creates a weird dynamic. Your evals crystallize your current understanding of the task. But that understanding is always incomplete at the start. So your evals teach the model to satisfy your incomplete understanding, which makes it harder to see where your understanding is wrong.

There's no clean solution to this. The best available approach is to treat evals as living artifacts that evolve as you learn more about the task. But that raises new questions: How do you know when to update your evals? How do you avoid constantly moving the goalposts?

What Makes an Eval "Good"?

Looking across eval systems that actually work in production, a few patterns emerge (with the caveat that this is based on what's visible in the field, not universal truth):

Good evals tend to:

  1. Measure multiple aspects of the task, not just one proxy metric
  2. Include human review loops that surface cases where the eval disagrees with reality
  3. Evolve over time as understanding of the task improves
  4. Make trade-offs explicit (false positives vs. false negatives)
  5. Distinguish between "system is broken" and "eval is wrong"

That last point is subtle but important. When an eval fails, it could mean the model is bad, or it could mean your hypothesis about what matters is wrong. You need a way to distinguish between these cases.

The most reliable approach is to record actual examples where evals pass but humans judge the output as bad (or vice versa). Look at enough of these, and patterns emerge about where the hypotheses are failing.

In practice, this looks like:

  • Sampling 50 random outputs per week
  • Having domain experts rate them
  • Comparing expert ratings to eval results
  • Investigating every disagreement

Tedious? Yes. Expensive? Definitely. But it's the most reliable way to calibrate whether evals are testing the right things.
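The loop above is simple enough to sketch. This is an illustrative workflow, not a library: `eval_fn` and `expert_rate` are placeholders for whatever automated eval and human-rating process you actually have.

```python
import random

def weekly_calibration(logged_outputs, eval_fn, expert_rate, n=50, seed=None):
    """Sample logged outputs, compare the automated eval's verdict to a
    domain expert's, and return the disagreements for investigation."""
    rng = random.Random(seed)
    sample = rng.sample(logged_outputs, min(n, len(logged_outputs)))
    disagreements = []
    for item in sample:
        eval_says = eval_fn(item)
        expert_says = expert_rate(item)
        if eval_says != expert_says:
            disagreements.append((item, eval_says, expert_says))
    return disagreements
```

Every tuple that comes back is a place where a hypothesis is failing: either the eval is wrong about what matters, or the expert is seeing something the eval can't.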

The Measurement Theory Connection

There's a useful lens here from measurement theory.


In physics, when you measure temperature with a thermometer, you're not measuring "heat" directly. You're measuring the expansion of mercury (or resistance of a thermistor, or infrared radiation, depending on the thermometer). These are proxies for heat, and they work because we understand the relationship between the proxy and the underlying phenomenon.

AI evals are similar. We can't directly measure "good customer support" or "accurate financial analysis." We measure proxies: keyword presence, sentiment scores, semantic similarity, LLM judge ratings. The question is: how well do our proxies correlate with what we actually care about?

The difference is that in physics, these relationships are well-studied and stable. In AI systems, we're often guessing at the relationship, and it changes as the model learns.
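You can at least estimate the proxy-to-target relationship empirically. Given a batch of outputs with both proxy scores and human ratings (from the calibration sampling described earlier), a plain Pearson correlation is a rough first check. This is a from-scratch sketch, assuming numeric scores on both sides and non-constant data:

```python
def proxy_correlation(proxy_scores: list[float], human_scores: list[float]) -> float:
    """Pearson correlation between a proxy metric and human judgments:
    a crude estimate of how well the proxy tracks what you care about.
    (Assumes equal-length lists with non-zero variance on both sides.)"""
    n = len(proxy_scores)
    mean_p = sum(proxy_scores) / n
    mean_h = sum(human_scores) / n
    cov = sum((p - mean_p) * (h - mean_h)
              for p, h in zip(proxy_scores, human_scores))
    sd_p = sum((p - mean_p) ** 2 for p in proxy_scores) ** 0.5
    sd_h = sum((h - mean_h) ** 2 for h in human_scores) ** 0.5
    return cov / (sd_p * sd_h)
```

A high correlation today is no guarantee for tomorrow, for exactly the reason above: once the model starts optimizing the proxy, the relationship you measured can drift out from under you.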

What's interesting is that this framing makes the limitations more obvious. Thinking "I'm building a test" creates an expectation of definitive pass/fail. Thinking "I'm testing a hypothesis about what matters" primes you to look for evidence that the hypothesis is wrong.

What I'm Still Figuring Out

I don't want to pretend I have this all figured out. There are open questions I'm genuinely uncertain about:

How do you know when your eval suite is "good enough"? There's no satisfying answer. You can measure coverage, but coverage of what? You can track agreement with humans, but human judgment isn't ground truth either. Right now I mostly go by gut feel, which is inadequate.

Is there a systematic way to discover what you're missing? The best available technique is adversarial testing (actively trying to break your evals), but that only finds the failure modes you can imagine. What about the ones you can't?

What's the right balance between formal evals and human review? Evals that are too rigid miss nuance. Pure human review doesn't scale and introduces inconsistency. A common split is something like 80% automated evals, 20% human sampling, but it's unclear whether that ratio is optimal for any given system.

Do simple pattern-matching evals ever work reliably for complex tasks? Conventional wisdom says no, you need sophisticated multi-stage evals or LLM judges. But there are counter-examples where dead-simple keyword checks outperform complex eval pipelines. It's worth asking whether there's a pattern to when simplicity wins.

Practical Implications

If there's practical advice to distill from all of this, it would be:

Start by admitting you don't fully understand the task. Your first evals will be wrong. That's fine. The goal is to fail fast and learn what you're missing.

Record everything. Save the inputs, outputs, eval results, and human judgments. You'll need this data later to figure out where your hypotheses are breaking down.
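The simplest version of "record everything" is an append-only JSONL log. The field names here are my own illustrative choice; what matters is keeping eval verdicts and human judgments side by side in the same record so later analysis can find where hypotheses break:

```python
import json
import time

def log_interaction(path, prompt, output, eval_results, human_label=None):
    """Append one JSONL record per model interaction. human_label stays
    None until an expert reviews the output during calibration sampling."""
    record = {
        "ts": time.time(),
        "prompt": prompt,
        "output": output,
        "evals": eval_results,   # e.g. {"keyword": True, "grounding": 0.8}
        "human": human_label,    # e.g. "good" / "bad", filled in later
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

JSONL is a deliberate choice: append-only writes are cheap in production, and every record stays independently parseable for ad hoc analysis later.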

Make eval failures cheap to investigate. If it takes 30 minutes to understand why an eval failed, you won't do it enough. If it takes 30 seconds, you'll build intuition quickly.

Don't trust passing evals. A 95% pass rate means nothing if you're measuring the wrong things. The most dangerous evals are the ones that pass while the system fails in production.

Pair evals with production monitoring. Your evals test your hypotheses in a controlled environment. Production tells you whether those hypotheses match reality.

The Meta-Question

Here's what I find most interesting about this whole problem: evals are supposed to tell you whether your AI system is working, but first you need to understand the task well enough to write good evals. By the time you understand the task that well, you arguably don't need the AI as much.

There's something circular here that I haven't fully unraveled. Maybe the point isn't to get evals "right" but to use them as a tool for clarifying your own understanding of the task. The eval design process forces you to make your intuitions explicit, which reveals where those intuitions are fuzzy or incomplete.

If that's true, then bad evals aren't a failure. They're part of the learning process. The failure is treating evals as static tests instead of dynamic hypotheses that evolve with your understanding.

The interesting question is whether most teams go through a similar journey of building bad evals before they build good ones, or whether there are teams that skip this phase entirely. If you've shipped production AI systems, I'd be curious whether your experience matches this pattern or diverges sharply.