I used to think building evals for LLM systems was like writing unit tests. You define the expected behavior, write some assertions, run them, and you're done. Simple, right?

It took me three months of watching production systems fail while my evals kept passing to realize I had the entire mental model wrong. The problem wasn't that I was bad at writing evals. The problem was that I fundamentally misunderstood what evals actually are.

Evals Aren't Tests, They're Hypotheses

Here's the shift that changed everything for me: when you write an eval, you're not testing the model. You're testing your understanding of what matters.

Let me explain with a concrete example. I spent two months building evals for a content moderation system. We had an eval that checked whether the model flagged messages containing profanity. Pass rate: 94%. Everything looked great on paper.

Production was a disaster. The model was missing coordinated harassment, veiled threats, and grooming attempts. Meanwhile, it was flagging innocent messages that happened to contain curse words used in non-harmful contexts.

What went wrong? Our eval was passing because our hypothesis was wrong. We'd assumed "contains profanity" was a good proxy for "harmful content." The model learned exactly what we measured, and we measured the wrong thing.

This is what I mean by evals as hypotheses. When you write assert "refund" in output, you're not just checking if the word "refund" appears. You're testing the hypothesis that "presence of the word refund" meaningfully indicates "correctly handled customer issue." That hypothesis might be completely wrong.
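Here's a toy sketch of what that looks like in code. The function name and example strings are mine, not from any real system, but they show how the hypothesis gets baked into the check:

```python
def handled_refund_correctly(output: str) -> bool:
    # Hypothesis: "mentions the word refund" is a proxy for
    # "correctly handled the customer's refund request".
    return "refund" in output.lower()

# The proxy passes an actively unhelpful answer...
assert handled_refund_correctly("A refund? We never give refunds.")

# ...and fails a genuinely helpful one that happens to phrase
# things differently.
assert not handled_refund_correctly(
    "Your money will be returned within 5 business days."
)
```

The eval is "passing" and "failing" exactly as written; the bug is in the hypothesis, not the code.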

Why Software Testing Intuitions Fail

I think this is why so many smart engineers (including past me) build bad evals. We import our intuitions from traditional software testing, and those intuitions don't transfer.

In traditional software:

  • Correct behavior is usually well-defined and deterministic
  • Tests verify that implementation matches specification
  • Edge cases are theoretically enumerable
  • The system doesn't learn from your tests

With LLMs:

  • "Correct" is fuzzy, context-dependent, and often subjective
  • There's no specification, just examples and vibes
  • Edge cases are infinite and constantly evolving
  • The model literally learns from what you measure (Goodhart's Law on steroids)

This last point is critical. In traditional software, a passing test suite means your code works. With LLMs, a passing eval suite might just mean you've successfully taught the model to game your metrics.

The RAG System That Taught Me This

The content moderation example was my first hint, but I really internalized this lesson building evals for a RAG system doing financial analysis.

First attempt: I wrote an eval that checked whether the model's answer contained a number from the source document. Simple, measurable, automatable. Pass rate: 87%.

The answers were complete nonsense. The model would extract random numbers from documents and weave them into plausible-sounding but factually wrong explanations. My eval was passing because I'd measured "includes a number" when what I actually cared about was "reasoning is grounded in retrieved facts."
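That first attempt amounted to something like the following. This is a reconstruction, not the actual code, but it captures the flaw: any shared number counts as "grounded."

```python
import re

def answer_contains_source_number(answer: str, source: str) -> bool:
    # Hypothesis: an answer that repeats a number from the source
    # document is grounded in that document. (It isn't, necessarily.)
    source_numbers = set(re.findall(r"\d+(?:\.\d+)?", source))
    answer_numbers = set(re.findall(r"\d+(?:\.\d+)?", answer))
    return bool(source_numbers & answer_numbers)

source = "Q3 revenue was 4.2 billion, up 12 percent year over year."

# A factually wrong explanation still passes, because it reuses "12".
nonsense = "The company lost 12 billion due to currency headwinds."
assert answer_contains_source_number(nonsense, source)
```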

Second attempt: I used an LLM-as-judge to evaluate "is this answer correct?" More sophisticated, right? Except it was expensive, slow, and unreliable. The judge disagreed with human reviewers about 30% of the time, and I had no idea why.

Third attempt: I broke the problem down into components. Instead of one eval, I built three:

  1. Retrieval quality: Are the right documents being found?
  2. Reasoning chain: Does the logical flow make sense?
  3. Factual grounding: Is each claim tied to source material?

This actually worked. Not because the evals themselves were technically better, but because I'd developed a better hypothesis about what "good financial analysis" meant. The evals reflected my improved understanding of the task.
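A minimal sketch of that decomposition is below. The scoring functions are deliberately crude stand-ins: real retrieval metrics, a real reasoning-chain scorer (likely an LLM judge or rubric), and real claim-to-source matching would each be far more involved.

```python
from dataclasses import dataclass

@dataclass
class EvalReport:
    retrieval_quality: float   # Are the right documents being found?
    reasoning_chain: float     # Does the logical flow make sense?
    factual_grounding: float   # Is each claim tied to source material?

def evaluate(retrieved_ids, relevant_ids, claims, supported_claims) -> EvalReport:
    # Retrieval quality as recall of known-relevant documents.
    recall = len(set(retrieved_ids) & set(relevant_ids)) / max(len(relevant_ids), 1)
    # Grounding as the fraction of claims with an identified source passage.
    grounding = len(supported_claims) / max(len(claims), 1)
    # Reasoning-chain scoring is the hard part; stubbed here as a
    # placeholder for a judge- or rubric-based score.
    reasoning = 1.0
    return EvalReport(recall, reasoning, grounding)

report = evaluate(["doc1", "doc3"], ["doc1", "doc2"], ["c1", "c2"], ["c1"])
assert report.retrieval_quality == 0.5
assert report.factual_grounding == 0.5
```

The value isn't in these particular formulas; it's that a failure now points at a specific component instead of a single opaque pass/fail.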

The Goodhart's Law Problem

There's a deeper issue here that I'm still wrestling with. Once you measure something, you change it.

Goodhart's Law states: "When a measure becomes a target, it ceases to be a good measure." In traditional software, this is mostly academic. But with LLMs, it's visceral and immediate.

I watched this play out with a customer support bot. We had an eval that looked for the word "refund" in responses to refund-related queries. Reasonable, right? The model started inserting "refund" into every response, even when it made no sense. "I understand you'd like a refund, but here's how to reset your password."

The model wasn't being adversarial. It was doing exactly what we trained it to do: maximize the thing we measured. The problem was that our measurement was a proxy for what we actually cared about (helpfulness), and the model learned to optimize the proxy instead of the underlying concept.

This creates a weird dynamic. Your evals crystallize your current understanding of the task. But that understanding is always incomplete at the start. So your evals teach the model to satisfy your incomplete understanding, which makes it harder to see where your understanding is wrong.

I don't have a clean solution to this. The best I've found is to treat evals as living artifacts that evolve as you learn more about the task. But that raises new questions: How do you know when to update your evals? How do you avoid constantly moving the goalposts?

What Makes an Eval "Good"?

Based on maybe 50 different eval systems I've built or debugged, here's what I've noticed about the ones that actually work (with the caveat that this is based on my limited experience, not universal truth):

Good evals tend to:

  1. Measure multiple aspects of the task, not just one proxy metric
  2. Include human review loops that surface cases where the eval disagrees with reality
  3. Evolve over time as understanding of the task improves
  4. Make trade-offs explicit (false positives vs. false negatives)
  5. Distinguish between "system is broken" and "eval is wrong"

That last point is subtle but important. When an eval fails, it could mean the model is bad, or it could mean your hypothesis about what matters is wrong. You need a way to distinguish between these cases.

The best approach I've found is to record actual examples where evals pass but humans judge the output as bad (or vice versa). Look at enough of these, and patterns emerge about where your hypotheses are failing.

For the financial RAG system, this looked like:

  • Sampling 50 random outputs per week
  • Having domain experts rate them
  • Comparing expert ratings to eval results
  • Investigating every disagreement

Tedious? Yes. Expensive? Definitely. But it was the only reliable way to calibrate whether my evals were testing the right things.
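The weekly loop above can be sketched roughly like this; the record format and field names are mine, and the "rating >= 4 means good" threshold is an arbitrary assumption:

```python
import random

def calibration_sample(records, k=50, seed=0):
    # Sample k random outputs for expert review.
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))

def disagreements(samples):
    # A disagreement: the eval passed but the expert rated the output
    # bad, or vice versa. Each one is evidence that either the system
    # or the eval is wrong; only investigation tells you which.
    return [r for r in samples if r["eval_pass"] != (r["expert_rating"] >= 4)]

records = [
    {"id": 1, "eval_pass": True,  "expert_rating": 2},  # eval was fooled
    {"id": 2, "eval_pass": True,  "expert_rating": 5},  # agreement
    {"id": 3, "eval_pass": False, "expert_rating": 4},  # eval too strict
]
assert [r["id"] for r in disagreements(records)] == [1, 3]
```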

The Measurement Theory Connection

I've been thinking about this through the lens of measurement theory, which I learned about in a completely different context (physical sciences) but seems surprisingly relevant here.

In physics, when you measure temperature with a thermometer, you're not measuring "heat" directly. You're measuring the expansion of mercury (or resistance of a thermistor, or infrared radiation, depending on the thermometer). These are proxies for heat, and they work because we understand the relationship between the proxy and the underlying phenomenon.

AI evals are similar. We can't directly measure "good customer support" or "accurate financial analysis." We measure proxies: keyword presence, sentiment scores, semantic similarity, LLM judge ratings. The question is: how well do our proxies correlate with what we actually care about?

The difference is that in physics, these relationships are well-studied and stable. In AI systems, we're often guessing at the relationship, and it changes as the model learns.

What's interesting is that this framing makes the limitations more obvious. When I think "I'm building a test," I expect it to tell me definitively whether something passed or failed. When I think "I'm testing a hypothesis about what matters," I'm primed to look for evidence that my hypothesis is wrong.

What I'm Still Figuring Out

I don't want to pretend I have this all figured out. There are open questions I'm genuinely uncertain about:

How do you know when your eval suite is "good enough"? I've never found a satisfying answer. You can measure coverage, but coverage of what? You can track agreement with humans, but human judgment isn't ground truth either. Right now I mostly go by gut feel, which feels inadequate.

Is there a systematic way to discover what you're missing? The best technique I've found is adversarial testing (actively trying to break your evals), but that only finds the failure modes you can imagine. What about the ones you can't?

What's the right balance between formal evals and human review? Evals that are too rigid miss nuance. Pure human review doesn't scale and introduces inconsistency. I've settled on something like 80% automated evals, 20% human sampling, but I have no idea if that ratio is optimal.

Do simple pattern-matching evals ever work reliably for complex tasks? Conventional wisdom says no, you need sophisticated multi-stage evals or LLM judges. But I've seen counter-examples where dead-simple keyword checks outperformed complex eval pipelines. I'm curious whether there's a pattern to when simplicity wins.

Practical Implications

If I could go back and give advice to myself three months ago (though I'm not sure I would have listened), here's what I'd say:

Start by admitting you don't fully understand the task. Your first evals will be wrong. That's fine. The goal is to fail fast and learn what you're missing.

Record everything. Save the inputs, outputs, eval results, and human judgments. You'll need this data later to figure out where your hypotheses are breaking down.
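"Record everything" can be as simple as appending one JSON line per eval run. The schema below is just a suggestion, not a standard:

```python
import json
import time
from pathlib import Path

def log_eval_run(path, *, input_text, output_text, eval_name,
                 eval_pass, human_label=None):
    # One JSON object per line: easy to grep, easy to load later
    # when you need to find where your hypotheses broke down.
    record = {
        "ts": time.time(),
        "input": input_text,
        "output": output_text,
        "eval": eval_name,
        "pass": eval_pass,
        "human_label": human_label,  # filled in later by reviewers
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_eval_run("eval_log.jsonl", input_text="I want a refund",
             output_text="Refund issued.", eval_name="refund_keyword",
             eval_pass=True)
last = json.loads(Path("eval_log.jsonl").read_text().splitlines()[-1])
assert last["pass"] is True
```

Leaving a human_label slot empty from day one makes the later calibration step (comparing evals to expert judgment) a query instead of a migration.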

Make eval failures cheap to investigate. If it takes 30 minutes to understand why an eval failed, you won't do it enough. If it takes 30 seconds, you'll build intuition quickly.

Don't trust passing evals. A 95% pass rate means nothing if you're measuring the wrong things. The most dangerous evals are the ones that pass while the system fails in production.

Pair evals with production monitoring. Your evals test your hypotheses in a controlled environment. Production tells you whether those hypotheses match reality.

The Meta-Question

Here's what I find most interesting about this whole problem: evals are supposed to tell you whether your AI system is working, but first you need to understand the task well enough to write good evals. By the time you understand the task that well, you arguably don't need the AI as much.

There's something circular here that I haven't fully unraveled. Maybe the point isn't to get evals "right" but to use them as a tool for clarifying your own understanding of the task. The eval design process forces you to make your intuitions explicit, which reveals where those intuitions are fuzzy or incomplete.

If that's true, then bad evals aren't a failure. They're part of the learning process. The failure is treating evals as static tests instead of dynamic hypotheses that evolve with your understanding.

I'm curious whether others see this pattern. Do most teams go through a similar journey of building bad evals before they build good ones? Or are there teams that skip this phase entirely? If you've shipped production AI systems, I'd love to know whether your experience matches mine or diverges sharply.