Every component eval passed. Every single one.

The classifier identified claim types correctly 91% of the time. The extraction model pulled the right fields 88% of the time. The fraud detection module flagged suspicious patterns with decent precision. I stared at those numbers for a while, felt the relief of green dashboards, and deployed the full insurance claims agent to production.

Within the first week it confidently approved a claim that should have been flagged. A property damage claim for $4,800, normally auto-approved at that amount, but this was the third similar claim from the same address in six months. The classifier got the type right. The extractor got the dollar amount right. The routing logic combined those two correct outputs and hit an edge case that sent it to auto-approve instead of fraud review. Each step was individually correct. The trajectory was wrong.

What bothered me wasn't the failure itself. Systems fail. What bothered me was the feeling that preceded it: the false confidence. I had looked at component-level metrics, seen passing numbers, and concluded the system worked. I had mistaken "the pieces work" for "the whole works," and the gap between those claims is exactly where agents break.

I wrote "Evals Are Hypotheses" about discovering that evals test your understanding, not the model. That insight was enough for single LLM calls. Agents taught me it was only half the lesson.

Where Agents Actually Fail

When I audit agent failures now, roughly 70% of the real problems happen in the spaces between tool calls. The model interprets a tool's output incorrectly. It loses context from three steps ago. It makes a reasonable decision at step 5 that contradicts a constraint established at step 2. The errors are relational, not absolute.

I thought this might be specific to my systems until I started reading the research. Cemri, Pan, and Yang analyzed over 1,600 failure traces across seven multi-agent frameworks and built a taxonomy of 14 distinct failure modes. The failures clustered at the seams between components: system design issues, inter-agent misalignment, task verification gaps. ToolBench found something that stopped me cold: 75% of agent trajectories suffered from incompleteness or hallucinations even when the final answers were sometimes correct. The paths were broken. The destinations occasionally weren't.

That last finding is the one I keep coming back to. If you only check the final output, you might conclude your agent is working 60% of the time. If you check the trajectories, you discover 75% of them are broken, and the 60% "success" rate is partly luck: broken paths that happened to arrive somewhere acceptable. It's like grading a student's math test by checking only the final answer. They might get the right answer while doing the arithmetic wrong, and on the next problem, where the errors don't cancel out, they'll fail.

The Arithmetic That Scares Me

If a single LLM call is 90% accurate, you have a 10% error rate. Chain ten 90%-accurate steps together and your trajectory accuracy drops to roughly 0.9^10: about 35%. You go from "pretty good" to "wrong most of the time" just by composing steps.
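The compounding is easy to check with a few lines. A minimal sketch, assuming step errors are independent and uniform across the chain, which is exactly the simplification that turns out to be misleading:

```python
# Minimal sketch: per-step accuracy compounded across a chain,
# assuming step errors are independent and uniform.

def trajectory_accuracy(step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in the chain succeeds."""
    return step_accuracy ** num_steps

for steps in (1, 3, 5, 10):
    print(f"{steps:2d} steps at 90% each -> "
          f"{trajectory_accuracy(0.9, steps):.0%} of trajectories clean")
```

Ten steps at 90% lands at roughly 35%, which is where the "wrong most of the time" number comes from.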

But this simple model is misleading, because errors don't compound uniformly. I've started noticing three distinct propagation behaviors, and the differences between them matter more than I initially realized:

Self-correcting errors. The agent makes a wrong call at step 3, but step 4's tool output contradicts the assumption, and the agent adjusts. These are benign. They're like a wrong turn where you notice the street name doesn't match and reroute.

Invisible errors. The agent proceeds confidently on a wrong assumption, and nothing in subsequent steps reveals the mistake. The trajectory looks clean. The output looks reasonable. Only a human who knows the domain would spot the gap. These are what made the insurance claim failure so unsettling. Everything looked right.

Amplifying errors. An early error pushes the agent into a region of the decision space where subsequent decisions are all subtly wrong. I had an agent misidentify a document type in step 1. Because it thought it was processing a different kind of document, every subsequent extraction, validation, and routing decision was calibrated for the wrong task. Eight steps of confident, internally consistent, completely wrong work. The error at step 1 was minor. The trajectory was catastrophic.

The severity of an error at the point it occurs tells you almost nothing about its impact on the final outcome. A low-severity amplifying error can be worse than a high-severity self-correcting one. I'm not confident this taxonomy is complete; there might be propagation patterns I'm not seeing. But it's been more useful for prioritizing which failures to investigate than any severity-based approach I've tried.

What genuinely worries me: I have no reliable way to distinguish invisible errors from correct trajectories without reading the full trace myself. They look identical from the outside. I can catch self-correcting errors because the backtracking shows up in the trace. I can catch amplifying errors because the cascading wrongness eventually produces a visibly bad output. But invisible errors, the ones where the agent is wrong and nothing downstream reveals it, those I only find when I happen to be reading traces for other reasons. I don't know how many are hiding in trajectories I've never reviewed.

Three Layers (How I Think About It Now)

After hitting these problems repeatedly, I've organized my thinking into three evaluation layers. This is my current best model, not a prescriptive framework, and I'd revise it if I found failures it can't account for.

Layer 1: Decision-point evals. Does the agent make reasonable decisions at individual choice points, evaluated in context? Not "given this input, is this output reasonable?" but "given everything the agent has done so far, is this next step reasonable?" Context changes what counts as a good decision.

Layer 2: Trajectory evals. Does the overall path represent a coherent strategy? OpenAI's "Let's Verify Step by Step" demonstrated this principle in math: process supervision (feedback on each reasoning step) significantly outperformed outcome supervision (only checking the final answer), with the process-supervised model solving 78% of problems from a representative MATH subset. The same principle applies to agents: checking each step in sequence catches failures that checking only the final output misses.

Layer 3: Outcome evals with trajectory constraints. Did the agent achieve the goal and get there acceptably? An agent that approves the right claims but accesses a database it shouldn't have has a good outcome and an unacceptable trajectory. This is where safety and compliance live, and it's the hardest layer to build evals for.

Most teams I've talked to only do Layer 1. Almost nobody evaluates trajectories systematically. I think this is the biggest gap in how the industry evaluates agents.

The Case Against (Which Is Partially Right)

The strongest objection: trajectory evaluation is overkill for most production agents. The composition problem is real but overstated. Self-correcting errors dominate. The engineering cost of recording and analyzing full traces doesn't justify the marginal improvement over outcome-only evaluation.

This is partially right. For simple, short-chain agents (two or three steps, well-defined tools, narrow domains), outcome evaluation probably is sufficient. Not every agent needs trajectory evals. And the cost argument is real: I've seen teams invest weeks in trajectory evaluation infrastructure for agents that would have been adequately served by careful outcome testing.

Where the objection fails is in its assumption about error distributions. In my experience, self-correcting errors dominate only in agents with good error-recovery design, which is itself a form of trajectory-level thinking. Agents without explicit recovery logic tend toward invisible and amplifying errors, which outcome-only evaluation systematically misses. The 75% trajectory failure rate from ToolBench and the 60%-to-25% consistency drop that Simmering documented suggest the composition problem isn't niche. It's the default for agents beyond a certain complexity.

The question is where the threshold falls. I think it's lower than most teams assume, probably around 4-5 steps with branching logic. But if someone showed me data that self-correcting errors dominate even in complex agents, I'd revise my three-layer model significantly.

What Changed My Practice

The single highest-value change: recording full traces for every agent run. Every tool call, every intermediate result, every decision point. Last month, an agent started approving refunds it shouldn't have. Without the trace, I would have spent days reproducing the issue. With it, I found the problem in twenty minutes: a tool was returning dates in an unexpected timezone format, and the agent was comparing timestamps that were silently off by five hours. The tool returned valid data. The agent made a valid comparison. The bug lived in the relationship between the two steps, invisible to any eval that checked them independently.
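The bug class is easy to reproduce. An illustrative reconstruction, not the production code: the names and values here are hypothetical, but the mechanism is the same. One timestamp is genuinely UTC, the other is local time mislabeled as UTC, and every comparison between them is silently off by five hours.

```python
# Hypothetical reconstruction of the timezone bug: a 24-hour refund
# window checked against a timestamp whose zone label is wrong.

from datetime import datetime, timedelta, timezone

def refund_window_open(purchase_ts: datetime, now: datetime) -> bool:
    """Approve only within 24 hours of purchase; trusts both zone labels."""
    return (now - purchase_ts) < timedelta(hours=24)

purchase = datetime(2024, 3, 1, 20, 0, tzinfo=timezone.utc)

# The tool labels this UTC, but the instant is really 19:30 Eastern
# (UTC-5), i.e. 00:30 UTC the next day -- 28.5 hours after purchase.
mislabeled_now = datetime(2024, 3, 2, 19, 30, tzinfo=timezone.utc)
actual_now = datetime(2024, 3, 2, 19, 30, tzinfo=timezone(timedelta(hours=-5)))

print(refund_window_open(purchase, mislabeled_now))  # True: refund wrongly approved
print(refund_window_open(purchase, actual_now))      # False: window actually closed
```

Both the tool and the comparison are individually valid, which is why no step-level eval flags it.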

From there I started building what I call trajectory assertions: constraints on how steps relate to each other. Not "the output should contain X" but "the agent should never call the approval endpoint after receiving a fraud flag" and "if the agent requests clarification, it must use the clarified information within two subsequent tool calls." I have 23 of these for my most complex agent. Twelve were written in response to production failures. The other eleven came from staring at traces and asking "what invariant would have caught this earlier?"
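The two assertions quoted above can be sketched as plain functions over a recorded trace. This assumes a trace is a list of (step_type, name) events; the event names are illustrative, not my actual schema.

```python
# Sketch of trajectory assertions as checks over a recorded trace.
# A trace is assumed to be a list of (step_type, name) pairs, e.g.
# ("tool_call", "approve_claim") or ("tool_result", "fraud_flag").

def no_approval_after_fraud_flag(trace) -> bool:
    """Once a fraud flag appears, the approval endpoint is off-limits."""
    flagged = False
    for step_type, name in trace:
        if step_type == "tool_result" and name == "fraud_flag":
            flagged = True
        elif flagged and step_type == "tool_call" and name == "approve_claim":
            return False
    return True

def clarification_used_promptly(trace, window: int = 2) -> bool:
    """A clarification request must be consumed within `window` tool calls."""
    calls = [name for step_type, name in trace if step_type == "tool_call"]
    for i, name in enumerate(calls):
        if name == "request_clarification":
            if "use_clarification" not in calls[i + 1 : i + 1 + window]:
                return False
    return True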

The practice that felt most uncomfortable to adopt: deliberate fault injection. Forcing tools to fail, return ambiguous data, or time out at each step, and watching what happens. It felt uncomfortable because the results were consistently humbling. Agents that looked solid on the happy path fell apart the moment anything went wrong. I kept finding that my agents had no real error-recovery strategy; they just happened to work when everything around them worked. Discovering that about systems I'd built and deployed was not a great feeling.
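A minimal version of that fault injection, assuming tools are plain callables. The wrapper and the mix of failure types are a sketch, not a full harness:

```python
# Sketch: wrap a tool so it sometimes times out, errors, or returns
# an ambiguous empty payload, then watch how the agent recovers.

import random

def inject_faults(tool, fail_rate: float = 0.2, rng=None):
    """Return a flaky version of `tool` that fails ~fail_rate of calls."""
    rng = rng or random.Random()

    def flaky(*args, **kwargs):
        roll = rng.random()
        if roll < fail_rate / 3:
            raise TimeoutError(f"{tool.__name__}: injected timeout")
        if roll < 2 * fail_rate / 3:
            raise RuntimeError(f"{tool.__name__}: injected tool failure")
        if roll < fail_rate:
            return None  # ambiguous result: no error, no data
        return tool(*args, **kwargs)

    return flaky
```

Running the eval suite with `fail_rate=1.0` on one tool at a time is the fastest way I've found to locate steps with no recovery strategy at all.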

The Recursive Problem

In "Evals Are Hypotheses," a failing eval could mean the model is bad or the eval is wrong. With agents, add two more options: the eval is correct but a tool gave bad information, or the eval is correct, the tool was fine, but the agent lost context from earlier in the trajectory and made a reasonable decision given its incomplete state. Four failure modes. Distinguishing between them requires the full trace.

This is what I mean by the problem being recursive. Your eval tests your understanding of what the agent should do. But the agent is making decisions based on its understanding of its tools and context. You're evaluating an evaluator with the same imperfect instruments. At least half the time, what looks like an agent failure turns out to be a tool returning unexpected data or my eval encoding an assumption that doesn't hold.

What I Don't Know

I don't know whether trajectory evaluation can ever be principled, or whether it will always come down to "record everything and have experienced humans review the weird ones." Process reward models and uncertainty propagation frameworks are promising, but they're research results on math problems, and I'm not sure they transfer to messy, multi-tool, real-world agents.

I don't know how to set pass/fail thresholds when variance is inherent. A 75% success rate on the same eval across 20 runs: acceptable for what kind of task? I don't have good intuitions here.
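One way to see how weak the signal is: compute how plausibly different true success rates produce that observation. A sketch with exact binomial probabilities, where the candidate rates are examples, not measurements:

```python
# Sketch: 15 passes out of 20 runs is weak evidence about the true
# pass rate. Exact binomial probability of that observation under
# several hypothetical true rates.

from math import comb

def prob_k_of_n(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials at rate p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

for true_rate in (0.60, 0.75, 0.90):
    print(f"true rate {true_rate:.0%}: "
          f"P(15/20 passes) = {prob_k_of_n(15, 20, true_rate):.3f}")
```

Agents with true rates of 60% and 90% both produce exactly 15/20 a few percent of the time, so 20 runs can't reliably separate "acceptable" from "broken" even before asking what threshold the task demands.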

I don't know whether "eval coverage" even makes sense for agents. Maybe the right model isn't coverage at all but stress testing: not "have we tested enough cases?" but "have we tested the cases that would break the system in the worst ways?"

And honestly, I don't know whether the field will converge on something principled or whether agent evaluation will remain permanently artisanal, a craft skill that depends on experience and domain knowledge in ways that resist systematization. That second possibility is the one that makes me most uncomfortable, because it implies a ceiling on how reliable agents can get without massive investment in human oversight.

Three predictions:

Prediction 1: By end of 2027, trajectory-level evaluation will be a default feature in at least two major agent development platforms. OpenAI has already built trace grading into their agent eval tools. The infrastructure is being built. 70% odds.

Prediction 2: By 2028, the most common "agent failure" category in production postmortems will be trajectory-level (composition of correct steps producing wrong outcomes), not step-level. If step-level failures remain dominant, either agents aren't being deployed in complex enough workflows or I'm wrong about where failures concentrate. 60% odds.

Prediction 3: We will not have a general-purpose automated trajectory evaluator that works across domains by 2029. Domain-specific trajectory evals will exist, but no equivalent of "unit test framework" for trajectories that you can apply without deep domain knowledge. 80% odds. If I'm wrong, it'll be because LLM-as-judge approaches are better at trajectory evaluation than I currently expect.

I keep returning to a question I can't resolve. In traditional software, we test deterministic systems with deterministic tests and achieve high confidence. In distributed systems, we learned to test non-deterministic systems with probabilistic tools and achieve reasonable confidence. Agent evaluation might need something beyond both: a way to evaluate systems that don't just behave non-deterministically but make decisions non-deterministically, where the decision space is too large to sample and the failure modes are invisible by design.

I'm not sure that tool exists yet. I'm not sure it can. But I'm building agents anyway, evaluating them with the best methods I have, and staying honest about how much I'm still guessing.