Building Better AI Evals: Lessons from the Trenches
Evaluation has quietly become the backbone of modern AI products. It's what separates a system that "looks cool in demos" from one that actually works in production. Yet, most teams I talk to underestimate evaluation — or worse, treat it as a final step rather than a continuous feedback loop.
Over the past year, I've seen the same mistakes repeated across companies building LLM applications: vague metrics that don't reflect real-world performance, over-reliance on off-the-shelf scores like ROUGE or BERTScore, and teams chasing model upgrades instead of deeply understanding their failure modes.
This post aims to offer a mental model for AI evals: what they are, why they matter, and how to build a system that's both rigorous and pragmatic. It draws on three guiding principles:
- Error analysis is the starting point of all good evals.
- Custom metrics beat generic ones every time.
- Evals are part of development, not a side project.
Why Evals Matter More Than Ever
The old way of evaluating AI — benchmarks and leaderboards — was built for research. Benchmarks like SQuAD or GLUE are static and narrow; they don't reflect the messy, ambiguous problems faced in the wild.
In production, the success of your AI system isn't defined by a benchmark score. It's defined by whether your customers get a correct answer, a safe answer, and an answer that reflects your product's voice and constraints. A model can ace MMLU and still hallucinate your refund policy or suggest a house tour for a property that's already sold.
This is why evaluation for LLM applications needs to be application-specific. A retrieval-augmented generation (RAG) pipeline, a multi-turn assistant, and an autonomous agent each need their own evaluation lens. There is no universal metric that captures "quality" across all these cases.
When teams skip custom evals, they fall into a dangerous trap: assuming that "better models" will solve their problems. But without the feedback loop of evaluation, you can't even tell if an upgrade fixed the issues that matter to your users.
The Heart of Evals: Error Analysis
If there's one practice that transforms how teams think about AI systems, it's error analysis. Instead of trying to solve evaluation from the top down — by picking a metric or building a judge model — error analysis starts from the ground up.
The process is deceptively simple:
- Collect a representative set of traces from your system (user queries, model outputs, and tool calls).
- Manually review them to identify where and how the system fails.
- Group those failures into a taxonomy.
This taxonomy becomes your evaluation blueprint. It reveals the real failure modes — not the abstract ones suggested by vendor dashboards. For a legal assistant, the failure modes might be "misstating jurisdiction," "missing deadlines," or "inconsistent tone." For a coding agent, it might be "incorrect parameter inference," "failure to recover after API error," or "looping on irrelevant commands."
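To make this concrete, here's a minimal sketch of what an annotated trace and its taxonomy can look like in code. The trace fields and failure labels are hypothetical, for illustration only; in practice the taxonomy falls out of the manual review, not the other way around.

```python
from dataclasses import dataclass, field

# Hypothetical failure taxonomy for a scheduling assistant. The labels are
# illustrative; yours should come out of manually reviewing real traces.
FAILURE_MODES = [
    "wrong_date_inferred",
    "missed_tool_call",
    "hallucinated_policy",
    "tone_off_brand",
]

@dataclass
class AnnotatedTrace:
    query: str                                          # what the user asked
    output: str                                         # what the system produced
    tool_calls: list[str] = field(default_factory=list)
    failures: list[str] = field(default_factory=list)   # subset of FAILURE_MODES

def failure_counts(traces: list[AnnotatedTrace]) -> dict[str, int]:
    """Tally how often each failure mode appears across reviewed traces."""
    counts = {mode: 0 for mode in FAILURE_MODES}
    for trace in traces:
        for failure in trace.failures:
            counts[failure] = counts.get(failure, 0) + 1
    return counts
```

Even a simple tally like this over a few dozen reviewed traces usually makes the priority order of fixes obvious.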
The key insight: you cannot design meaningful metrics until you've done error analysis. Generic metrics like "helpfulness" or "fluency" won't tell you if your agent is booking meetings for the wrong dates.
Binary Beats Likert
Most teams default to 1–5 rating scales because they feel more "granular." But in practice, Likert scales introduce noise and subjectivity: what's the difference between a 3 and a 4? Different annotators will answer differently, and even the same person's judgment can shift day to day.
Binary evaluations — pass or fail — are faster, clearer, and more consistent. They force you to answer the only question that matters: "Did this output meet the bar?"
When you need to measure gradual improvements, you can decompose quality into binary sub-checks. For instance, instead of rating factual accuracy on a scale, you might check if the output includes all required facts, one by one. Five facts, five binary checks. The signal is sharper, and the evaluation becomes more actionable.
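As a minimal sketch of that decomposition (the check names and required facts below are invented for illustration):

```python
def check_refund_answer(output: str) -> dict[str, bool]:
    """Decompose 'factual accuracy' into independent pass/fail sub-checks.
    The required facts below are hypothetical; swap in your own."""
    return {
        "mentions_30_day_window": "30 days" in output,
        "mentions_original_payment_method": "original payment method" in output.lower(),
        "no_unconditional_refund_promise": "guaranteed refund" not in output.lower(),
    }

# Aggregate: the example passes only if every sub-check passes.
result = check_refund_answer(
    "Refunds are issued to the original payment method within 30 days."
)
passed = all(result.values())
```

Each sub-check is individually debuggable, which is exactly what a single 1-5 score can't give you.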
RAG, Retrieval, and Context
There's been a lot of noise lately about "RAG being dead." This stems from a misunderstanding of what RAG actually is. Retrieval-Augmented Generation isn't about vector databases or embeddings per se; it's about getting the model the right context to produce a good answer.
What's really "dead" is naive RAG — blindly stuffing the top-k chunks from a vector store into your prompt and hoping for the best. Code assistants and advanced agents have shown that smarter retrieval strategies — like multi-hop search, agentic exploration, or hybrid retrieval — outperform brute-force approaches.
When evaluating RAG, it helps to separate the problem into two layers:
- Retrieval: Are we surfacing the right documents? (Recall@k, Precision@k, MRR)
- Generation: Given that context, is the answer faithful, accurate, and relevant?
Treat retrieval like an information retrieval (IR) problem, with proper metrics. Then, evaluate generation as you would any LLM task — through error analysis and custom judges.
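The retrieval-side metrics are simple enough to implement directly; here's a small sketch (the function and variable names are my own):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0.0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

Scoring retrieval separately tells you whether a bad answer was a retrieval miss or a generation failure, which are fixed in completely different ways.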
Guardrails vs. Evaluators
A common misconception is that evaluators and guardrails are interchangeable. They're not.
- Guardrails sit in the request/response path, blocking unsafe or malformed outputs before they reach the user. Think regex filters for PII, profanity checks, or schema validation. They need to be deterministic and fast, with extremely low false positive rates.
- Evaluators run after the fact. They're how you measure quality, diagnose failures, and monitor regressions. They can afford to be slower and more probabilistic, often leveraging LLM-as-judge setups.
A healthy system has both: lightweight guardrails for safety, and evaluators for improvement. Without evaluators, you're blind to your model's weaknesses. Without guardrails, you're leaving users exposed to high-impact failures.
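To make the split concrete, here's a rough sketch: a deterministic guardrail that runs inline, next to an LLM-as-judge evaluator that scores traces after the fact. The PII pattern, the judge prompt, and the `llm_call` parameter are placeholders, not a recommended implementation.

```python
import re

# Guardrail: deterministic, fast, runs in the request/response path.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # illustrative PII check

def guardrail_blocks(output: str) -> bool:
    """Return True if the response must be blocked before reaching the user."""
    return bool(SSN_PATTERN.search(output))

# Evaluator: probabilistic, runs asynchronously on logged traces.
def judge_faithfulness(question: str, context: str, answer: str, llm_call) -> bool:
    """Binary LLM-as-judge. `llm_call` is a stand-in for whatever client
    function you already use to call a model (hypothetical here)."""
    prompt = (
        "Does the answer only make claims supported by the context? "
        f"Question: {question}\nContext: {context}\nAnswer: {answer}\n"
        "Reply with exactly PASS or FAIL."
    )
    return llm_call(prompt).strip().upper() == "PASS"
```

Note the asymmetry: the guardrail has to be cheap enough to run on every request, while the judge only needs to run on the traces you sample for evaluation.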
The Cost of Evaluation
Evaluation isn't a separate "project." It's a core part of building AI systems, just like debugging is for traditional software. In most high-performing teams, 60–80% of development time goes to error analysis and evaluation. That's not a bug — it's the work.
The temptation to automate everything early is strong. But premature automation often hides more than it reveals. Automated LLM judges can't tell you why your system fails unless you've first done the human work of categorizing failure modes. Start with 20–50 examples manually reviewed by a domain expert. Build simple assertions or tests for obvious issues. Only then should you invest in heavier evaluators.
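Those first assertions can be as plain as a parametrized test. A sketch, with invented cases and a hypothetical `run_assistant` entry point standing in for your system:

```python
import pytest

def run_assistant(query: str) -> str:
    """Placeholder: wire this to your actual application (hypothetical entry point)."""
    raise NotImplementedError("call your system here")

CASES = [
    # (user query, substring that must appear, substring that must not appear)
    ("How do I cancel my subscription?", "account settings", "phone support only"),
    ("What is your refund window?", "30 days", "no refunds"),
]

@pytest.mark.parametrize("query,must_include,must_exclude", CASES)
def test_known_failure_modes(query, must_include, must_exclude):
    output = run_assistant(query)
    assert must_include in output
    assert must_exclude not in output
```

Crude as they look, checks like these catch regressions the moment a prompt or model change reintroduces a failure you've already triaged.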
From CI to Production Monitoring
Evaluation isn't static. It evolves as your system moves from development to production.
- In CI/CD, you run curated tests — 100 or so carefully chosen cases that represent your core features and known edge cases.
- In production, you sample live traces and score them asynchronously. Here, the focus shifts from gating correctness to monitoring trends: Are failure rates spiking? Are we seeing new classes of errors?
The feedback loop between the two is crucial. When you find a new failure mode in production, add an example to your CI dataset. This prevents regressions and keeps your tests grounded in real-world data.
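One lightweight way to wire that loop is to treat the CI dataset as an append-only file that production triage writes into. A sketch under that assumption (the file path, format, and field names are my own):

```python
import json
import random
from pathlib import Path

CI_DATASET = Path("eval_cases.jsonl")  # hypothetical location of the CI suite

def sample_production_traces(traces: list[dict], rate: float = 0.05) -> list[dict]:
    """Randomly sample a fraction of live traces for asynchronous scoring."""
    return [t for t in traces if random.random() < rate]

def promote_to_ci(trace: dict, failure_mode: str) -> None:
    """When triage confirms a new failure, append it to the CI dataset so
    the regression is caught in every future run."""
    case = {
        "input": trace["input"],
        "expected_behavior": trace.get("expected_behavior", ""),
        "failure_mode": failure_mode,
    }
    with CI_DATASET.open("a") as f:
        f.write(json.dumps(case) + "\n")
```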
The Future of AI Evals
Most evaluation tools today focus on dashboards and generic scores. But the real innovation will come from AI-assisted evaluation itself.
Imagine a pipeline where an LLM clusters failure patterns, suggests test cases, and even drafts prompt modifications based on observed weaknesses. We're already seeing glimpses of this — error analysis assisted by semantic search, clustering, and automated tagging — but the tooling is still immature.
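The clustering half of that idea is already within reach. A rough sketch, assuming you have some embedding function for your failure notes (`embed` below is a stand-in for whatever model you use):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_failures(notes: list[str], embed, n_clusters: int = 5) -> dict[int, list[str]]:
    """Group free-text failure notes into rough clusters for review.
    `embed` is a stand-in: any function mapping a string to a vector."""
    vectors = np.array([embed(note) for note in notes])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    clusters: dict[int, list[str]] = {}
    for label, note in zip(labels, notes):
        clusters.setdefault(int(label), []).append(note)
    return clusters
```

The clusters aren't a taxonomy by themselves, but they give a human reviewer a much better starting point than a flat list of traces.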
The teams that master evaluation will have an outsized advantage. Why? Because LLM performance is no longer constrained by model quality alone. It's constrained by how well we can observe, debug, and shape these models into reliable systems.
Closing Thoughts
The best eval setups are deceptively simple: a domain expert reviewing traces, a small set of pass/fail checks, and a feedback loop that turns failures into test cases. Over time, this grows into a living evaluation system — one that's tightly aligned with your product's needs, not with whatever metrics happen to be trending on GitHub.
If you take away one thing from this post, let it be this: Evaluation isn't an afterthought. It's the work.