Evaluating Agents Is a Different Problem
Every component eval passed. The agent pipeline still failed. What changes when you move from evaluating single calls to evaluating trajectories.
LLMs pass the bar exam but can't count letters. The failures aren't random. They're architectural fingerprints, and understanding them changes how you use these systems.
The hardest part of reinforcement learning isn't the algorithm. It's knowing what you actually want, and whether that can even be formalized.
Watching AI tools solve the same problem in radically different ways reveals something about their cognitive architecture, and about the nature of problem-solving itself.
My job quietly changed underneath me. I used to write code. Now I write instructions for systems that write code. Tracing a 50-year abstraction trend to understand what engineering is becoming.
What it's like to build AI systems that automate your own skills, and the quiet fear that doesn't fit neatly into hype or doom.
AI is collapsing the cost of building software, and with it, the value of technical advantages. If building is cheap, distribution (the ability to reach and retain people) becomes the only defensible position. An engineer wrestles with what that means.
Most of building AI agents is debugging JSON. But sometimes you remember what you're actually building, and the gap between those two realities is where the real questions live.
LLMs don't remember anything, yet agents built on them seem to learn and retain information. The gap between these facts reveals something surprising about what memory actually is.
As AI handles more code generation, the human skill shifts from creation to curation. What is engineering taste, why can't AI have it, and what does that mean for us?
Children forget everything yet learn faster than any AI. What do they reveal about learning that we've failed to capture in our models?
Most AI engineers build evals like unit tests. That's why they fail. What changes when you treat evals as hypotheses about what matters instead of tests of model quality.
As the agentic ecosystem matures, tools are no longer scarce. They're everywhere. The hard part now isn't wiring up tools; it's helping models discover which ones to use.
In the past year, agent architectures have gone from niche experiments to front-page product strategies. But one area remains dramatically under-discussed: context engineering.
Evaluation has quietly become the backbone of modern AI products. It's what separates a system that 'looks cool in demos' from one that actually works.
Over the next 10 years, the GenAI landscape won't be shaped by prompt hacks or viral demos. It will be defined by who builds the infrastructure, systems, safety nets, and experiences that actually ship and scale.
A deep dive into how companies are actually using large language models in production, from GitHub Copilot writing 46% of code to enterprises struggling with 27% hallucination rates.
An in-depth analysis of the LLM ecosystem in May 2023, from Geoffrey Hinton's dramatic Google exit to the $50 billion funding frenzy reshaping Silicon Valley's power structure.
This is the first of a multi-part series exploring exciting new developments in AI. A deep dive into the models that power ChatGPT.