Evaluating Agents Is a Different Problem
Every component eval passed. The agent pipeline still failed. What changes when you move from evaluating single calls to evaluating trajectories.