Scaling Automated Error Analysis
Learn practical methods for scaling error analysis in complex agents, including context engineering and task decomposition, and see which approaches fail.
Error analysis is known to be the highest-ROI step in building reliable agents. However, as agent and task complexity grow, manual error analysis becomes prohibitively expensive, and naive ways of automating it remain unsatisfying. How should developers scale up this key step in agent evaluation?
Here, I’ll present our recent findings on what works, and what doesn’t, when automating error analysis. TL;DR: with a bit of intentional context engineering and task decomposition, we can do a lot better than stuffing everything into an LLM.
The automated approach combines span-level LLM critiques with Hungarian cluster matching.
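The abstract does not spell out how Hungarian cluster matching is used. One plausible reading is that automatically produced error clusters are paired one-to-one with reference clusters (for example, from manual annotation) by solving an assignment problem. Below is a minimal Python sketch under that assumption. The cluster names, the Jaccard-based cost, and the functions `hungarian`, `jaccard`, and `match_clusters` are hypothetical illustrations, not the speaker's implementation.

```python
from typing import Dict, List, Set, Tuple


def hungarian(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Minimum-cost assignment (Hungarian / Kuhn-Munkres, potentials form).

    `cost` is an n x m matrix with n <= m. Returns (row, col) pairs that
    assign every row to a distinct column with minimal total cost.
    """
    n = len(cost)
    m = len(cost[0]) if n else 0
    if n > m:
        raise ValueError("expected no more rows than columns")
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (m + 1)   # column potentials (1-based)
    p = [0] * (m + 1)     # p[j] = row matched to column j, 0 if free
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:      # reached a free column: path is complete
                break
        while j0 != 0:          # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap between two clusters of trace IDs, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_clusters(
    predicted: Dict[str, Set[int]], reference: Dict[str, Set[int]]
) -> Dict[str, str]:
    """Pair each predicted cluster with one reference cluster.

    Assumption: cost = 1 - Jaccard overlap, so minimizing total cost
    maximizes total overlap. Needs len(predicted) <= len(reference).
    """
    pred_names, ref_names = list(predicted), list(reference)
    cost = [
        [1.0 - jaccard(predicted[p], reference[r]) for r in ref_names]
        for p in pred_names
    ]
    return {pred_names[r]: ref_names[c] for r, c in hungarian(cost)}


# Hypothetical clusters of failing trace IDs.
predicted = {"p0": {1, 2, 3}, "p1": {4, 5}, "p2": {6, 7, 8}}
reference = {"r0": {4, 5, 9}, "r1": {1, 2}, "r2": {6, 7, 8}}
matching = match_clusters(predicted, reference)
print(matching)  # {'p0': 'r1', 'p1': 'r0', 'p2': 'r2'}
```

With a matching in hand, per-pair overlap gives a simple agreement score between automated and reference clusters. In practice, `scipy.optimize.linear_sum_assignment` solves the same assignment problem and also handles rectangular matrices in either orientation.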