Session
Beyond Vibes: Rigorous Evals for Your AI Coding Agent
Developers who use AI coding agents every day are sitting on a treasure trove of data most never examine: the sessions and traces from every coding agent interaction. These tell you not just whether an agent solved a problem, but how: how many tool calls it took, where it got stuck, when it backtracked, and what it wasted tokens on.
Most teams treat a passing test as success. But a clean 3-step solve and a 40-step flailing recovery that happened to pass look identical from the outside, while their costs in time, tokens, and trust differ enormously.
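To make that contrast measurable, here is a minimal sketch of deriving per-session efficiency metrics from a trace. The TraceEvent schema is a simplified assumption; each tool logs sessions in its own format, so a real pipeline would adapt the parsing accordingly.

```python
from dataclasses import dataclass

# Hypothetical trace event. Real session logs (e.g. Claude Code's JSONL
# files) differ in shape, but most record an event type and token usage.
@dataclass
class TraceEvent:
    kind: str               # "tool_call", "assistant", "user", ...
    tokens: int = 0         # tokens consumed by this event
    is_retry: bool = False  # event repeats a previously failed action

def session_metrics(events: list[TraceEvent]) -> dict:
    """Summarize how a session solved the task, not just whether it did."""
    return {
        "steps": len(events),
        "tool_calls": sum(1 for e in events if e.kind == "tool_call"),
        "retries": sum(e.is_retry for e in events),  # proxy for backtracking
        "total_tokens": sum(e.tokens for e in events),
    }

# A 3-step clean solve and a 40-step flailing run may both "pass",
# but their metrics diverge sharply:
clean = [TraceEvent("tool_call", 500), TraceEvent("tool_call", 400),
         TraceEvent("assistant", 200)]
print(session_metrics(clean))
# {'steps': 3, 'tool_calls': 2, 'retries': 0, 'total_tokens': 1100}
```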
This talk shows you how to harness session data from tools like Claude Code, Codex, OpenCode, and pi to actually measure and improve agent behaviour. We'll cover building offline evals from real session traces, setting up production eval tracing to catch regressions, and identifying the specific levers that move the needle. You'll leave with a practical framework to multiply the efficiency of your AI-driven development by helping your coding agent make better decisions, not just correct ones.
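As one illustration of the offline-eval idea, here is a sketch that replays stored traces and enforces efficiency budgets, building on session_metrics above. The JSONL layout, the load_session helper, and the budget values are all hypothetical; the point is that a passing-but-flailing run can fail the eval.

```python
import json
from pathlib import Path

# Assumed budgets; a real eval suite would tune these per task.
BUDGETS = {"tool_calls": 10, "total_tokens": 20_000}

def load_session(path: Path) -> list[TraceEvent]:
    """Parse one stored session trace (assumed: JSONL, one event per line)."""
    return [
        TraceEvent(raw.get("kind", ""), raw.get("tokens", 0),
                   raw.get("is_retry", False))
        for raw in map(json.loads, path.read_text().splitlines())
    ]

def check_session(path: Path) -> list[str]:
    """Return budget violations; an empty list means the run was efficient."""
    metrics = session_metrics(load_session(path))
    return [f"{key}={metrics[key]} exceeds budget {limit}"
            for key, limit in BUDGETS.items() if metrics[key] > limit]
```

Run over every stored session in CI, a check like this catches efficiency regressions even when the functional tests still pass.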