Session
Beyond Vibe Checks: Evals for Reliable AI (and Agents)
AI that can act (call tools, run code, hit APIs) is powerful—and risky. Vibe-checking outputs doesn’t scale when agents plan, execute, and adapt on their own. In this talk, we replace guesswork with evaluations: the crash tests for agent behavior. You’ll learn a practical toolkit of code-based checks (rules and invariants), human reviews (gold-standard sampling), and model-graded evals (scalable judges with guardrails). We’ll run a live demo: define failure modes, write semantic unit tests, iterate prompts/policies, and wire the evals into CI so regressions get caught before users do. You’ll leave with a small template you can drop into your agent stacks to measure reliability, cut costly loops, and keep actions safe—moving from “it seems fine” to “we have proof.”
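To give a flavor of what a code-based check can look like, here is a minimal sketch (not the speaker's template): an invariant check over an agent's tool-call trace, paired with an assertion-style test that a CI job could run. The trace format, tool names, and limits are hypothetical.

```python
# Minimal sketch of a code-based eval: check an agent's tool-call trace
# against simple invariants. Trace format and tool names are hypothetical.
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


MAX_STEPS = 10                        # guard against runaway cost loops
FORBIDDEN_TOOLS = {"delete_account"}  # actions the agent must never take


def check_trace(trace: list[ToolCall]) -> list[str]:
    """Return a list of invariant violations (empty list means pass)."""
    failures = []
    if len(trace) > MAX_STEPS:
        failures.append(f"too many steps: {len(trace)} > {MAX_STEPS}")
    for call in trace:
        if call.name in FORBIDDEN_TOOLS:
            failures.append(f"forbidden tool called: {call.name}")
    return failures


# Example "semantic unit test" a CI pipeline could run on every change.
def test_refund_flow_stays_safe():
    trace = [
        ToolCall("lookup_order", {"id": "123"}),
        ToolCall("issue_refund", {"id": "123", "amount": 10}),
    ]
    assert check_trace(trace) == []


if __name__ == "__main__":
    test_refund_flow_stays_safe()
    print("all invariant checks passed")
```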

Ron Dagdag
Microsoft AI MVP and R&D Manager @ 7-Eleven
Fort Worth, Texas, United States