Session
Beyond Vibe Checks: Evals for Reliable AI (and Agents)
AI that can act (call tools, run code, hit APIs) is powerful—and risky. Vibe-checking outputs doesn’t scale when agents plan, execute, and adapt on their own. In this talk, we replace guesswork with evaluations: the crash tests for agent behavior. You’ll learn a practical toolkit of code-based checks (rules and invariants), human reviews (gold-standard sampling), and model-graded evals (scalable judges with guardrails). We’ll run a live demo: define failure modes, write semantic unit tests, iterate prompts/policies, and wire the evals into CI so regressions get caught before users do. You’ll leave with a small template you can drop into your agent stacks to measure reliability, cut costly loops, and keep actions safe—moving from “it seems fine” to “we have proof.”
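To give a flavor of what a code-based check can look like, here is a minimal sketch (not the speaker's template): an invariant check over an agent's tool-call trace, paired with an assertion-style test that a CI job could run. The trace format, tool names, and limits are hypothetical.

```python
# Minimal sketch of a code-based eval: check an agent's tool-call trace
# against simple invariants. Trace format and tool names are hypothetical.
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


MAX_STEPS = 10                        # guard against runaway cost loops
FORBIDDEN_TOOLS = {"delete_account"}  # actions the agent must never take


def check_trace(trace: list[ToolCall]) -> list[str]:
    """Return a list of invariant violations (empty list means pass)."""
    failures = []
    if len(trace) > MAX_STEPS:
        failures.append(f"too many steps: {len(trace)} > {MAX_STEPS}")
    for call in trace:
        if call.name in FORBIDDEN_TOOLS:
            failures.append(f"forbidden tool called: {call.name}")
    return failures


# Example "semantic unit test" a CI pipeline could run on every change.
def test_refund_flow_stays_safe():
    trace = [
        ToolCall("lookup_order", {"id": "123"}),
        ToolCall("issue_refund", {"id": "123", "amount": 10}),
    ]
    assert check_trace(trace) == []


if __name__ == "__main__":
    test_refund_flow_stays_safe()
    print("all invariant checks passed")
```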

Ron Dagdag
Microsoft AI MVP and R&D Manager @ 7-Eleven
Fort Worth, Texas, United States