Session
Beyond Vibe Checks: Evals for Reliable Agents
Your agent demo worked. Then it shipped. Hallucinations don't throw exceptions. Prompt drift doesn't break builds. Prompt injection doesn't trigger your alerts. Teams end up shipping "it seems fine" — because they have no evals, just instincts.
Agents that act — call tools, run code, hit APIs — fail in ways that look like success right until they don't. This session applies the testing discipline engineers already know to the one system that needs it most: model behavior.
You'll get a practical crash-test toolkit: code-based invariant checks, human gold-standard sampling for edge cases, and model-graded evals that scale. Live demo: define failure modes for a tool-calling agent, write semantic unit tests, iterate on prompts based on results, and wire everything into CI so a quality drop kills the PR — not the product.
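As a rough illustration of what a code-based invariant check for a tool-calling agent might look like, here is a minimal Python sketch. It is not taken from the session: the `ToolCall` record, the tool names, and the refund limit are all hypothetical, chosen only to show how deterministic rules can be asserted over an agent's tool-call trace.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One step in an agent trace (hypothetical shape, for illustration)."""
    name: str
    args: dict = field(default_factory=dict)


# Hypothetical policy: the tools the agent may call and a refund cap.
ALLOWED_TOOLS = {"search_docs", "get_order", "refund_order"}
MAX_REFUND = 100.0


def check_invariants(trace: list[ToolCall]) -> list[str]:
    """Return the violated invariants; an empty list means the trace passes."""
    violations = []
    seen_lookup = False
    for i, call in enumerate(trace):
        # Invariant 1: only allow-listed tools may be called.
        if call.name not in ALLOWED_TOOLS:
            violations.append(f"step {i}: unknown tool {call.name!r}")
        if call.name == "get_order":
            seen_lookup = True
        if call.name == "refund_order":
            # Invariant 2: a refund must be preceded by an order lookup.
            if not seen_lookup:
                violations.append(f"step {i}: refund before order lookup")
            # Invariant 3: a refund may not exceed the configured cap.
            if call.args.get("amount", 0) > MAX_REFUND:
                violations.append(f"step {i}: refund exceeds limit")
    return violations
```

Checks like these need no model in the loop, so they are cheap enough to run on every recorded trace; the fuzzier failure modes are left to human sampling and model-graded evals.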
What you'll learn:
- Distinguish benchmarks from app-level evals
- Design a golden dataset for your own agent
- Implement code-based, human, and model-graded checks
- Integrate evals into CI/CD pipelines
- Move from "it seems fine" to "we have proof"
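To make the golden-dataset and CI ideas concrete, one possible shape is sketched below in Python. Everything here is an assumption for illustration rather than the speaker's setup: the `GOLDEN` cases, the substring-based pass criterion, the 0.9 threshold, and `stub_agent`, which stands in for a real agent call.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical golden dataset: inputs paired with phrases the answer must contain.
GOLDEN = [
    {"input": "What is the refund window?", "must_include": ["30 days"]},
    {"input": "Which tool looks up an order?", "must_include": ["get_order"]},
]

# Canned answers so the sketch runs without a model; replace with a real agent.
CANNED = {
    "What is the refund window?": "Refunds are accepted within 30 days of purchase.",
    "Which tool looks up an order?": "Use the get_order tool.",
}


def stub_agent(prompt: str) -> str:
    """Stand-in for the agent under test (assumption, not a real API)."""
    return CANNED.get(prompt, "")


def pass_rate(agent: Callable[[str], str], cases: list[dict]) -> float:
    """Fraction of cases whose output contains every required phrase."""
    if not cases:
        raise ValueError("golden dataset is empty")
    passed = 0
    for case in cases:
        output = agent(case["input"]).lower()
        if all(phrase.lower() in output for phrase in case["must_include"]):
            passed += 1
    return passed / len(cases)


def gate(agent: Callable[[str], str], cases: list[dict], threshold: float = 0.9) -> int:
    """Return a process exit code: 0 if quality holds, 1 if it dropped."""
    rate = pass_rate(agent, cases)
    print(f"pass rate: {rate:.2%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1
```

In a CI job, a script would typically end with `sys.exit(gate(agent, GOLDEN))`, so a nonzero exit code fails the pipeline and blocks the PR when the pass rate falls below the threshold.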
Ron Dagdag
Microsoft AI MVP and Research Engineering Manager @ Thomson Reuters
Fort Worth, Texas, United States