Session

From Vibes to Verification: Testing AI Agents Before Production

AI agent demos are easy to make impressive and hard to make trustworthy. Once an agent can plan steps, call tools, retrieve data, and act on behalf of a user, “the answer looked good” is no longer a sufficient test strategy. This session shows engineers and engineering leaders how to move from vibe checks to repeatable verification before putting agents into production.

We will use a triage agent as the running example. The demo starts with a working agent that classifies tickets, looks up policy, drafts a response, and escalates risky cases. Then we break it with ambiguous instructions, bad tool output, missing context, and prompt-injection attempts. From there, we add task contracts, scenario tests, adversarial cases, trace review, tool-call assertions, and release gates.

Attendees will learn how to explain the difference between LLM evals and agent evals, compare unit, scenario, and red-team tests for agent workflows, implement a small evaluation suite around tool use and decision quality, evaluate logs and traces for production readiness, and troubleshoot common failure modes before customers find them.

The goal is not to promise perfect agents. It is to show a practical path for shipping agentic systems with evidence, boundaries, and confidence.

Ron Dagdag

Microsoft AI MVP and Research Engineering Manager @ Thomson Reuters

Fort Worth, Texas, United States

Actions

Please note that Sessionize is not responsible for the accuracy or validity of the data provided by speakers. If you suspect this profile to be fake or spam, please let us know.

Jump to top