From Vibes to Proof: Crash-Testing Tool-Using AI in TypeScript

AI that can act (call tools, run code, hit APIs) is powerful, and risky. Vibe-checking outputs doesn’t scale once agents plan, execute, and adapt. In this talk, we replace guesswork with evaluations: the crash tests for agent behavior. You’ll get a practical toolkit you can ship this week: code-based checks (rules and invariants in TypeScript), human reviews (gold-standard samples), and model-graded evals (scalable LLM judges with guardrails).
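
To make the first tier concrete, here is a minimal sketch of a code-based safety invariant over an agent’s tool-call trace. The ToolCall shape, the tool names, and violatesSafetyInvariant are all hypothetical illustrations, not the talk’s actual template:

    // A recorded tool call from an agent trace (illustrative shape).
    interface ToolCall {
      name: string;
      args: Record<string, unknown>;
    }

    // Invariant: the agent must never invoke a destructive tool
    // unless a confirmation step appeared earlier in the same trace.
    function violatesSafetyInvariant(trace: ToolCall[]): boolean {
      const destructive = new Set(["deleteRecord", "refundPayment"]);
      let confirmed = false;
      for (const call of trace) {
        if (call.name === "confirmWithUser") confirmed = true;
        if (destructive.has(call.name) && !confirmed) return true;
      }
      return false;
    }

Checks like this are cheap and deterministic, so they can run on every trace, which is why rules and invariants sit in the first tier of the toolkit.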
Live demo: we define failure modes, write semantic unit tests around a tool-calling agent, iterate on prompts and policies, and wire the evals into CI so regressions break the build, not production. We’ll show how to capture metrics (accuracy, safety violations, cost, and latency), set thresholds, and visualize drift.
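
A minimal sketch of what such a semantic unit test and CI gate can look like; runAgent, judge, and the 0.9 threshold are stand-ins for illustration, assuming a Node.js runtime:

    import process from "node:process";

    interface EvalCase { input: string; rubric: string; }

    // Stand-ins: the agent under test and a model-graded judge
    // that scores an output against a rubric (hypothetical here).
    declare function runAgent(input: string): Promise<string>;
    declare function judge(output: string, rubric: string): Promise<boolean>;

    async function runEvals(cases: EvalCase[]): Promise<void> {
      let passes = 0;
      for (const c of cases) {
        const output = await runAgent(c.input);      // run the agent
        if (await judge(output, c.rubric)) passes++; // semantic check
      }
      const passRate = passes / cases.length;
      console.log(`pass rate: ${(passRate * 100).toFixed(1)}%`);
      // Threshold gate: a regression fails the CI job, not production.
      if (passRate < 0.9) process.exit(1);
    }

Because the harness exits non-zero when the pass rate drops below the threshold, any standard CI runner treats an eval regression as a failed build.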
You’ll leave with a small TypeScript template you can drop into your stack to measure reliability, cut costly loops, and keep actions safe, moving from “it seems fine” to “we have proof.”

Ron Dagdag

Microsoft AI MVP and R&D Manager @ 7-Eleven

Fort Worth, Texas, United States
