Session

Your AI Doesn't Crash — It Just Lies

Your AI doesn't crash. It doesn't throw a stack trace or fail a build. It just quietly tells your users the wrong thing — with perfect confidence, polished grammar, and zero remorse. This is the new class of production bug: the kind that looks like success right up until it isn't.
Hallucinations, grounding failures, and prompt injection can't be caught by traditional testing. This session gives engineers a practical evaluation framework to close that gap. Starting from a knowledge assistant that passes all functional tests but quietly fails in production, we'll expose three real failure modes and turn each one into a repeatable eval check.
The demo covers golden dataset construction; pass/fail, rubric, and pairwise scoring patterns; and wiring evals into CI so a quality drop kills the PR before it ships. Tools include DeepEval, RAGAS, and eval patterns in pytest. No PhD required — if you've written a unit test, you're already halfway there.
You'll leave with a four-phase eval strategy and a working template you can drop into your stack this week. Because the question isn't whether your AI will lie. It's whether you catch it first.
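To make the "if you've written a unit test, you're already halfway there" claim concrete, here is a minimal sketch of a golden-dataset, pass/fail check written as a plain pytest test. The questions, expected facts, and the `ask_assistant` hook are illustrative placeholders, not part of the session materials:

```python
# Minimal sketch: golden-dataset, pass/fail eval as an ordinary pytest test.
# `ask_assistant` is a hypothetical stand-in for your application's entry point.
import pytest

# Golden dataset: a small, human-reviewed set of questions with facts the
# answer must (and must not) contain.
GOLDEN_SET = [
    {
        "question": "What is the refund window for annual plans?",
        "must_contain": ["30 days"],        # grounded facts the answer must state
        "must_not_contain": ["60 days"],    # a known hallucination to guard against
    },
    {
        "question": "Which regions does the EU data-residency option cover?",
        "must_contain": ["EU"],
        "must_not_contain": ["US-East"],
    },
]


def ask_assistant(question: str) -> str:
    """Hypothetical hook: replace with a call into your assistant/RAG stack."""
    raise NotImplementedError("wire this to your application")


@pytest.mark.parametrize("case", GOLDEN_SET, ids=lambda c: c["question"][:40])
def test_answer_is_grounded(case):
    answer = ask_assistant(case["question"]).lower()
    for fact in case["must_contain"]:
        assert fact.lower() in answer, f"missing grounded fact: {fact}"
    for bad in case["must_not_contain"]:
        assert bad.lower() not in answer, f"hallucinated content: {bad}"
```

Because it is just pytest, the same suite can run locally with `pytest -q` or as a step in an existing CI job.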
Learning Outcomes
- Explain the difference between benchmarks and application-level evals
- Implement a lightweight golden dataset and scoring rubric for a real AI feature
- Compare pass/fail, rubric, and pairwise scoring patterns and choose the right one
- Wire eval checks into a CI/CD pipeline so quality regressions break the build (see the sketch after this list)
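For the rubric-scoring and CI points above, one common pattern is to score each answer against a short rubric and fail the test run, and therefore the PR, when the aggregate dips below a threshold. The sketch below assumes a hypothetical `judge_answer` function, which in practice might be backed by an LLM-as-judge call through an eval library such as DeepEval or RAGAS; the rubric text, cases, and the 4.0 threshold are illustrative assumptions:

```python
# Sketch: rubric scoring aggregated into a single CI quality gate.
# `judge_answer` is a hypothetical judge call returning a 1-5 rubric score.
from statistics import mean

RUBRIC = (
    "5 = fully grounded in retrieved context, directly answers the question\n"
    "3 = partially grounded, minor unsupported claims\n"
    "1 = contradicts the context or invents facts"
)

EVAL_CASES = [
    # In practice these answers are generated fresh on every run.
    {"question": "placeholder question", "answer": "placeholder answer"},
]


def judge_answer(question: str, answer: str, rubric: str) -> int:
    """Hypothetical hook: call your judge model / eval library here."""
    raise NotImplementedError


def test_quality_gate():
    # Fail the build if the average rubric score regresses below the bar.
    scores = [judge_answer(c["question"], c["answer"], RUBRIC) for c in EVAL_CASES]
    assert mean(scores) >= 4.0, f"rubric average {mean(scores):.2f} fell below 4.0"
```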

Ron Dagdag

Microsoft AI MVP and Research Engineering Manager @ Thomson Reuters

Fort Worth, Texas, United States
