Session

How to Test AI Features Before Your Users Do

LLM-powered features often succeed in demos and still fail in production. A prompt change, model upgrade, or retrieval tweak can quietly reduce answer quality, break grounding, or introduce risky behavior. This talk shows engineers and tech leads how to build a practical eval loop for AI features using the same mindset they already bring to testing software.

Using an anonymized internal knowledge assistant as the running example, we will start with a small app that appears to work, then expose three realistic failure modes: unsupported claims, missed key information, and brittle behavior after a change. From there, we will build a lightweight eval suite that turns those failures into repeatable checks. The demo will cover a compact golden dataset, simple scoring patterns, and a regression workflow for prompt, model, and retrieval changes.
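For a concrete picture of the pattern before the session, here is a minimal sketch of a golden-dataset eval with pass/fail scoring, in Python. It is illustrative only and not the talk's code: `ask_assistant` is a hypothetical stand-in for the knowledge assistant under test, and the dataset entries are invented examples.

```python
# Minimal golden-dataset eval sketch (illustrative, not the session's implementation).
# ask_assistant() is a hypothetical placeholder for the app: prompt + retrieval + model call.

GOLDEN_SET = [
    {
        "question": "How many vacation days do new hires get?",
        "must_contain": ["15 days"],        # key information the answer must include
        "must_not_contain": ["unlimited"],  # an unsupported claim seen in a past failure
    },
    {
        "question": "Who approves travel expenses over the limit?",
        "must_contain": ["vp approval"],
        "must_not_contain": [],
    },
]

def ask_assistant(question: str) -> str:
    """Placeholder for the real feature under test."""
    raise NotImplementedError

def passes(answer: str, case: dict) -> bool:
    """Pass/fail scoring: required facts present, known-bad claims absent."""
    text = answer.lower()
    has_required = all(s.lower() in text for s in case["must_contain"])
    has_forbidden = any(s.lower() in text for s in case["must_not_contain"])
    return has_required and not has_forbidden

def run_suite() -> None:
    """Run every golden case; fail the build if any regress."""
    failures = [c["question"] for c in GOLDEN_SET if not passes(ask_assistant(c["question"]), c)]
    print(f"{len(GOLDEN_SET) - len(failures)}/{len(GOLDEN_SET)} cases passed")
    if failures:
        raise SystemExit(f"Regressions in: {failures}")

if __name__ == "__main__":
    run_suite()
```

Run against every prompt, model, or retrieval change, a suite like this turns the three failure modes above into checks that catch regressions before users do.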

Attendees will leave able to explain the difference between benchmarks and app evals; compare pass/fail, rubric, and pairwise scoring; implement a small but useful eval dataset; and troubleshoot whether failures come from retrieval, generation, or overall system design.

Ron Dagdag

Microsoft AI MVP and Research Engineering Manager @ Thomson Reuters

Fort Worth, Texas, United States
