
AI POCs Are Easy. Production Is Hard. Evaluation Closes the Gap.

Generative AI prototypes are easy. With a few prompts and a model endpoint, teams can create impressive demos in minutes. But once these systems meet real users and real data, the cracks appear: retrieval pipelines drift, responses hallucinate, costs and latency fluctuate, and agent workflows take unexpected paths.

The gap between a compelling POC and a reliable production system is rarely the model. It’s the absence of systematic evaluation.

This session introduces Evaluation-Driven Development as a practical engineering discipline for production AI systems. Using tools like Microsoft's Azure AI Evaluation SDK and Azure AI Foundry, we’ll explore how developers can instrument AI applications with automated evaluators to measure quality and safety.
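As a preview, here is a minimal sketch of what that instrumentation can look like, assuming the azure-ai-evaluation Python package; the endpoint, deployment, and dataset file name are placeholder assumptions, not part of the session materials:

```python
# Minimal sketch: score a JSONL dataset with two built-in quality
# evaluators from the azure-ai-evaluation package (pip install
# azure-ai-evaluation). All credentials and names are placeholders.
from azure.ai.evaluation import evaluate, GroundednessEvaluator, RelevanceEvaluator

# Model used by the LLM-judge evaluators (placeholder values).
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4o",
}

# Each dataset row is expected to carry query/context/response fields.
result = evaluate(
    data="eval_dataset.jsonl",
    evaluators={
        "groundedness": GroundednessEvaluator(model_config),
        "relevance": RelevanceEvaluator(model_config),
    },
)
print(result["metrics"])  # aggregate scores across the dataset
```

Because every row is scored, regressions in groundedness or relevance show up as numbers rather than anecdotes.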

From there, we’ll examine how evaluation applies across modern AI architectures, including RAG pipelines, tool-calling agents, and multi-step reasoning workflows. You’ll see how to design evaluation datasets, run automated evaluation pipelines, and integrate these checks into CI/CD so that changes to prompts, retrieval, or orchestration are validated before reaching production.
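One hypothetical way to wire this into CI/CD is a gate step that re-runs the evaluation on a fixed dataset and fails the build when an aggregate score regresses; the metric key, threshold, and file names below are illustrative assumptions:

```python
# Hypothetical CI gate: block the release when average groundedness
# falls below a team-chosen threshold. Assumes the azure-ai-evaluation
# package; metric key, threshold, and names are illustrative.
import sys
from azure.ai.evaluation import evaluate, GroundednessEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4o",
}

result = evaluate(
    data="eval_dataset.jsonl",
    evaluators={"groundedness": GroundednessEvaluator(model_config)},
)

# Aggregate metrics are keyed "<evaluator>.<metric>" on a 1-5 scale.
score = result["metrics"]["groundedness.groundedness"]
THRESHOLD = 4.0  # gate value is a team decision, not an SDK default

if score < THRESHOLD:
    print(f"Groundedness {score:.2f} is below the gate of {THRESHOLD}.")
    sys.exit(1)  # non-zero exit fails the pipeline step
print(f"Groundedness {score:.2f} passes the gate.")
```

Running this on every pull request means a prompt tweak that quietly degrades grounding is caught before it ships, the same way a failing unit test would be.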

Scaling AI systems isn’t about better demos. It’s about trust. Evaluation closes the gap.

Oct 7, 12 pm CT

Liji Thomas

Gen AI Manager, H&R Block; MVP (AI)

Kansas City, Missouri, United States
