Session
Measure Your Magic: Evaluations That Scale Generative AI
Generative AI often feels like magic — surprising, creative, and full of potential. But magic alone doesn’t scale. Without the discipline of measurement, prototypes stall, trust erodes, and production never arrives. To build reliable, enterprise-grade AI, you have to measure your magic.
This session introduces the Microsoft.Extensions.AI.Evaluation libraries, designed to simplify the process of evaluating model outputs in Gen AI apps. These libraries provide a robust foundation for evaluating key dimensions like relevance, truthfulness, coherence, completeness, and safety. They offer a range of built-in quality, NLP, and safety evaluators — with the flexibility to customize and add your own.
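To make that concrete, here is a minimal sketch of scoring a single model response for coherence. It is based on the preview Microsoft.Extensions.AI.Evaluation packages; the specific names shown (CoherenceEvaluator, ChatConfiguration, EvaluateAsync, NumericMetric) reflect the preview API and may differ between versions, and the chat client placeholder is an assumption you would replace with your own model connection.

```csharp
// Minimal sketch (assumed API shapes from the Microsoft.Extensions.AI.Evaluation
// preview packages; exact names and signatures may differ between versions).
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Placeholder: supply the IChatClient for the model that acts as the "judge"
// (e.g. an Azure OpenAI or OpenAI chat client).
IChatClient judgeClient = /* your chat client here */ null!;

var chatConfiguration = new ChatConfiguration(judgeClient);

// The conversation being evaluated: the user's prompt and the model's answer.
var messages = new List<ChatMessage>
{
    new(ChatRole.User, "Summarize our refund policy in two sentences.")
};
var response = new ChatResponse(
    new ChatMessage(ChatRole.Assistant, "Refunds are issued within 30 days of purchase..."));

// One of the built-in quality evaluators; others cover relevance, completeness, fluency, etc.
IEvaluator evaluator = new CoherenceEvaluator();

EvaluationResult result = await evaluator.EvaluateAsync(
    messages, response, chatConfiguration);

// Each evaluator contributes one or more named metrics to the result.
foreach (var metric in result.Metrics.Values)
{
    if (metric is NumericMetric numeric)
    {
        Console.WriteLine($"{numeric.Name}: {numeric.Value}");
    }
}
```

The same pattern extends to the other built-in evaluators and to custom IEvaluator implementations you plug in alongside them.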
And as agentic AI becomes all the rage, with applications that plan, reason, and take multi-step actions autonomously, evaluation becomes even more critical. We’ll explore how to extend evaluation practices beyond static responses to agent workflows, action orchestration, and decision-making chains.
By the end, you’ll know why the only way to scale AI with confidence is simple: measure your magic.
Key Takeaways
- Understand why evaluations are the foundation of LLM Ops, not an afterthought.
- Learn how to use the Microsoft.Extensions.AI.Evaluation libraries to measure the quality of AI responses.
- Discover how to evaluate agentic AI applications, from workflows to reasoning steps.
- Apply the principles of Evaluation-Driven Development (EDD): designing evaluations first to guide how AI features are built and scaled.

Liji Thomas
Gen AI Manager - HRBlock, MVP (AI)
Kansas City, Missouri, United States