From Lab to Life: Practical AI System Evaluation

Agentic AI systems are a major evolution from single-model GenAI chatbots, but their dynamic, unpredictable behavior in the real world introduces significant operational, reputational, and financial risks for enterprises. This "reality gap" is a critical blind spot that static, pre-deployment benchmarks such as MMLU, with their fixed datasets, fail to address.
We propose a practical approach inspired by the framework proposed in the University of Michigan paper "Evaluation Framework for AI Systems in the Wild".
The authors advocate holistic frameworks that integrate performance, fairness, and ethics, which can serve as a foundation for risk-adjusted evaluation. They also suggest continuous, outcome-oriented methods that combine human and automated assessments in a transparent way, which can increase trust among stakeholders.

We will break down the principles of the framework and provide practical, actionable approaches for a risk-adjusted evaluation using the best of open-source technologies. We will explore how to apply these evaluation methods throughout the entire AI system development lifecycle, from inception to continuous, real-world monitoring.

Vincent Caldeira

Leading Open Source Technology Innovation for a Sustainable Future

Singapore
