
Who Let the Bots Out? A Guide to Evaluating AI Agents

In this talk, I will present a systematic, open-source framework for evaluating generative AI agents (LLM-based systems that manage complex, multi-step tasks) by dissecting their performance into three critical dimensions.

First, I'll detail how to evaluate tool use by examining each step, from tool selection and parameter capture to tool execution, ensuring that every individual component operates as intended.
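To make this concrete, here is a minimal Python sketch of what per-step tool-use checks can look like; the `ToolCall` record, the expected values, and the three score fields are hypothetical illustrations for this abstract, not the framework's actual API.

```python
# Hypothetical per-step tool-use check: score tool selection, parameter
# capture, and execution separately so failures can be localized.
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool_name: str    # which tool the agent selected
    arguments: dict   # parameters the agent captured for the call
    succeeded: bool   # whether execution completed without error

def evaluate_tool_call(call: ToolCall, expected_tool: str, expected_args: dict) -> dict:
    """Score one agent step on each tool-use component."""
    return {
        "tool_selection": call.tool_name == expected_tool,
        "parameter_capture": all(
            call.arguments.get(k) == v for k, v in expected_args.items()
        ),
        "tool_execution": call.succeeded,
    }

# Example: the agent was expected to call a weather tool for Atlanta.
step = ToolCall("get_weather", {"city": "Atlanta"}, succeeded=True)
print(evaluate_tool_call(step, "get_weather", {"city": "Atlanta"}))
# -> {'tool_selection': True, 'parameter_capture': True, 'tool_execution': True}
```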

Next, I'll show how trajectory evaluation scrutinizes the agent's overall workflow, verifying that it adheres to an optimal, efficient sequence of actions.
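As one illustration of trajectory scoring, the sketch below compares an observed action sequence against a reference trajectory; the exact-match, in-order, and efficiency metrics are plausible assumptions, not the talk's definitive rubric.

```python
# Hypothetical trajectory evaluation: compare the agent's observed actions
# against a reference sequence of expected actions.
def in_order(reference: list[str], observed: list[str]) -> bool:
    """True if every reference action appears in the observed run, in order."""
    it = iter(observed)
    return all(step in it for step in reference)

def trajectory_score(reference: list[str], observed: list[str]) -> dict:
    return {
        "exact_match": observed == reference,
        "in_order_match": in_order(reference, observed),
        # Efficiency penalizes detours: 1.0 means no wasted steps.
        "efficiency": len(reference) / max(len(observed), 1),
    }

reference = ["search_flights", "select_flight", "book_flight"]
observed = ["search_flights", "check_weather", "select_flight", "book_flight"]
print(trajectory_score(reference, observed))
# -> {'exact_match': False, 'in_order_match': True, 'efficiency': 0.75}
```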

Finally, I'll present goal evaluation strategies that quantitatively determine whether the agent achieves the specified outcomes.
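One simple way to make goal checks quantitative is to score the agent's final state against explicit success criteria, as in this sketch; the booking scenario and field names are hypothetical.

```python
# Hypothetical goal evaluation: fraction of success criteria the agent's
# final state satisfies, yielding a score between 0.0 and 1.0.
def goal_score(final_state: dict, criteria: dict) -> float:
    met = sum(
        1 for key, expected in criteria.items()
        if final_state.get(key) == expected
    )
    return met / len(criteria)

# Example: a booking agent should end with a confirmed, paid reservation.
final_state = {"booking_confirmed": True, "payment_processed": False}
criteria = {"booking_confirmed": True, "payment_processed": True}
print(goal_score(final_state, criteria))  # -> 0.5
```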

This approach not only identifies failure points across the three evaluation dimensions but also provides actionable insights for iterative improvement.

Attendees will gain a robust, reproducible methodology to benchmark and optimize AI agents, bridging the gap between experimental development and reliable production deployment.

Josh Reini

Developer Advocate for Open Source AI @ Snowflake, TruLens Maintainer

Atlanta, Georgia, United States


