Session

Evaluating AI agents isn’t as simple as running unit tests.

Unlike traditional software, where behaviour is deterministic and test outcomes are predictable, AI agents operate in dynamic, non-deterministic ways.

They can take different paths, make uncertain decisions, or even call the wrong tools, all while trying to complete a task.

In this talk, we’ll explore why standard benchmarks like MMLU or HellaSwag, designed for LLMs, fall short for agentic systems.

We’ll dive into evaluation techniques such as “LLM as a Judge”, code-based evals, and human annotation.
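To make the first two techniques concrete, here is a minimal sketch of an LLM-as-a-Judge scorer next to a deterministic code-based check. It assumes an OpenAI-compatible chat API via the openai Python package; the model name, rubric, and helper names (judge_answer, exact_match) are illustrative placeholders, not material from the talk itself.

```python
# Minimal LLM-as-a-Judge sketch (assumes an OpenAI-compatible chat API;
# model name, rubric, and helper names are placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_RUBRIC = """You are grading an AI agent's answer to a task.
Score it from 1 (wrong) to 5 (fully correct and well grounded).
Reply with the score digit only."""

def judge_answer(task: str, agent_answer: str) -> int:
    """Ask a judge model to score an agent's answer against the task."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": f"Task: {task}\nAnswer: {agent_answer}"},
        ],
    )
    return int(response.choices[0].message.content.strip())

def exact_match(expected: str, agent_answer: str) -> bool:
    """Code-based eval: a deterministic check that needs no judge model."""
    return expected.strip().lower() == agent_answer.strip().lower()
```

In practice the two are complementary: code-based checks are cheap and reproducible but only work when a ground-truth answer exists, while a judge model can grade open-ended agent behaviour at the cost of its own non-determinism.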

Nikhilesh Tayal

Google Developer Expert for AI. Co-founder of AI ML etc. (an AI-enabled edtech platform). 3x entrepreneur. Guest Faculty - Generative AI @ IITs/NITs. 70+ speaking engagements.

Udaipur, India
