You Need Judge Agents - LLM-as-Judge Is Not Enough

LLM-as-Judge is everywhere and quietly failing. Single-pass judging breaks on long-horizon tasks, tool outputs, and practically anything that needs complex internal domain knowledge. As agents become multi-step systems, evaluation must become multi-step too. That means agentic judges: judges that can decompose criteria, verify claims, call tools, cross-check sources, and produce auditable evaluations.
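
To make that concrete, here is a minimal sketch of such a judge loop. The `llm()` and `search_tool()` stubs are placeholders for whatever model and retrieval tool you use, not a specific vendor API, and the pass/fail protocol is an illustrative assumption rather than the session's prescribed design:

```python
# A minimal sketch of an agentic judge: decompose into per-criterion checks,
# ground each verdict in tool-retrieved evidence, keep an auditable record.
from dataclasses import dataclass

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call; swap in your provider."""
    raise NotImplementedError

def search_tool(query: str) -> str:
    """Placeholder retrieval/verification tool the judge can call."""
    raise NotImplementedError

@dataclass
class Finding:
    criterion: str
    evidence: str
    verdict: str  # "pass" | "fail"

def agentic_judge(task: str, answer: str, criteria: list[str]) -> list[Finding]:
    """Evaluate one answer criterion by criterion instead of in a single pass."""
    findings = []
    for criterion in criteria:
        # Verify the relevant claim with a tool call before judging it.
        evidence = search_tool(f"{task} {criterion}")
        verdict = llm(
            f"Task: {task}\nAnswer: {answer}\n"
            f"Criterion: {criterion}\nEvidence: {evidence}\n"
            "Reply with exactly 'pass' or 'fail'."
        ).strip().lower()
        findings.append(Finding(criterion, evidence, verdict))
    return findings  # each Finding carries the evidence behind its verdict
```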

This session gives a deep dive into judge architectures: rubric-based judges, pairwise and tournament ranking, self-consistency, reference-based vs. reference-free scoring, tool-assisted verification, and adversarial “red team” judges to detect reward hacking (to name a few). We’ll cover calibration (how to make judge scores stable), governance (how to use judges in high-risk environments), and the practical infrastructure you need.
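
As one illustration of the self-consistency idea named above, here is a sketch that reuses the hypothetical `llm()` stub from the previous example: sample the judge several times, take the majority verdict, and treat the agreement rate as a stability signal. The threshold-free design here is an assumption for brevity:

```python
# Self-consistency sketch: repeated sampling stabilizes noisy single-pass
# judgments and exposes low-agreement cases for human review.
from collections import Counter

def self_consistent_verdict(prompt: str, k: int = 5) -> tuple[str, float]:
    """Return the majority verdict and its agreement rate across k samples."""
    votes = Counter(llm(prompt).strip().lower() for _ in range(k))
    verdict, count = votes.most_common(1)[0]
    return verdict, count / k  # e.g. (verdict, 0.6) suggests an unstable judge
```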

What you’ll leave with:
- A blueprint for building judge agents that are robust and auditable
- Simple visual architecture patterns that are vendor- and code-agnostic
- How to calibrate your judges and treat evaluation as an ongoing process (sketched below)
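
On that last takeaway, one simple starting point for calibration is to measure judge-vs-human agreement on a small labeled set before trusting judge scores at scale. The function and numbers below are illustrative assumptions, not a method prescribed by the session:

```python
# Calibration sketch: compare judge labels against human labels and only
# roll the judge out once agreement clears a threshold you set for your domain.
def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of items where the judge and a human rater agree."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Example: agreement on a toy 5-item sample -> 0.8
print(agreement_rate(["pass", "fail", "pass", "pass", "fail"],
                     ["pass", "fail", "pass", "fail", "fail"]))
```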

This work draws on extensive published academic research and on applied methods from a range of global organizations.

Vincent Koc

Distinguished AI Research Engineer, Professor, and Keynote Speaker (TEDx, SXSW)

San Francisco, California, United States
