What makes a calibrated LLM Judge?
One recurring question from the AI community over the past two years about LLM-as-judge evaluation has been credibility. Namely, "how can we trust an LLM judge if autoregressive LLMs themselves are stochastic in nature and hence inherently unreliable?"
To answer this, we first need empirical foundations and benchmarks that establish strong correlation between human evaluation and automatic evaluators.
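
For concreteness, a benchmark of that kind might look like the minimal sketch below, which compares an LLM judge's scores against human labels on the same examples. The libraries, example data, and 0-3 rating scale here are illustrative assumptions, not part of the talk.

```python
# A minimal sketch of benchmarking judge scores against human labels.
# Assumes you already have paired scores for the same examples; the
# variable names and example data are illustrative only.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

human_scores = [3, 1, 2, 3, 0, 2, 1, 3]   # human ratings on a 0-3 scale
judge_scores = [3, 1, 1, 3, 0, 2, 2, 3]   # LLM judge ratings on the same scale

# Rank correlation: does the judge order examples the way humans do?
rho, p_value = spearmanr(human_scores, judge_scores)

# Chance-corrected agreement on the discrete scale.
kappa = cohen_kappa_score(human_scores, judge_scores)

print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f}), Cohen's kappa = {kappa:.2f}")
```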
There are many levers we can pull to improve the reliability of LLM judges, including the underlying LLM, the evaluation output scale, the evaluation criteria, model parameters such as temperature, chain-of-thought reasoning, and more.
This talk walks through how to benchmark an LLM judge against human evaluators, which levers are available to pull, and how to systematically experiment with those levers to improve calibration.
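
As one illustration of that kind of systematic experimentation, the sketch below sweeps a few levers and scores each configuration by its agreement with human labels. `run_judge` and the specific lever values are hypothetical placeholders for your own judge call (for example, a TruLens feedback function or a direct LLM API call), not an implementation from the talk.

```python
# A sketch of a lever sweep: re-run the judge under different settings and
# keep whichever configuration agrees best with human labels.
from itertools import product
from scipy.stats import spearmanr

temperatures = [0.0, 0.3, 0.7]
output_scales = ["binary", "0-3", "0-10"]
use_cot = [False, True]

def run_judge(examples, temperature, scale, chain_of_thought):
    """Hypothetical: returns one judge score per example under these settings."""
    raise NotImplementedError

def best_configuration(examples, human_scores):
    results = []
    for temp, scale, cot in product(temperatures, output_scales, use_cot):
        judge_scores = run_judge(examples, temperature=temp, scale=scale,
                                 chain_of_thought=cot)
        rho, _ = spearmanr(human_scores, judge_scores)
        results.append(((temp, scale, cot), rho))
    # The configuration with the highest rank correlation with humans wins.
    return max(results, key=lambda r: r[1])
```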

Josh Reini
Developer Advocate for Open Source AI @ Snowflake, TruLens Maintainer
Atlanta, Georgia, United States