What makes a calibrated LLM Judge?
One recurring question from the AI community over the past two years about LLM-as-judge evaluation has been credibility. Namely, "how can we trust an LLM judge if autoregressive LLMs themselves are stochastic in nature and hence inherently unreliable?"
To answer this, we first need empirical foundations and benchmarks that establish strong correlation between human evaluation and automatic evaluators.
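
For concreteness, a benchmark of that kind might look like the minimal sketch below, which compares an LLM judge's scores against human labels on the same examples. The libraries, example data, and 0-3 rating scale here are illustrative assumptions, not part of the talk.

```python
# A minimal sketch of benchmarking judge scores against human labels.
# Assumes you already have paired scores for the same examples; the
# variable names and example data are illustrative only.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

human_scores = [3, 1, 2, 3, 0, 2, 1, 3]   # human ratings on a 0-3 scale
judge_scores = [3, 1, 1, 3, 0, 2, 2, 3]   # LLM judge ratings on the same scale

# Rank correlation: does the judge order examples the way humans do?
rho, p_value = spearmanr(human_scores, judge_scores)

# Chance-corrected agreement on the discrete scale.
kappa = cohen_kappa_score(human_scores, judge_scores)

print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f}), Cohen's kappa = {kappa:.2f}")
```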
There are many levers we can pull to improve the reliability of LLM judges, including the underlying LLM, the evaluation output scale, the evaluation criteria, model parameters such as temperature, chain-of-thought reasoning, and more.
This talk walks through how to benchmark an LLM judge against human evaluators, which levers are available to pull, and how to systematically experiment with those levers to improve calibration.
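
As one illustration of that kind of systematic experimentation, the sketch below sweeps a few levers and scores each configuration by its agreement with human labels. `run_judge` and the specific lever values are hypothetical placeholders for your own judge call (for example, a TruLens feedback function or a direct LLM API call), not an implementation from the talk.

```python
# A sketch of a lever sweep: re-run the judge under different settings and
# keep whichever configuration agrees best with human labels.
from itertools import product
from scipy.stats import spearmanr

temperatures = [0.0, 0.3, 0.7]
output_scales = ["binary", "0-3", "0-10"]
use_cot = [False, True]

def run_judge(examples, temperature, scale, chain_of_thought):
    """Hypothetical: returns one judge score per example under these settings."""
    raise NotImplementedError

def best_configuration(examples, human_scores):
    results = []
    for temp, scale, cot in product(temperatures, output_scales, use_cot):
        judge_scores = run_judge(examples, temperature=temp, scale=scale,
                                 chain_of_thought=cot)
        rho, _ = spearmanr(human_scores, judge_scores)
        results.append(((temp, scale, cot), rho))
    # The configuration with the highest rank correlation with humans wins.
    return max(results, key=lambda r: r[1])
```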

Josh Reini
Developer Advocate for Open Source AI @ Snowflake, TruLens Maintainer
Atlanta, Georgia, United States