Stop Chasing Leaderboards: A Tier List of Benchmarks That Matter

The AI industry is currently addicted to benchmarks, and many of them are measurement "theatre". I've seen claims that an LLM has the IQ of a professor, that it's going to "replace doctors", and, best of all, that it can play Pokemon. So why do these models still get the basics wrong?

This talk proposes a tiered framework for benchmarks based on how well they predict real-world value: from toy tasks and closed-world puzzles, through tool-use and long-horizon reliability tests, to economic and productivity outcomes. Let's dissect how benchmarks work, where they make sense, and where they don't.

What they’ll leave with:
- A tier list of benchmarks and a practical understanding of how to tell them apart
- How to evaluate benchmarks (a benchmark for benchmarks)
- How to rethink this problem and focus on building your own benchmarks in practice

Vincent Koc

Distinguished AI Research Engineer, Professor and Keynote Speaker (TEDx, SXSW)

San Francisco, California, United States
