Session

The Evaluation Gap: how do you actually know your AI feature works?

A room full of senior engineers and CTOs, every one of whom had shipped an AI feature to production, and not one could tell you whether it actually worked. They knew it ran. Running and working, as anyone who has owned an old car understands intimately, are different things. That's the gap: not a knowledge gap, a measurement gap: a whole industry steering by a dashboard where the needles are painted on.
“You can't unit test a model” is a comforting lie, true in roughly the way “you can't measure the ocean with a ruler” is true. You're not testing the model; you're testing the behaviour of your feature, which has acceptance criteria even if you've never written them down (doesn't leak the system prompt, returns valid JSON, declines to advise the customer to set themselves on fire).
We swap assertEquals for evaluation suites, baselines and the occasional LLM-as-judge. You'll leave able to answer “does it work?” with something more dignified than a hopeful shrug.

Dennis Vroegop

Building AI that actually ships, and the people who build it. Mostly harmless

Melbourne, Australia

Actions

Please note that Sessionize is not responsible for the accuracy or validity of the data provided by speakers. If you suspect this profile to be fake or spam, please let us know.

Jump to top