Session

Sentence by sentence: a human/AI eval approach for the accuracy-obsessed

We built an AI chat app for Canada.ca with evals for individual sentences and citations instead of whole answers. This approach created a feedback loop delivering 95% accuracy across a wide range of topics and questions.

When citizens need tax filing deadlines or immigration paperwork, being mostly right isn't good enough - a single wrong sentence in an otherwise correct answer can lead to missed benefits or botched applications.

So we asked: what if we evaluated AI responses the way users actually read them - one sentence at a time? We created a dual system where human experts score each component through a natural chat interface, while vector embeddings automatically apply these detailed evaluations to similar future responses.

I'll walk you through our implementation - from the scoring components that make expert evaluation feel like a regular conversation, to the embedding-based auto-eval system that scales human feedback across thousands of GPT/Claude API tool interactions. You'll see real code examples, practical design patterns, and how this approach created a virtuous cycle where evals directly improve the system rather than just measuring it.
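The embedding-based auto-eval described above can be sketched roughly as follows: embed each new response sentence, compare it against previously human-scored sentences by cosine similarity, and reuse the human score when similarity clears a threshold, otherwise route the sentence to an expert. This is a minimal illustration under assumed names (`auto_eval`, `scored`) and an illustrative threshold, not the actual Canada.ca implementation:

```python
# Hypothetical sketch: reuse human sentence-level evals via embedding
# similarity. Names, data, and the 0.85 threshold are assumptions for
# illustration only.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Store of previously human-scored sentence embeddings: (embedding, verdict).
# In practice these would come from an embedding model and an eval database.
scored = [
    ([0.9, 0.1, 0.0], "correct"),
    ([0.1, 0.9, 0.2], "incorrect-citation"),
]

def auto_eval(embedding, threshold=0.85):
    """Return the human verdict of the most similar past sentence,
    or None to send this sentence to a human expert instead."""
    best_sim, best_verdict = 0.0, None
    for emb, verdict in scored:
        sim = cosine(embedding, emb)
        if sim > best_sim:
            best_sim, best_verdict = sim, verdict
    return best_verdict if best_sim >= threshold else None

print(auto_eval([0.88, 0.12, 0.01]))  # → correct (very close to first entry)
print(auto_eval([0.0, 0.0, 1.0]))     # → None (novel sentence, needs a human)
```

The threshold is the key design lever: set too low, stale human verdicts get applied to genuinely new sentences; set too high, nearly everything falls back to manual review and the system stops scaling.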

Lisa Fast

AI/UX Architect, @lisafast on GitHub

Ottawa, Canada


