When is a Model Good Enough? A Year of Speech-to-Text at UDI
When is a speech-to-text model actually good enough?
Norwegian asylum interviews are among the hardest speech-to-text scenarios a government can face. Three participants. Two languages switching mid-sentence. An interpreter whose code-switching breaks most ASR models. A legally binding record where a mistranscription can shape the outcome of a protection claim. And a data regime where the vendors you most want to test are the ones you are least allowed to test on real data.
Over the past year at UDI, we ran a pilot to answer exactly that question. We evaluated nine systems on theater interviews and historical audio; iterated through four architectures on Azure, from GPU VMs running our own models to an external transcription API; ran seven live asylum interviews with the one model our privacy approvals allowed; and wrote our own speaker separation when off-the-shelf tools failed. The honest answer at the end of the pilot: not yet, but closer than before.
This talk walks through what we learned, both technically and about the three people in the room. What worked in the models and what plateaued. What the live interviews revealed that no benchmark could. And how real-time transcription changes the work of the caseworker, the interpreter, and the person being interviewed. If you are building STT for Norwegian, or evaluating AI systems in regulated domains, this session shows you what that work actually looks like from the inside of a real government project.