Bo Li
Voice Agents at EliseAI
Actions
Bo has over 10 years of experience building real time systems for databases, decentralized finance, self driving cars, and voice AI. He previously worked as an Member of Technical Staff at Cartesia and is currently at EliseAI, building Agents for biggest sources of consumer spending: Housing and Healthcare.
Realtime Voice Agents with Frontier Intelligence
Dive into how the EliseAI voice agent harness orchestrates multiple models with jagged capability profiles to achieve realtime latency without sacrificing intelligence. Reduces p90 effective latency overhead of ASR, TTS, and tool calling to sub 200ms, unlocking frontier models like GPT 5.5 for voice.
### ASR: Eager Speculative Transcription
We introduce speculative transcription by pairing local Whisper or Parakeet fine-tunes for speed with API models like Scribe, Nova, or Gemini Flash for accuracy. A local content match classifier operates at sub 10ms latency, allowing us to immediately trigger the downstream pipeline from the fast local transcription and dynamically replace text with the more accurate transcription if significant differences occur. This process runs on a eager 100ms VAD delay, securely releasing the generated response audio only after a fixed silence threshold has passed.
### LLM: Async background tool injection
To eliminate expensive tool calling round trips, we implement system leveraging async background tool injection where the primary model makes no direct tool calls. Instead, local fine-tuned tool-calling models continuously observe the realtime transcription stream in the background. "Fake" tool call traces are then injected into the primary LLM’s context, which primes it for immediate, one-shot response generation.
### TTS: Prefix caching and infilling
Many Agent responses start with the same set of 3-6 words. We can cache this audio, releasing it immediately while we infill the remaining response audio conditioned on this prefix to preserve speech prosody. With this approach, a relatively small cache can achieve a 90% hit rate across a wide range of voices, languages and model providers.
Please note that Sessionize is not responsible for the accuracy or validity of the data provided by speakers. If you suspect this profile to be fake or spam, please let us know.
Jump to top