Session

Optimizing inference for voice models in production

How do you get time to first byte (TTFB) below 150 milliseconds for voice models -- and keep it there at production scale? As it turns out, open-source TTS models like Orpheus have an LLM backbone, which lets us use familiar tools and optimizations like TensorRT-LLM and FP8 quantization to serve these models with low latency. But client code, network infrastructure, and other outside-the-GPU factors can introduce latency in the production stack. In this talk, we'll cover the basic mechanics of TTS inference, common pitfalls to avoid when integrating these models into production systems, and how to extend this high-performance setup to serve customized models with voice cloning and fine-tuning.
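
As a concrete illustration of the client-side measurement the talk opens with, here is a minimal sketch of timing TTFB against a streaming TTS endpoint. The URL and request schema are hypothetical placeholders, not Orpheus's or any provider's actual API; the point is that the client must consume the response as a stream, since buffering the full audio before reading would hide the latency you are trying to measure.

```python
import time
import requests  # third-party: pip install requests

# Hypothetical streaming TTS endpoint; substitute your own deployment URL.
TTS_URL = "https://example.com/v1/tts/stream"


def measure_ttfb(text: str) -> float:
    """Return seconds from sending the request to receiving the first audio byte."""
    start = time.perf_counter()
    # stream=True keeps the client from buffering the whole response,
    # which would otherwise mask the server's real time to first byte.
    with requests.post(TTS_URL, json={"text": text}, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=None):
            if chunk:  # first non-empty audio chunk arrived
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any audio bytes arrived")


if __name__ == "__main__":
    ttfb = measure_ttfb("Hello from a streaming TTS model.")
    print(f"TTFB: {ttfb * 1000:.1f} ms")
```

Measuring from the client like this also surfaces the outside-the-GPU factors the abstract mentions: network round trips, TLS handshakes, and client-side buffering all show up in this number even when the model itself is fast.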

Philip Kiely

Head of Developer Relations at Baseten
