Speaker

Philip Kiely

Head of Developer Relations at Baseten

Philip Kiely leads Developer Relations at Baseten. Prior to joining Baseten in 2022, he worked across software engineering and technical writing for a variety of startups. Outside of work, you'll find Philip practicing martial arts, reading a new book, or cheering for his adopted Bay Area sports teams.

The Golden Triangle of Inference Optimization: Balancing Latency, Throughput, and Quality

Running high-performance production deployments for LLMs and other generative AI models requires balancing latency, throughput, and quality, given constraints around hardware, model selection, and cost.

Assembling the right combination of optimization methods is tricky, but it is essential for serving AI models effectively at scale. The options range from model-level optimizations like quantization and speculative sampling, to fast inference frameworks like vLLM, TensorRT-LLM, and SGLang, to newer methods like prefix caching and chunked context.

This talk introduces a wide variety of proven optimization techniques but, more importantly, provides a clear framework for categorizing, understanding, and selecting which techniques to use based on the objectives you're optimizing towards.
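
To make two corners of the triangle concrete, here is a minimal sketch (not from the talk) that measures time to first token and a rough tokens-per-second figure against any streaming, OpenAI-compatible endpoint. The base URL, API key, and model name are placeholder assumptions, and quality still has to be judged separately with evals.

    # Minimal sketch: measure time to first token (latency) and a rough
    # decode throughput for a streaming, OpenAI-compatible endpoint.
    # The base_url, api_key, and model name are placeholder assumptions.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    start = time.perf_counter()
    first_token_at = None
    num_chunks = 0

    stream = client.chat.completions.create(
        model="my-model",  # placeholder: whatever model the server is serving
        messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            num_chunks += 1
    end = time.perf_counter()

    if first_token_at is not None:
        print(f"time to first token: {(first_token_at - start) * 1000:.0f} ms")
        decode_time = max(end - first_token_at, 1e-9)
        print(f"throughput: ~{num_chunks / decode_time:.1f} chunks/s (roughly tokens/s)")

Changing a single knob, such as the quantization scheme or the serving framework, and re-running a measurement like this is the simplest way to see the latency/throughput trade-off in practice.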

Optimizing inference for voice models in production

How do you get time to first byte (TTFB) below 150 milliseconds for voice models, and how do you scale that in production? As it turns out, open-source TTS models like Orpheus have an LLM backbone, which lets us use familiar tools and optimizations like TensorRT-LLM and FP8 quantization to serve them with low latency. But client code, network infrastructure, and other outside-the-GPU factors can introduce latency throughout the production stack. In this talk, we'll cover the basic mechanics of TTS inference, common pitfalls to avoid when integrating TTS models into production systems, and how to extend this high-performance system to serve customized models with voice cloning and fine-tuning.
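
As a rough illustration of how TTFB is usually measured from the client side, here is a minimal sketch against a hypothetical streaming TTS endpoint. The URL, payload shape, and authentication are assumptions and will differ for any real deployment.

    # Minimal sketch: measure time to first byte (TTFB) for a streaming TTS
    # endpoint. The URL and payload shape are hypothetical; adapt them to
    # however your TTS model is actually deployed.
    import time
    import requests

    URL = "https://example.com/v1/tts/stream"  # hypothetical endpoint
    payload = {"text": "Hello from a low-latency voice model.", "voice": "default"}

    start = time.perf_counter()
    first_byte_at = None
    audio = bytearray()

    with requests.post(URL, json=payload, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:
                if first_byte_at is None:
                    first_byte_at = time.perf_counter()
                audio.extend(chunk)

    if first_byte_at is not None:
        print(f"TTFB: {(first_byte_at - start) * 1000:.0f} ms, audio: {len(audio)} bytes")

Because it is measured from the client, this number includes network and client-side overhead, exactly the outside-the-GPU latency the talk warns about.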

Introduction to LLM serving with SGLang

Do you want to learn how to serve models like DeepSeek and Qwen at SOTA speeds on launch day? SGLang is a fast, open-source serving framework for LLMs and VLMs that generates trillions of tokens per day at companies like xAI, AMD, and Meituan. This workshop guides AI engineers who are familiar with serving models using frameworks like vLLM, Ollama, and TensorRT-LLM through deploying and optimizing their first model with SGLang, and it offers guidance on when SGLang is the right tool for an LLM workload.
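
For a feel of what that first deployment looks like, here is a minimal sketch (assumed, not taken from the workshop): launch SGLang's OpenAI-compatible server for a model such as Qwen, then query it with a standard OpenAI client. The model name and server flags may differ depending on your SGLang version and hardware.

    # Minimal sketch: query a locally launched SGLang server through its
    # OpenAI-compatible API. Model name and flags are assumptions; check the
    # SGLang docs for the options matching your installed version.
    #
    # Launch the server first (in a separate shell), e.g.:
    #   python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --port 30000
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",  # must match the model the server loaded
        messages=[{"role": "user", "content": "In one sentence, what is SGLang?"}],
        max_tokens=64,
    )
    print(response.choices[0].message.content)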

