Session

Any Accelerator, Any Time: Mastering Cost-Efficient GenAI Inference at Scale

As generative AI adoption accelerates, organizations face mounting challenges in delivering cost-effective inference at scale. This session demonstrates how to build an intelligent, multi-accelerator inference platform that automatically optimizes for cost and performance. We'll show how to orchestrate popular open-source models like Llama, Mistral, and Stable Diffusion across heterogeneous compute resources, from NVIDIA GPUs to AWS Inferentia, using Kubernetes as our foundation. Through practical examples, attendees will learn how to implement dynamic workload scheduling that responds to real-time demand using KEDA and Prometheus metrics. We'll explore how Karpenter enables automated provisioning of the right compute resources at the right time, and how frameworks such as AWS Neuron and NVIDIA Triton Inference Server can be leveraged for optimal performance across different accelerators. You'll leave with actionable insights on building scalable, cost-efficient GenAI inference pipelines that intelligently adapt to varying workload demands while maintaining performance SLAs.
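The abstract names two concrete mechanisms: KEDA scaling an inference Deployment on a Prometheus metric, and Karpenter provisioning matching accelerated capacity. The sketch below, written with the Kubernetes Python client, shows roughly what those two resources look like. It is a minimal illustration, not material from the session: it assumes the KEDA and Karpenter (v1 API) CRDs are already installed, and the Deployment name llama-inference, the namespace, the Prometheus endpoint and query, and the instance types are all hypothetical placeholders.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (use load_incluster_config()
# when running inside the cluster).
config.load_kube_config()
api = client.CustomObjectsApi()

# KEDA ScaledObject: scale a (hypothetical) Llama serving Deployment on a
# Prometheus metric, here the rate of incoming inference requests.
scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "llama-scaler", "namespace": "genai"},
    "spec": {
        "scaleTargetRef": {"name": "llama-inference"},  # hypothetical Deployment
        "minReplicaCount": 1,
        "maxReplicaCount": 20,
        "triggers": [{
            "type": "prometheus",
            "metadata": {
                # Assumed Prometheus endpoint and metric name.
                "serverAddress": "http://prometheus.monitoring:9090",
                "query": 'sum(rate(inference_requests_total{app="llama-inference"}[1m]))',
                "threshold": "10",  # target ~10 req/s per replica
            },
        }],
    },
}

# Karpenter NodePool (cluster-scoped): allow both GPU (g5) and Inferentia
# (inf2) instance types, preferring Spot capacity when it is available.
node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "accelerated"},
    "spec": {
        "template": {
            "spec": {
                "requirements": [
                    {"key": "karpenter.sh/capacity-type",
                     "operator": "In", "values": ["spot", "on-demand"]},
                    {"key": "node.kubernetes.io/instance-type",
                     "operator": "In", "values": ["g5.xlarge", "inf2.xlarge"]},
                ],
                "nodeClassRef": {"group": "karpenter.k8s.aws",
                                 "kind": "EC2NodeClass",
                                 "name": "default"},  # assumed EC2NodeClass
            }
        },
        "limits": {"cpu": "200"},  # cap total provisioned capacity
    },
}

# Apply both custom resources through the generic CustomObjectsApi.
api.create_namespaced_custom_object(
    group="keda.sh", version="v1alpha1",
    namespace="genai", plural="scaledobjects", body=scaled_object)
api.create_cluster_custom_object(
    group="karpenter.sh", version="v1",
    plural="nodepools", body=node_pool)
```

Listing both g5 (GPU) and inf2 (Inferentia) instance types in a single NodePool is what lets Karpenter pick whichever accelerator is available and cheapest at scale-out time, which is the "any accelerator, any time" idea in the session title.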

Tsahi Duek's public speaking playlist:
https://youtube.com/playlist?list=PLxuZS4zp_O2QkooK9cBkgYWr-5R3Y4mWC&si=_LoUPzA5ZrwUpVcd

Tsahi Duek

Amazon Web Services, Principal Specialist Solutions Architect, Containers

City of London, United Kingdom
