Session
Scaling AI Inference Across Heterogeneous GPUs with Ray on Kubernetes
In production AI systems today, 60–70% of inference workloads are tightly coupled to a single GPU vendor or instance type, leading to 30–50% higher infrastructure costs, poor portability, and operational friction when scaling across cloud and on-prem environments. As demand grows, teams face a choice: lock in deeper or redesign for flexibility.
This session presents a GPU-agnostic inference architecture built with Ray on Kubernetes, designed to run reliably across heterogeneous accelerator clusters. By decoupling application logic from hardware assumptions and pairing Ray’s distributed execution with Kubernetes-native scheduling, teams can scale inference without rewriting pipelines for each GPU type.
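As a rough illustration of that decoupling, the minimal sketch below shows a Ray Serve deployment that requests a GPU by count rather than by vendor or instance type and resolves its device at runtime. The `InferenceModel` class and the stand-in linear model are hypothetical placeholders, not taken from the reference architecture discussed in the talk.

```python
import torch
import torch.nn as nn
from ray import serve


@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # request a GPU by count, not by vendor or instance type
    num_replicas=2,
)
class InferenceModel:
    def __init__(self):
        # Resolve the device at runtime instead of baking in a GPU assumption.
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = nn.Linear(16, 4).to(self.device).eval()  # stand-in model

    async def __call__(self, request):
        payload = await request.json()
        x = torch.tensor(payload["inputs"], dtype=torch.float32, device=self.device)
        with torch.no_grad():
            y = self.model(x)
        return {"outputs": y.cpu().tolist()}


app = InferenceModel.bind()
# serve.run(app)  # deploy onto a running Ray cluster
```

Because the deployment only declares how many GPUs it needs, the same code can land on any node in a mixed cluster that advertises a GPU resource, leaving placement to the Ray and Kubernetes schedulers.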
Using a production-grade reference architecture, we’ll show how inference traffic flows through Ray Serve, how workloads scale across mixed CPU/GPU nodes, and how concurrency, fault tolerance, and autoscaling are handled under real-world load. We’ll also demonstrate how KubeRay reduces operational overhead by managing Ray clusters through Kubernetes-native lifecycle controls.
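Concurrency and autoscaling in Ray Serve are typically declared per deployment; the sketch below is one hedged way to configure them, assuming a recent Ray release (older releases name the fields `target_num_ongoing_requests_per_replica` and `max_concurrent_queries`). The `Scorer` deployment and its thresholds are illustrative rather than the talk's actual settings; with KubeRay, the underlying Ray cluster itself would be declared as a `RayCluster` or `RayService` custom resource and managed through Kubernetes lifecycle controls.

```python
from ray import serve


@serve.deployment(
    ray_actor_options={"num_gpus": 0.5},  # fractional GPU: co-locate two replicas per device
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 8,
        "target_ongoing_requests": 4,  # scale out when per-replica load exceeds this
    },
    max_ongoing_requests=8,  # per-replica concurrency cap
)
class Scorer:
    async def __call__(self, request) -> dict:
        payload = await request.json()
        # Placeholder scoring logic; real inference would invoke the model here.
        return {"score": sum(payload.get("features", []))}


app = Scorer.bind()
# serve.run(app)  # Serve restarts failed replicas and rebalances them across nodes
```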