Dynamic GPU Autoscaling: Leveraging KServe and NVIDIA DCGM for Cost-Efficient Scaling

Implementing dynamic GPU autoscaling for deferred inference may seem daunting, but with the right approach it becomes a powerful way to boost performance while containing costs. By combining KServe or KEDA for serverless ML deployment with NVIDIA’s DCGM metrics, this system scales GPU resources in real time based on actual utilization rather than simple request counts. A custom metrics adapter feeds DCGM_FI_DEV_GPU_UTIL data into Kubernetes’ Horizontal Pod Autoscaler (HPA), so that GPU capacity tracks computational demand. Asynchronous prediction endpoints, coupled with scaling algorithms that factor in memory usage, compute load, and latency, deliver near-optimal resource allocation for complex workloads such as object detection. This talk walks through the technical steps behind utilization-based autoscaling with KServe or KEDA, including monitoring, alerting, and performance tuning. Real-world benchmarks from production show up to 40% GPU cost savings without compromising inference speed or accuracy. Attendees will learn practical methods for bridging ML frameworks and infrastructure, making GPU-accelerated ML more accessible and efficient in modern cloud-native environments.
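As a rough sketch of the mechanism the abstract describes (the Deployment name `triton-inference`, the resource name, and the 70% target are illustrative assumptions; it presumes dcgm-exporter and a Prometheus custom-metrics adapter are already installed in the cluster), an HPA keyed to GPU utilization rather than request count might look like:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference          # hypothetical GPU inference Deployment
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL   # surfaced via dcgm-exporter + custom metrics adapter
        target:
          type: AverageValue
          averageValue: "70"           # scale out when average GPU utilization exceeds ~70%
```

With KEDA, the same idea would instead be expressed as a `ScaledObject` with a Prometheus trigger querying the dcgm-exporter metric; the underlying signal is identical.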

Prashant Ramhit

Mirantis - Senior DevOps & QA

Dubai, United Arab Emirates
