Scaling ML Smarter: Optimizing Kueue & Volcano with Adaptive Scheduling

Kueue and Volcano are leading the charge in orchestrating large-scale distributed ML jobs. But are they truly maximizing your GPU resources? Traditional batch scheduling often suffers from inefficient queue management and rigid allocations that fail to adapt to real-time demand, and these problems grow as workloads scale.

This talk dives into how priority-aware queueing and elastic resource allocation can supercharge Kueue and Volcano, making batch scheduling more adaptive and efficient. We’ll break down the scheduler’s architecture, exploring how jobs dynamically move between priority queues, how elastic scheduling adjusts resource allocations in real time, and how these improvements lead to faster job execution and better GPU utilization.

Whether you're managing distributed training, hyperparameter tuning, or large-scale inference pipelines, this talk will provide the tools and strategies needed to unlock smarter scheduling and maximize ROI on Kubernetes GPU workloads.

Nikunj Goyal

Member of Technical Staff

