Unleashing the power of DRA (Dynamic Resource Allocation) for just-in-time GPU slicing

AI/ML practitioners using Kubernetes clusters to train, fine-tune, or serve large language models (LLMs) want to allocate GPUs and GPU slices dynamically, based on workload demand. Dynamic Resource Allocation (DRA), currently under development by the community, is a promising approach, but it requires changes to Kubernetes scheduling mechanisms and introduces latency-inducing roundtrips between schedulers and DRA controllers. Moreover, GPU slices must be requested through novel resource classes and claims, forcing users to adapt their workload specifications.
This talk demonstrates how we exploit DRA today to enable just-in-time GPU slicing on large production Kubernetes clusters running a mix of small fractional and large distributed workloads. InstaSlice acts on queued AI workloads, slicing GPUs with the help of DRA. By augmenting DRA with InstaSlice, we make it simple for users to benefit from DRA with zero changes to queued workloads and zero changes to Kubernetes schedulers.
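To illustrate the resource-class-and-claim model the abstract refers to, here is a minimal sketch of what requesting a GPU slice via DRA can look like. This assumes the Kubernetes DRA alpha API (`resource.k8s.io/v1alpha2`); the resource class name `gpu-slice-1g.5gb` and all object names are hypothetical and not part of InstaSlice or any particular GPU driver.

```yaml
# Hypothetical ResourceClaim asking the DRA driver for one GPU slice.
# The class name "gpu-slice-1g.5gb" is illustrative only.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: llm-finetune-gpu
spec:
  resourceClassName: gpu-slice-1g.5gb
---
# A Pod that consumes the claim: containers reference the claim by
# name under resources.claims, and the Pod maps that name to the
# ResourceClaim object under spec.resourceClaims.
apiVersion: v1
kind: Pod
metadata:
  name: llm-finetune
spec:
  containers:
  - name: trainer
    image: example.com/trainer:latest  # placeholder image
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    source:
      resourceClaimName: llm-finetune-gpu
```

This is the adaptation burden the talk mentions: with plain DRA, every workload must carry claim references like these, whereas the talk's approach leaves queued workloads unchanged.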

Abhishek Malvankar

Senior Software Engineer, Master Inventor at IBM Research
