Abhishek Malvankar
Senior Software Engineer, Master Inventor at IBM Research
Abhishek is a Senior Software Engineer and Master Inventor at IBM Research and co-chairs the CNCF Batch System Initiative. He works closely with Red Hat as a Partner Engineer. He focuses on resource management, performance, and distributed computing for AI workloads in the cloud. He enjoys designing easy-to-use solutions for the cloud and has filed 40+ patents. When possible, he likes to explore different adventure sports and take culinary vacations.
Incremental GPU Slicing in Action
Large language models are often released as families of models with varying parameter counts and quantization. To reduce cost, inference services increasingly rely on dynamic model selection, preferring smaller models when possible. GPU vendors are on a journey to enable dynamic GPU slicing, making it possible for a workload to request a fraction of the compute and memory units in a GPU, and for the slices to be created and destroyed on demand without disrupting existing workloads. The onus is now on Kubernetes. The Device Management Working Group is hard at work to expose these capabilities. While vendor-agnostic slicing APIs do not exist yet, this talk demonstrates that incremental GPU slicing is possible today. We replace the Multi-Instance GPU manager, which only permits partitioning GPUs in bulk, with an open-source incremental-slicing controller without needing new APIs or changes to the device plugin. Come learn how to achieve incremental slicing in your GPU clusters.
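To make the idea concrete, here is a minimal Python sketch of what "incremental" means in this context: a new slice is placed only into free slots on a GPU, so slices already in use keep running. The seven-slot layout, the profile names, and the GPU.place helper are illustrative assumptions, not the controller's actual code, and real MIG placement rules are stricter than this.

```python
# Hypothetical sketch of incremental slice placement (not the controller's actual code).
# A GPU is modeled as a row of 7 compute "slots" (A100-style MIG layout); each profile
# occupies a contiguous run of slots. New slices go into free slots only, so existing
# slices are never disturbed. Real MIG placement has stricter alignment rules.

from dataclasses import dataclass, field

PROFILE_SLOTS = {"1g.5gb": 1, "2g.10gb": 2, "3g.20gb": 3, "7g.40gb": 7}  # assumed sizes

@dataclass
class GPU:
    total_slots: int = 7
    used: list = field(default_factory=list)  # (start, size) for live slices

    def place(self, profile: str):
        """Return a start slot for `profile`, or None if it does not fit."""
        size = PROFILE_SLOTS[profile]
        occupied = {s for start, sz in self.used for s in range(start, start + sz)}
        for start in range(self.total_slots - size + 1):
            if all(s not in occupied for s in range(start, start + size)):
                self.used.append((start, size))  # existing slices are untouched
                return start
        return None

gpu = GPU()
print(gpu.place("3g.20gb"))  # -> 0
print(gpu.place("2g.10gb"))  # -> 3, added incrementally alongside the running 3g slice
```

Contrast this with bulk partitioning, where changing the layout means tearing down every slice on the GPU and recreating the whole set.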
Unleashing the power of DRA (Dynamic Resource Allocation) for just-in-time GPU slicing
AI/ML experts using Kubernetes clusters to train, fine-tune, or serve large language models (LLMs) would like to dynamically allocate GPUs and GPU slices based on the demands of their workloads. The DRA (Dynamic Resource Allocation) approach currently being developed by the community is promising, but it requires changes to Kubernetes scheduling mechanisms, introducing latency-inducing roundtrips between schedulers and DRA controllers. Moreover, GPU slices have to be requested by means of new resource classes and claims, requiring users to adapt their workload specifications.
This talk demonstrates how we exploit DRA today to enable just-in-time GPU slicing on large production Kubernetes clusters running a mix of small fractional and large distributed workloads. InstaSlice acts on queued AI workloads to slice GPUs with the help of DRA. By augmenting DRA with InstaSlice, we make it simple for users to leverage DRA with zero changes to queued workloads and zero changes to Kubernetes schedulers.
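As a rough illustration of the just-in-time flow described above, the sketch below reconciles queued pods by preparing a slice for each one before handing it back to the scheduler. The Pod, SliceBackend, and reconcile names are hypothetical stand-ins; InstaSlice's real controller and the DRA driver plumbing are not shown here.

```python
# Hypothetical reconciliation loop for "just-in-time" slicing of queued pods.
# The classes below are illustrative stand-ins, not InstaSlice or DRA APIs.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Pod:
    name: str
    requested_profile: str          # e.g. "1g.5gb", taken from the pod's resource request
    node: Optional[str] = None      # filled in once a slice has been prepared for it

class SliceBackend:
    """Stand-in for the piece that actually creates a GPU slice (via a DRA driver)."""
    def create_slice(self, node: str, profile: str) -> bool:
        print(f"creating {profile} slice on {node}")
        return True  # assume capacity is available in this sketch

def reconcile(queued: list, nodes: list, backend: SliceBackend) -> None:
    """For each queued pod, create the slice it needs, then release it to the scheduler."""
    for pod in queued:
        for node in nodes:
            if backend.create_slice(node, pod.requested_profile):
                pod.node = node  # the unmodified pod can now be scheduled as usual
                break

reconcile([Pod("llm-serve-0", "2g.10gb")], ["gpu-node-a"], SliceBackend())
```

The key point the sketch tries to capture is that the queued workload itself is never rewritten: the slicing happens around it, and the default scheduler sees an ordinary pod once the slice exists.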
Best practices for LLM serving with DRA
In the rapidly evolving landscape of Large Language Models (LLMs), overcoming low GPU cluster utilization (as low as 20-30% in traditional setups) is crucial for efficiently serving these models on Kubernetes. This talk shares insights from deploying and serving LLMs using MIG partitions and Dynamic Resource Allocation (DRA). Our experiments showed that the optimal MIG partition size depends on the specific LLM and its load, highlighting both the necessity and the feasibility of using DRA to dynamically scale model-serving instances vertically.
We'll showcase deploying the open-source vLLM framework in Kubernetes, focusing on scaling vLLM instances for increased loads while maximizing GPU utilization. Attendees will gain practical knowledge on selecting effective MIG partitions for different models and using DRA to optimize their model-serving systems.
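One way to picture the profile-selection question is the toy heuristic below: pick the smallest MIG profile whose memory covers the model's weights plus a KV-cache budget that grows with expected concurrency. The profile table, the 20% headroom factor, and the pick_profile helper are assumptions for illustration; the talk's actual recommendations come from measurements, not this formula.

```python
# Illustrative (not measured) heuristic for matching an LLM to a MIG profile:
# choose the smallest profile whose memory fits weights + KV cache + headroom.
# Profile sizes are an assumed A100-80GB-style layout.

PROFILES_GB = {"1g.10gb": 10, "2g.20gb": 20, "3g.40gb": 40, "7g.80gb": 80}

def pick_profile(weights_gb: float, kv_cache_gb_per_stream: float, concurrency: int) -> str:
    needed = (weights_gb + kv_cache_gb_per_stream * concurrency) * 1.2  # 20% headroom (assumed)
    for name, mem in sorted(PROFILES_GB.items(), key=lambda kv: kv[1]):
        if mem >= needed:
            return name
    raise ValueError("model does not fit on a single GPU slice")

# e.g. a ~7B-parameter model in fp16 (~14 GB of weights) under light load:
print(pick_profile(weights_gb=14, kv_cache_gb_per_stream=1.5, concurrency=4))  # -> "3g.40gb"
```

Under heavier load the same model would push past the smaller profiles, which is exactly where vertically rescaling the serving instance with DRA becomes attractive.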
Training foundation model workloads on Kubernetes at scale with MCAD
The Vela cloud-native AI supercomputer was built to train foundation models on Kubernetes. Different research teams inside IBM Research needed the flexibility to use the framework of their choice, such as PyTorch, Ray, or Spark, to train foundation models. There was a need to help users queue custom resources of their choice to support experimentation, with a high level of fault tolerance for training that spans hundreds of GPUs and runs for weeks or months. In this talk, we describe the role the Multi-Cluster App Dispatcher (MCAD) plays in queuing the different custom resources required for large-scale AI training, and its interplay with the underlying scheduler installed on the target Kubernetes cluster, with gang priority, gang preemption, and fault tolerance in mind.
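The gang-style admission that this queuing enables can be sketched as an all-or-nothing check: a job is dispatched only when every replica in the group fits at once, otherwise the whole job stays queued. The GroupedJob and try_dispatch names below are illustrative, not MCAD's implementation.

```python
# Conceptual sketch of gang-style admission: dispatch a job only if the entire
# group of replicas fits at once; never start a partial gang. This mirrors the
# idea behind queue-then-dispatch, not MCAD's actual code.

from dataclasses import dataclass

@dataclass
class GroupedJob:
    name: str
    replicas: int
    gpus_per_replica: int

def try_dispatch(queue: list, free_gpus: int) -> list:
    """Dispatch queued jobs in order, all-or-nothing per job."""
    dispatched = []
    for job in queue:
        demand = job.replicas * job.gpus_per_replica
        if demand <= free_gpus:
            free_gpus -= demand
            dispatched.append(job.name)
        # else: the job stays queued; none of its pods are created
    return dispatched

print(try_dispatch([GroupedJob("pytorch-pretrain", 64, 8), GroupedJob("ray-tune", 4, 1)],
                   free_gpus=100))
# -> ['ray-tune']: the 512-GPU gang stays queued instead of starting partially
```

Keeping the whole gang out of the cluster until it can run avoids the deadlocks and wasted GPU hours that partial placement causes for multi-week training runs.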