Session
Training foundation model workloads on Kubernetes at scale with MCAD
The Vela cloud-native AI supercomputer was built to train foundation models on Kubernetes. Different research teams inside IBM Research needed the flexibility to use the framework of their choice, such as PyTorch, Ray, or Spark, to train foundation models. Users needed a way to queue custom resources of their choice, supporting experimentation with high-level fault tolerance for training jobs that span hundreds of GPUs and run for weeks or months. In this talk, we describe the role the Multi-Cluster App Dispatcher (MCAD) plays in queuing the different custom resources required for large-scale AI training, and its interplay with the underlying scheduler installed on the target Kubernetes cluster, with gang priority, gang preemption, and fault tolerance in mind.
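To make the queuing model concrete, MCAD wraps an arbitrary custom resource (here a Kubeflow PyTorchJob) in an AppWrapper so it can be queued and gang-dispatched as a unit. The sketch below is illustrative, not authoritative: the names, priority value, and GPU counts are assumptions, and field names should be verified against the AppWrapper CRD version actually installed on the cluster.

```yaml
# Hypothetical AppWrapper sketch; verify field names against the
# installed MCAD CRD version before use.
apiVersion: workload.codeflare.dev/v1beta1
kind: AppWrapper
metadata:
  name: pytorch-pretrain            # illustrative name
spec:
  priority: 9                       # queue/gang priority (assumed value)
  resources:
    GenericItems:
    - replicas: 1
      custompodresources:           # aggregate resources MCAD reserves before dispatch
      - replicas: 4                 # e.g. 4 pods x 8 GPUs (assumed shape)
        requests:
          nvidia.com/gpu: 8
        limits:
          nvidia.com/gpu: 8
      generictemplate:              # any custom resource: PyTorchJob, RayCluster, ...
        apiVersion: kubeflow.org/v1
        kind: PyTorchJob
        metadata:
          name: pytorch-pretrain
        spec: {}                    # framework-specific job spec elided
```

Because the wrapped object is opaque to MCAD's queue, the same pattern applies whether the payload is a PyTorchJob, a Ray cluster, or a Spark application.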
Abhishek Malvankar
Senior Software Engineer, Master Inventor at IBM Research