Speaker

Claudia Misale

Staff Research Scientist, IBM Research

Claudia Misale is a Staff Research Scientist in the Hybrid Cloud Infrastructure Software group at the IBM T.J. Watson Research Center (NY). Her research focuses on Kubernetes, targeting monitoring, observability, and scheduling for HPC and AI training workloads. She is mainly interested in cloud computing and container technologies; her background is in high-level parallel programming models and patterns and in big data analytics on HPC platforms.

Area of Expertise

  • Information & Communications Technology

Cluster Management for Large Scale AI and GPUs: Challenges and Opportunities

There are new challenges in managing large GPU clusters dedicated to cloud native AI workloads. The workload mix is diverse, and GPUs must be effectively utilized and dynamically shared across multiple teams. Furthermore, GPUs are subject to a variety of performance degradations and faults that can severely impact multi-GPU jobs, thus requiring continuous monitoring and enhanced diagnostics. Cloud native tools such as Kubeflow, Kueue, and others are the building blocks for the large scale GPU clusters used by teams across IBM Research for training, tuning, and inference jobs. In this talk, IBM Research will share and demonstrate lessons learned on configuring large scale GPU clusters and on developing Kubernetes native automation to run health checks on GPUs and report their status. Finally, the talk will show how diagnostics enable both the dynamic adjustment of quotas to account for faulty GPUs and the automatic steering of new and existing workloads away from nodes with faulty GPUs.
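
For illustration only, a minimal sketch of one way such steering can be expressed in plain Kubernetes: a health-check component (such as Autopilot) labels nodes it finds degraded, and workloads use node affinity to avoid them. The label key and value below are assumptions for the sketch, not the exact keys used by IBM Research.

# Sketch: keep a GPU Pod off nodes that a health checker has marked degraded.
# The label key/value (gpu-health: degraded) are hypothetical; substitute the
# key actually written by your health-check automation.
apiVersion: v1
kind: Pod
metadata:
  name: training-worker
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-health
            operator: NotIn
            values: ["degraded"]
  containers:
  - name: trainer
    image: pytorch/pytorch:latest   # example image
    resources:
      limits:
        nvidia.com/gpu: 1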

Build, Operate, and Use a Multi-Tenant AI Cluster Based Entirely on Open Source

With GPUs scarce and costly, multi-tenant Kubernetes clusters that can queue and prioritize complex, heterogeneous AI/ML workloads while achieving both high utilization and fair sharing are a necessity for many organizations. This tutorial will teach the audience how to build, operate, and use an AI cluster. Starting from either a managed or on-premise Kubernetes cluster, we will demonstrate how to install and configure a number of open source projects (and only open source projects), such as Kueue, Kubeflow, PyTorch, Ray, vLLM, and Autopilot, to support the full AI model lifecycle (from data preprocessing to LLM training and inference), configure teams and quotas, monitor GPUs, and, to a large degree, automate fault detection and recovery. By the end of the tutorial the participants will have a thorough understanding of the AI software stack refined by IBM Research over several years to effectively manage and utilize thousands of GPUs. Come to learn the recipe and try it at home!
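
As a taste of the team and quota configuration the tutorial covers, here is a minimal Kueue sketch that grants one hypothetical team a GPU quota; team names and quota values are illustrative assumptions, not the configuration used by IBM Research.

# Minimal Kueue quota sketch (names and numbers are illustrative).
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq
spec:
  namespaceSelector: {}   # admit workloads from any namespace with a matching LocalQueue
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 64
      - name: "memory"
        nominalQuota: 512Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 16
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a
spec:
  clusterQueue: team-a-cq

Jobs submitted to the team-a-queue LocalQueue are then queued and admitted against the ClusterQueue's nominal quotas, which is the mechanism the tutorial builds on for fair sharing across teams.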
