Session

Mastering GPU Management in Kubernetes Using the Operator Pattern

Kubernetes is no longer just a tool for running workloads like web applications and microservices; it has become the ideal platform for supporting the end-to-end lifecycle of large artificial intelligence (AI) and machine learning (ML) workloads, such as large language models (LLMs).

GPUs have become the foundation of this workload shift. However, managing GPUs in a Kubernetes cluster requires full-stack knowledge, from the installation of kernel drivers to the setup of container runtimes, device plugins, and a monitoring stack. These activities can be broken down into four phases (a brief illustration of the first phase follows the list):

Installation of the GPU software stack on a small cluster
Infrastructure build-out by adding more nodes
Lifecycle management and software updates
Monitoring and error recovery
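
As a concrete illustration of the first phase: once the kernel driver, container toolkit, and device plugin are in place, each GPU node advertises an nvidia.com/gpu resource, and workloads request GPUs through the standard Kubernetes resource API. The sketch below is illustrative only; the pod name and image tag are assumptions, not taken from the talk.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # illustrative name
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # example CUDA base image
    command: ["nvidia-smi"]       # prints the GPUs visible inside the container
    resources:
      limits:
        nvidia.com/gpu: 1         # resource exposed by the NVIDIA device plugin
EOF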

In this talk, we discuss leveraging the operator pattern for the lifecycle management of GPU software in Kubernetes. We demo the NVIDIA GPU Operator to show how the operator pattern can benefit Kubernetes administrators, from basic driver installation to managing advanced AI/ML use cases.
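
As a rough sketch of what this looks like in practice, the GPU Operator is typically installed with Helm along the following lines (commands follow NVIDIA's public documentation; exact chart options and versions may differ from the demo):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator

Once deployed, the operator takes over deploying and reconciling the driver, container toolkit, device plugin, and DCGM-based monitoring components on the cluster's GPU nodes.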

Kevin Klues

Distinguished Engineer at NVIDIA

Berlin, Germany
