Session

Friend or Foe? Taming Node Taints and Labels for GPU Clusters on Managed Kubernetes

Node taints, labels, and selectors are powerful Kubernetes mechanisms for managing resource allocation and pod scheduling. When used effectively, they enhance cluster manageability, efficiency, and reliability. But for AI workloads that run on specialized hardware such as GPUs on managed Kubernetes platforms (GKE, EKS, AKS, and OKE), these features can become sources of confusion, misconfiguration, and fragile deployments.

In this lightning talk, we’ll share lessons learned from building GPU-accelerated Kubernetes clusters on public clouds as part of NVIDIA DGX Cloud. We’ll cover practical strategies for designing and managing taints, labels, selectors, and tolerations, while addressing cloud-specific quirks that impact scheduling, including NVIDIA GPU management and the latest GB200 systems with ARM architecture in the cloud. The talk will help you avoid common pitfalls and run AI workloads seamlessly on Kubernetes.

Yuan Chen

NVIDIA, Software Engineer, Kubernetes, Scheduling, GPU, AI/ML Infrastructure, Resource Management

San Jose, California, United States
