Yuan Chen

Nvidia, Software Engineer, Kubernetes, Scheduling, GPU, AI/ML Infrastructure, Resource Management

San Jose, California, United States

Yuan Chen is a Principal Software Engineer at NVIDIA, working on cloud-native Kubernetes infrastructure for AI in DGX Cloud. He previously built hyper-scale Kubernetes infrastructure at Apple. He has contributed to open source Kubernetes and the CNCF community, delivering 14 KubeCon talks. Yuan was a Principal Architect at JD.com and a Principal Research Scientist at Hewlett Packard Labs. He holds a Ph.D. in Computer Science from Georgia Tech.

Area of Expertise

  • Information & Communications Technology

Topics

  • Cloud Computing
  • Cloud Native
  • Kubernetes

A Quick Guide to Setting Up DRA and Managing GPU Resources for AI/ML Workloads in Kubernetes

Dynamic Resource Allocation (DRA) introduces a new paradigm for requesting, configuring, and sharing GPU resources in Kubernetes. It enables fine-grained, flexible resource management for AI/ML workloads in a unified and customizable manner.

This session provides a concise guide and live demo on installing, configuring, and using DRA. Attendees will learn how to use DRA to effectively manage GPU resources in a kind cluster on a local Linux machine, covering the following use cases (a minimal manifest sketch follows the list):

- A single pod/container using a dedicated GPU
- Multiple containers within a single pod sharing a dedicated GPU
- Multiple pods/containers sharing a dedicated GPU using different strategies: Time-Slicing and Multi-Process Service (MPS)
- Multiple pods/containers sharing multiple GPUs with Multi-Instance GPU (MIG)
- Layered GPU sharing across multiple pods/containers, such as Time-Slicing on MIG and MPS on MIG
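
For illustration, a minimal sketch of the first two use cases, assuming a recent Kubernetes release with DRA enabled and the NVIDIA DRA driver installed (which provides the gpu.nvidia.com device class, and whose container toolkit injects nvidia-smi at runtime); the resource.k8s.io API version varies across Kubernetes releases:

    apiVersion: resource.k8s.io/v1beta1
    kind: ResourceClaimTemplate
    metadata:
      name: single-gpu
    spec:
      spec:
        devices:
          requests:
          - name: gpu
            deviceClassName: gpu.nvidia.com
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: shared-gpu-pod
    spec:
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: single-gpu
      containers:
      - name: ctr0
        image: ubuntu:22.04
        command: ["bash", "-c", "nvidia-smi -L && sleep 3600"]
        resources:
          claims:
          - name: gpu        # references the shared claim
      - name: ctr1           # second container shares the same allocated GPU
        image: ubuntu:22.04
        command: ["bash", "-c", "nvidia-smi -L && sleep 3600"]
        resources:
          claims:
          - name: gpu

Dropping the second container yields the dedicated single-container case; the remaining use cases are configured through driver-specific claim parameters.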

A Practical Guide to Benchmarking AI and GPU Workloads in Kubernetes

Effective benchmarking is essential for optimizing GPU resource efficiency and enhancing performance for AI workloads. This talk provides a practical guide to setting up, configuring, and running various GPU and AI workload benchmarks in Kubernetes.

The talk covers benchmarks for a range of use cases, including model serving, model training, and GPU stress testing, using tools such as NVIDIA Triton Inference Server; fmperf, an open-source tool for benchmarking LLM serving performance; MLPerf, an open benchmark suite for comparing the performance of machine learning systems; GPUStressTest; gpu-burn; and CUDA benchmarks. The talk will also introduce GPU monitoring and load-generation tools.

Through step-by-step demonstrations, attendees will gain practical experience using benchmark tools. They will learn how to effectively run benchmarks on GPUs in Kubernetes and leverage existing tools to fine-tune and optimize GPU resource and workload management for improved performance and resource efficiency.
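
As a taste of the workflow, a hedged sketch of running a GPU stress test as a Kubernetes Job; the container image is a placeholder for a gpu-burn build, and the resource name assumes the NVIDIA device plugin:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: gpu-burn
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: gpu-burn
            image: example.com/gpu-burn:latest   # placeholder: build from the gpu-burn sources
            args: ["120"]                        # stress the GPU for 120 seconds
            resources:
              limits:
                nvidia.com/gpu: 1                # one dedicated GPU via the device plugin

A monitoring tool such as NVIDIA's DCGM exporter can then track utilization and thermals while the benchmark runs.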

Making Kubernetes GPU- and AI-Ready on Cloud: The Missing Runtime Pieces

Kubernetes is becoming the go-to platform for AI workloads, with GPU Operator serving as a key enabler by simplifying GPU management. However, large-scale AI demands more: managing diverse high-performance networking fabrics, tuning configurations across different cloud and on-prem environments, and optimizing container environments for AI/ML workloads.

To address this, we propose an accelerator-optimized runtime stack that manages the underlying operators and components, such as the GPU Operator, Network Operator, and DRA driver. It automates the deployment, configuration, and lifecycle management of these components, delivering a production-ready accelerated container environment that “just works” for AI/ML workloads on Kubernetes.
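
To make the idea concrete, a purely hypothetical sketch of a declarative interface such a stack might expose; the group, kind, and fields below are illustrative inventions, not a published API:

    apiVersion: runtime.example.com/v1alpha1     # hypothetical group/version
    kind: AcceleratedRuntime                     # hypothetical kind
    metadata:
      name: gpu-cluster-defaults
    spec:
      components:                  # operators the stack deploys and reconciles
        gpuOperator: { enabled: true }
        networkOperator: { enabled: true }
        draDriver: { enabled: true }
      platform: gke                # drives cloud-specific tuning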

In this talk, we present the design and implementation of this runtime stack for NVIDIA DGX Cloud's Kubernetes AI platform, sharing real-world lessons and operational experience to help you efficiently run and scale AI workloads on Kubernetes.

Friend or Foe? Taming Node Taints and Labels for GPU Clusters on Managed Kubernetes

Node taints, labels, and selectors are powerful Kubernetes mechanisms for managing resource allocation and pod scheduling. When used effectively, they enhance cluster manageability, efficiency, and reliability. But for AI workloads running on managed Kubernetes platforms, such as GKE, EKS, AKS, and OKE, with specialized hardware like GPUs, these features can become sources of confusion, misconfiguration, and fragile deployments.

In this lightning talk, we’ll share lessons learned from building GPU-accelerated Kubernetes clusters on public clouds as part of NVIDIA DGX Cloud. We’ll cover practical strategies for designing and managing taints, labels, selectors, and tolerations, while addressing cloud-specific quirks that affect scheduling, including NVIDIA GPU management and the latest ARM-based GB200 systems in the cloud. The talk will help you avoid common pitfalls and run AI workloads seamlessly on Kubernetes.
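
For example, a minimal pod sketch that tolerates a GPU node taint and pins to ARM nodes; the nvidia.com/gpu taint key matches what several managed platforms apply to GPU node pools, but exact keys and values vary by provider, and the image is a placeholder:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-workload
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64          # e.g., Grace-based GB200 nodes
      tolerations:
      - key: nvidia.com/gpu                # taint commonly set on GPU node pools
        operator: Exists
        effect: NoSchedule
      containers:
      - name: app
        image: example.com/cuda-app:latest # placeholder multi-arch image
        resources:
          limits:
            nvidia.com/gpu: 1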

Which GPU sharing strategy is right for you? A Comprehensive Benchmark Study using DRA

Dynamic Resource Allocation (DRA) is one of the most anticipated features ever to make its way into Kubernetes. It promises to revolutionize the way hardware devices are consumed and shared between workloads. In particular, DRA unlocks the ability to manage heterogeneous GPUs in a unified and configurable manner, without the need for awkward solutions shoehorned on top of the existing device plugin API.

In this talk, we use DRA to benchmark various GPU sharing strategies including Multi-Instance GPUs, Multi-Process Service (MPS), and CUDA Time-Slicing. As part of this, we provide guidance on the class of applications that can benefit from each strategy as well as how to combine different strategies in order to achieve optimal performance. The talk concludes with a discussion of potential challenges, future enhancements, and a live demo showcasing the use of each GPU sharing strategy with real-world applications.
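
As a flavor of the mechanism, a hedged sketch of selecting a sharing strategy through DRA's opaque device configuration, loosely following the NVIDIA DRA driver's published examples; the parameter group, kind, and field names evolve across driver releases, so treat the shape as illustrative:

    apiVersion: resource.k8s.io/v1beta1
    kind: ResourceClaimTemplate
    metadata:
      name: timesliced-gpu
    spec:
      spec:
        devices:
          requests:
          - name: gpu
            deviceClassName: gpu.nvidia.com
          config:
          - requests: ["gpu"]
            opaque:
              driver: gpu.nvidia.com
              parameters:                           # driver-specific, illustrative
                apiVersion: resource.nvidia.com/v1beta1
                kind: GpuConfig
                sharing:
                  strategy: TimeSlicing             # or MPS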

Navigating AI/ML Workloads in Large-Scale Kubernetes Clusters

Managing AI/ML workloads with GPUs on Kubernetes presents formidable challenges: job management and scheduling are complex, and the specialized computing resources required, such as GPUs, are substantial and not readily available.

This talk introduces Knavigator, an open-source framework and toolkit designed to support developers of Kubernetes systems. Knavigator facilitates the development, testing, troubleshooting, benchmarking, chaos engineering, performance analysis, and optimization of AI/ML control planes with GPUs in Kubernetes.

Knavigator enables tests on Kubernetes clusters using both real and virtual GPU nodes, allowing for large-scale testing with limited resources, such as a laptop.
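
A virtual GPU node is, in essence, a Node object that a simulator such as KWOK keeps alive; a minimal sketch (the capacities are illustrative, and the annotation is the one KWOK uses to claim ownership of a node):

    apiVersion: v1
    kind: Node
    metadata:
      name: virtual-gpu-node-0
      annotations:
        kwok.x-k8s.io/node: fake     # managed by KWOK; no kubelet behind it
      labels:
        type: kwok
    status:
      allocatable:
        cpu: "128"
        memory: 2Ti
        nvidia.com/gpu: "8"
      capacity:
        cpu: "128"
        memory: 2Ti
        nvidia.com/gpu: "8"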

Through real examples and demos, this presentation will showcase Knavigator's capabilities in feature validation, performance and load testing, and reliability testing. It will also highlight how Knavigator enhances the fault tolerance of large model-training jobs in Kubernetes.

Enhancing Reliability and Fault-Tolerance Testing in Kubernetes Using KWOK

Kubernetes has emerged as a popular platform for running AI workloads with GPUs. As a result, enhancing reliability has become increasingly important. This talk will demonstrate how the popular Kubernetes testing toolkit KWOK has been enhanced for reliability and fault-tolerance testing.

Shiming Zhang, the creator and maintainer of KWOK, and Yuan Chen from NVIDIA, will outline KWOK's capabilities to simulate and manage a large number of virtual nodes and pods on a laptop, and discuss practical use cases at DaoCloud and NVIDIA.

The session will provide examples and demos, offering a deep dive into KWOK’s latest chaos engineering features, including its ability to simulate failures by injecting targeted faults into GPU nodes and pods. These capabilities facilitate reliability testing and the evaluation of fault-tolerance mechanisms for improving the resilience of AI workloads in Kubernetes.
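
As one simple illustration of the idea, a hedged sketch of failing a simulated node by patching its status; because no kubelet backs the node, the condition sticks (unless KWOK's own stage configuration reconciles it), letting you observe how schedulers and controllers react. KWOK's Stage API offers richer, first-class fault simulation than this manual patch:

    # fail-node.yaml; apply with, e.g.:
    #   kubectl patch node virtual-gpu-node-0 --subresource=status \
    #     --type=merge --patch-file=fail-node.yaml
    status:
      conditions:
      - type: Ready
        status: "False"
        reason: SimulatedFailure
        message: fault injection for resilience testing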

Attendees will gain practical experience and knowledge about KWOK and its advanced capabilities.

Decoding and Taming the Soaring Costs of Large Language Models

Running generative AI applications can be prohibitively expensive. This talk unravels the soaring costs of inference: providing real-time responses to user queries using large language models with billions or trillions of parameters.

We estimate the staggering cost of serving individual user queries against massive models, and delve into the performance and cost challenges, from GPU hardware accelerators to the latencies of running ChatGPT.
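
As a back-of-envelope frame for such estimates (all symbols are illustrative assumptions, not figures from the talk), the serving cost per query can be approximated as:

    \[
      \text{cost per query} \approx \frac{N_{\text{GPU}} \times P_{\text{GPU-hour}} \times T_{\text{request}}}{3600 \times B}
    \]

where N_GPU is the number of GPUs hosting a model replica, P_GPU-hour is their hourly price, T_request is the end-to-end generation time in seconds, and B is the number of requests the replica serves concurrently. Sharing GPUs, smarter scheduling, and batching all grow the denominator, which is exactly where the techniques discussed next attack the cost.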

We explore the potential for improving resource efficiency and performance of running large-scale AI applications on Kubernetes and in cloud-native environments through GPU resource sharing, advanced scheduling, and dynamic batching.

We hope this talk will spur further discussion and collaboration within the CNCF community around taming the costs of deploying and scaling generative AI using cloud native technologies and best practices.

Accelerating AI Workloads with GPUs in Kubernetes

As AI and machine learning become ubiquitous, GPU acceleration is essential for model training and inference at scale. However, effectively leveraging GPUs in Kubernetes brings challenges around efficiency, configuration, extensibility, and scalability.

This talk provides a comprehensive overview of the capabilities Kubernetes and GPUs offer to address these challenges, enabling seamless support for next-generation AI applications.

The session will cover:

- GPU resource sharing mechanisms on Kubernetes, such as MPS (Multi-Process Service), Time-Slicing, MIG (Multi-Instance GPU), and vGPU (GPU virtualization).

- Flexible accelerator configuration via device plugins and Dynamic Resource Allocation with ResourceClaims and ResourceClasses in Kubernetes.

- Advanced scheduling and resource management features, including gang scheduling (a sketch follows this list), topology-aware scheduling, quota management, and job queues.

- The open-source efforts in Volcano, YuniKorn, and Slurm to support GPU and AI workloads in Kubernetes.
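
For the gang-scheduling item above, a minimal sketch using Volcano's PodGroup; the names, sizes, and image are illustrative:

    apiVersion: scheduling.volcano.sh/v1beta1
    kind: PodGroup
    metadata:
      name: training-gang
    spec:
      minMember: 8        # schedule all 8 workers together, or none at all
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: worker-0
      annotations:
        scheduling.k8s.io/group-name: training-gang   # join the gang
    spec:
      schedulerName: volcano
      containers:
      - name: worker
        image: example.com/trainer:latest             # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1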

Securing Kubernetes: Migrating from Long-Lived to Time-Bound Tokens Without Disrupting Existing Apps

In earlier versions of Kubernetes, secrets containing long-lived tokens were automatically generated for service accounts, posing security risks: these tokens never expire and could be shared among pods and users. Recent releases introduced the TokenRequest API for obtaining time-bound tokens with bounded lifetimes, improving security practices and discouraging the use of long-lived tokens.

Yuan Chen and James Munnelly will delve into the details of these changes, shedding light on their impact and providing strategies for migrating existing long-lived tokens to time-bound tokens without disrupting current customer applications. They will also share best practices for tracking and monitoring the different kinds of tokens in use within a Kubernetes cluster, including legacy long-lived tokens, time-bound tokens created via the TokenRequest API, and manually managed long-lived tokens, and will address effective management of time-bound token expiry in large-scale Kubernetes clusters.
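
One of the core migration building blocks is the projected service account token volume, which delivers a time-bound token through the TokenRequest API; a minimal sketch, with illustrative names and expiry:

    apiVersion: v1
    kind: Pod
    metadata:
      name: app
    spec:
      serviceAccountName: app-sa
      containers:
      - name: app
        image: example.com/app:latest        # placeholder image
        volumeMounts:
        - name: sa-token
          mountPath: /var/run/secrets/tokens
      volumes:
      - name: sa-token
        projected:
          sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 3600        # bounded lifetime
              audience: my-api               # bind the token to an audience

The kubelet refreshes the projected token as it nears expiry, so applications that re-read the file on each use pick up rotation automatically.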

From Novice to Contributor: Making Your Mark in Kubernetes and CNCF Open Source Projects

Yuan Chen, an active Kubernetes open source contributor from Apple, will guide open source novices on a journey to make their initial contributions in the world of Kubernetes and CNCF projects. Drawing from his personal experience, Yuan will provide a comprehensive roadmap, offering a step-by-step walkthrough on filing issues, submitting pull requests, engaging in fruitful discussions, and navigating the review process.

Yuan will address the potential challenges that may arise along the open source path and share effective strategies for conflict resolution. Additionally, he will provide invaluable insights into time management, empowering individuals to strike a harmonious balance between personal/work commitments and their open source endeavors.

This talk is designed to empower open source novices, equipping them with the knowledge and confidence to make their initial and impactful contributions that truly count within the CNCF community.
