Speaker

Kevin Klues

Distinguished Engineer at NVIDIA

Berlin, Germany

Kevin Klues is a distinguished engineer on the NVIDIA Cloud Native team. Kevin has been involved in the design and implementation of a number of Kubernetes technologies, including the Topology Manager, the Kubernetes stack for Multi-Instance GPUs, and Dynamic Resource Allocation (DRA). When not working, you can usually find Kevin on a snowboard or up in the mountains in one capacity or another.

Which GPU sharing strategy is right for you? A Comprehensive Benchmark Study using DRA

Dynamic Resource Allocation (DRA) is one of the most anticipated features to ever make its way into Kubernetes. It promises to revolutionize the way hardware devices are consumed and shared between workloads. In particular, DRA unlocks the ability to manage heterogeneous GPUs in a unified and configurable manner without the need for awkward solutions shoehorned on top of the existing device plugin API.
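
To make the claim-based model concrete, here is a minimal sketch of a DRA request for a single GPU. It assumes NVIDIA's DRA driver is installed (which registers the gpu.nvidia.com device class); the resource.k8s.io API group has moved through several alpha and beta versions, so adjust the apiVersion to whatever your cluster serves.

    # A ResourceClaimTemplate requesting one GPU via DRA. Pods reference this
    # template instead of a countable resource limit like nvidia.com/gpu.
    apiVersion: resource.k8s.io/v1beta1
    kind: ResourceClaimTemplate
    metadata:
      name: single-gpu
    spec:
      spec:
        devices:
          requests:
          - name: gpu
            deviceClassName: gpu.nvidia.com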

In this talk, we use DRA to benchmark various GPU sharing strategies including Multi-Instance GPUs, Multi-Process Service (MPS), and CUDA Time-Slicing. As part of this, we provide guidance on the class of applications that can benefit from each strategy as well as how to combine different strategies in order to achieve optimal performance. The talk concludes with a discussion of potential challenges, future enhancements, and a live demo showcasing the use of each GPU sharing strategy with real-world applications.
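
As a flavor of how a sharing strategy can be attached to a claim with the NVIDIA DRA driver, here is a sketch that selects time-slicing. The opaque parameter schema (the GpuConfig kind and its apiVersion) has evolved across driver releases, so treat the exact shape as illustrative rather than definitive.

    # Illustrative: request one GPU and ask the driver to share it via time-slicing.
    apiVersion: resource.k8s.io/v1beta1
    kind: ResourceClaimTemplate
    metadata:
      name: timesliced-gpu
    spec:
      spec:
        devices:
          requests:
          - name: gpu
            deviceClassName: gpu.nvidia.com
          config:
          - requests: ["gpu"]
            opaque:
              driver: gpu.nvidia.com
              parameters:
                apiVersion: gpu.nvidia.com/v1alpha1
                kind: GpuConfig
                sharing:
                  strategy: TimeSlicing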

Kubernetes WG Device Management - Advancing K8s Support for GPUs

The goal of the recently formed WG Device Management is to enable simple and efficient configuration, sharing, and allocation of accelerators (such as GPUs and TPUs) and other specialized devices. This working group focuses on the APIs, abstractions, and feature designs needed to configure, target, and share the necessary hardware for both batch and serving (inference) workloads.

The current focus of the working group is the Dynamic Resource Allocation (DRA) feature. Come to this talk to learn what we have delivered in Kubernetes 1.31, what is coming in 1.32 and beyond, and how you can influence the roadmap for Kubernetes support of accelerated workloads.

From Vectors to Pods: Integrating AI with Cloud Native

The rise of AI is challenging long-standing assumptions about running cloud native workloads. AI demands hardware accelerators, vast amounts of data, efficient scheduling, and exceptional scalability. Although Kubernetes remains the de facto choice, feedback from end users and collaboration with researchers and academia are essential to drive innovation, address gaps, and integrate AI in cloud native.

This panel features end users, AI infra researchers, and leads of the CNCF AI and Kubernetes device management working groups, focused on:

- Expanding beyond LLMs to explore AI for cloud native workload management, memory usage and debugging
- Challenges with scheduling and scaling of AI workloads from the end user perspective
- OSS Projects and innovation in AI and cloud native in the CNCF landscape
- Improving resource utilization and performance of AI workloads

The next decade of Kubernetes will be shaped by AI. We don’t yet know what this will look like; come join us to discover it together.

From foundation model to hosted AI solution in minutes

This session on AI-driven applications is co-hosted by IONOS and NVIDIA. Discover how IONOS leverages NVIDIA’s cutting-edge hardware to offer robust foundation models, propelling AI capabilities to new heights. Learn about IONOS's Kubernetes as a Service, designed to seamlessly integrate with powerful GPU infrastructure, ensuring optimal performance and scalability for your AI projects.

We will demonstrate the dynamic interaction between these solutions, showcasing real-world examples of how they work together to enhance AI-driven applications. This session will not only delve into current implementations but also explore future directions, providing insights into the potential advancements in AI applications facilitated by GPU integration within Kubernetes environments.

Unlocking the Full Potential of GPUs for AI Workloads on Kubernetes

Dynamic Resource Allocation (DRA) is a new Kubernetes feature that puts resource scheduling in the hands of third-party developers. It moves away from the limited "countable" interface for requesting access to resources (e.g. "nvidia.com/gpu: 2"), providing an API more akin to that of persistent volumes.
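
For contrast, the "countable" interface looks like this today; the pod can only say how many GPUs it wants, not which kind or how they may be shared (the image name is illustrative):

    # Traditional device plugin request: two opaque, indistinguishable GPUs.
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-pod
    spec:
      containers:
      - name: cuda
        image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
        resources:
          limits:
            nvidia.com/gpu: 2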

In the context of GPUs, this unlocks a host of new features without the need for awkward solutions shoehorned on top of the existing device plugin API.

These features include:
* Controlled GPU Sharing (both within a pod and across pods)
* Multiple GPU models per node (e.g. T4 and A100)
* Specifying arbitrary constraints for a GPU (min/max memory, device model, etc.; see the sketch after this list)
* Dynamic allocation of Multi-Instance GPUs (MIG)
* … the list goes on ...
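
The arbitrary-constraints item maps to CEL selectors on a request. A minimal sketch follows, assuming the driver publishes a productName attribute and a memory capacity under the gpu.nvidia.com prefix; the attribute names and quantity helpers shown are driver- and version-specific assumptions.

    # Illustrative: only match A100-class devices with at least 40Gi of memory.
    apiVersion: resource.k8s.io/v1beta1
    kind: ResourceClaimTemplate
    metadata:
      name: a100-gpu
    spec:
      spec:
        devices:
          requests:
          - name: gpu
            deviceClassName: gpu.nvidia.com
            selectors:
            - cel:
                expression: |-
                  device.attributes["gpu.nvidia.com"].productName.matches("A100") &&
                  device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0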

In this talk, you will learn about the DRA resource driver we have built for GPUs. We walk through each of the features it provides, including its integration with the NVIDIA GPU Operator. We conclude with a demo of how you can get started today.

DRAcon: demystifying Dynamic Resource Allocation - from myths to facts

At KubeCon NA 2023, dynamic resource allocation (DRA) made headlines because it was mentioned in the keynote. This generated so much buzz that Tim Hockin quipped on social media that it felt like he attended DRAcon instead of KubeCon. At KubeCon EU we’ll demystify this new technology!

DRA is a new approach for describing resource requirements in a Kubernetes cluster. It was first introduced in Kubernetes 1.26 and remains in alpha as of 1.29.

It offers several advantages compared to existing approaches:

- Support for custom hardware can be added by developing and deploying DRA drivers, without having to modify Kubernetes.
- Resource parameters are defined by vendors.
- Sharing of a resource instance between containers and pods.
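
The last item above looks roughly like this in practice: two containers in one pod reference the same claim and therefore see the same GPU. This is a sketch; the resourceClaims field names shifted while DRA was in alpha, and the single-gpu template name is hypothetical.

    apiVersion: v1
    kind: Pod
    metadata:
      name: shared-gpu-pod
    spec:
      resourceClaims:
      - name: shared-gpu
        resourceClaimTemplateName: single-gpu   # hypothetical template name
      containers:
      - name: ctr0
        image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
        resources:
          claims:
          - name: shared-gpu   # both containers share the same allocated GPU
      - name: ctr1
        image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
        resources:
          claims:
          - name: shared-gpu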

In order to move forward to beta and beyond, we need feedback from the community to understand whether it’s ready in its current form, who wants to use it for what, and how we can solve some of the open challenges, like cluster autoscaler support.

Mastering GPU Management in Kubernetes Using the Operator Pattern

Kubernetes is no longer just a tool for running workloads like web applications and microservices; it is the ideal platform for supporting the end-to-end lifecycle of large artificial intelligence (AI) and machine learning (ML) workloads, such as LLMs.

GPUs have become the foundation of this workload shift. However, managing GPUs in a Kubernetes cluster requires full-stack knowledge, from the installation of kernel drivers to the setup of container runtimes, device plugins, and a monitoring stack. These activities can be broken down into four phases:

1. Installation of the GPU software stack on a small cluster
2. Infrastructure build-out by adding more nodes
3. Lifecycle management and software updates
4. Monitoring and error recovery

In this talk, we discuss leveraging the operator pattern for the lifecycle management of GPU software in K8s. We demo the NVIDIA GPU Operator to show how the operator pattern can benefit K8s admins, from basic driver installation to managing advanced AI/ML use cases.
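
For a taste of what the operator pattern looks like to an admin, the GPU Operator is driven by a single ClusterPolicy custom resource that toggles each layer of the stack. The sketch below is trimmed to a few common fields; consult the operator's documentation for the full schema.

    # Illustrative ClusterPolicy: the operator reconciles each enabled component.
    apiVersion: nvidia.com/v1
    kind: ClusterPolicy
    metadata:
      name: cluster-policy
    spec:
      driver:
        enabled: true        # kernel driver installation
      toolkit:
        enabled: true        # NVIDIA Container Toolkit setup
      devicePlugin:
        enabled: true        # advertise GPUs to the kubelet
      dcgmExporter:
        enabled: true        # monitoring stack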

Running AI Workloads in Containers and Kubernetes

Containers are the best way to run machine learning and AI workloads in the cloud. However, running these workloads efficiently poses unique challenges, from resource management to performance optimization.

In this talk, we dive into the details of how GPUs are made available to such workloads, both in standalone containers and in Kubernetes. As part of this, we discuss various options for sharing GPUs between workloads. These techniques include simple time-slicing, MPS, and MIG.
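
As one concrete example, simple time-slicing with the NVIDIA device plugin is configured through a small config file, typically mounted from a ConfigMap. The ConfigMap name and key below are deployment choices, not fixed names.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nvidia-device-plugin-config
    data:
      config.yaml: |
        version: v1
        sharing:
          timeSlicing:
            resources:
            - name: nvidia.com/gpu
              replicas: 4   # each physical GPU is advertised four times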

By the end of this session, attendees will have a comprehensive understanding of how GPU support in containers and Kubernetes works under the hood, as well as the knowledge required to make the most efficient use of GPUs in their own applications.

ContainerDays Conference 2024 Sessionize Event

September 2024 Hamburg, Germany

WeAreDevelopers World Congress 2024 Sessionize Event

July 2024 Berlin, Germany

Maintainer Track + ContribFest: KubeCon + CloudNativeCon Europe 2024 Sessionize Event

March 2024 Paris, France

KubeCon + CloudNativeCon Europe 2024 Sessionize Event

March 2024 Paris, France

KubeCon + CloudNativeCon North America 2023 Sessionize Event

November 2023 Chicago, Illinois, United States
