Speaker

Eduardo Arango Gutierrez


Senior Systems Software Engineer @NVIDIA

Landsberg am Lech, Germany


Eduardo is a Senior Systems Software Engineer at NVIDIA, working on the Cloud Native Technologies team. Eduardo has focused on enabling users to build and deploy containers in distributed environments.

Area of Expertise

  • Environment & Cleantech
  • Information & Communications Technology

Best of both worlds: integrating Slurm with Kubernetes in a Kubernetes native way

It's not always clear which container orchestration system is best suited for a given use case. Slurm, for example, is often preferred over Kubernetes for running large-scale distributed workloads. As a result, organizations often face a hard choice: do they deploy Slurm or Kubernetes to meet the rising demands of their AI/ML workloads?

In this talk, we introduce K-Foundry, an open-source custom controller for KCP that translates Kubernetes jobs to Slurm jobs and exposes Slurm nodes and cluster info as Kubernetes Custom Resource Definitions (CRDs). This integration combines Slurm’s robust job scheduling with Kubernetes' dynamic orchestration and API-driven ecosystem, easing the administration of both clusters through a common API.
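To give a flavor of the translation idea, a workload submitted through the Kubernetes API could be an ordinary batch Job; the sketch below is illustrative only, and the annotation key is a hypothetical hint, not K-Foundry's actual API.

```yaml
# Illustrative only: a plain Kubernetes Job that a translation
# controller such as K-Foundry could map onto a Slurm batch job.
# The annotation key is a made-up example, not a documented API.
apiVersion: batch/v1
kind: Job
metadata:
  name: mpi-hello
  annotations:
    slurm.example.com/partition: gpu   # hypothetical hint for the target Slurm partition
spec:
  completions: 4
  parallelism: 4
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: ghcr.io/example/mpi-hello:latest
          resources:
            limits:
              nvidia.com/gpu: 1
```

In such a scheme, fields like `parallelism` and the GPU resource limit would inform the generated Slurm job's task count and GRES request.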

This session will end with a live demo, where attendees will see how this integration bridges the gap between cloud and HPC, facilitating resource management and optimizing performance for large-scale AI and LLM tasks.

Advancements in AI/ML Inference Workloads on Kubernetes from WG Serving and Ecosystem Projects

The emergence of Generative AI (GenAI) has introduced new challenges and demands in AI/ML inference, necessitating advanced solutions for efficient serving infrastructures. The Kubernetes Working Group Serving (WG Serving) is dedicated to enhancing serving workloads on K8s, especially hardware-accelerated AI/ML inference. The group prioritizes compute-intensive inference scenarios using specialized accelerators, while its improvements also benefit other serving workloads such as web services and stateful databases.

This session will dive into recent progress and updates on WG Serving's initiatives and workstreams. We will spotlight discussions and advancements in each workstream. We are also actively looking for feedback and partnerships with model server authors and other practitioners who want to leverage the power of K8s for their serving workloads. Join us to gain insight into our work and learn how to contribute to advancing AI/ML inference on K8s.

Get the most out of your GPUs on Kubernetes with the GPU Operator

NVIDIA’s GPU Operator has become the de facto standard for managing GPUs in Kubernetes at scale. This tutorial provides in-depth, hands-on training on the various GPU sharing techniques that are possible with the GPU Operator. Participants will learn to deploy jobs utilizing these sharing techniques and get hands-on experience installing and configuring the NVIDIA GPU Operator itself. This includes an in-depth exploration of its two primary CRDs: ClusterPolicy and NVIDIADriver. These CRDs are essential for configuring GPU-accelerated nodes, enabling GPU sharing mechanisms, and performing GPU driver upgrades. The session will culminate in practical use cases, such as training an AI/ML model, giving participants firsthand experience in managing a GPU-accelerated Kubernetes cluster.
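As one example of the sharing techniques covered, GPU time-slicing is typically configured through a device plugin config referenced by the ClusterPolicy. The fragment below follows the documented config format, but field names can differ between Operator versions, so treat it as an illustrative sketch rather than a drop-in manifest.

```yaml
# Illustrative time-slicing configuration for the NVIDIA device plugin,
# referenced from the GPU Operator's ClusterPolicy. Verify field names
# against the GPU Operator documentation for your version.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU is advertised as 4 schedulable GPUs
```

With a config like this in place, a node with one physical GPU would report a capacity of four `nvidia.com/gpu` resources, letting four pods time-share the device.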

Cloud-Native Supercomputing: Exploring New Technologies for High-Performance Computing

As cloud-native technologies continue to advance, researchers and engineers in high-performance computing (HPC) are beginning to explore how these technologies can be used to build more scalable, reliable, and efficient supercomputers. In this talk, we will explore some of the latest technologies being used in cloud-native supercomputing, and discuss how they are being integrated into Kubernetes, the de facto standard for container orchestration.

Join the Journey: Contribute to Node Feature Discovery and Shape AI-Ready Kubernetes Infrastructure

Do you have experience in Kubernetes node management or infrastructure enablement and aspire to grow into a maintainer or approver role? Join us to explore the current roadmap of the Kubernetes-sigs Node Feature Discovery project, identify good entry points for first contributions, and discuss how this project supports AI/ML workloads on Kubernetes.

Introducing ClusterInventory and ClusterFeature API

Discover how the ClusterInventory and ClusterFeature APIs fortify the Kubernetes/CNCF multi-cluster ecosystem. ClusterInventory streamlines cluster management and integrates diverse tools, while ClusterFeature surfaces each cluster's distinctive attributes. Together they boost efficiency and flexibility in multi-cluster environments.

Explore how these APIs reshape Kubernetes multi-cluster operations, gaining insights into tool compatibility across clusters and smooth transitions between different cluster managers. Join us for a deep dive into their transformative potential, charting the future of Kubernetes multi-cluster application management. During the talk, we will demo a custom controller that uses both APIs to easily manage a multi-cluster environment.

Attendees will gain knowledge on how to utilize these two new APIs to simplify multi-cluster management, as well as how to create custom controllers to build upon them.

K-foundry: Cloud Native Slurm (A project update)

At KubeCon North America 2024, we introduced K-foundry, a KCP-based controller designed to give the HPC scheduler Slurm a cloud-native interface, leveraging a Kubernetes control plane. Since our initial presentation, we have seen an enthusiastic response from the community, with many individuals and organizations actively participating in and contributing to the development of K-foundry. This collaborative effort has enriched the project and highlighted the power of open source in driving technological advancement.

In this talk, we will delve into the various challenges we encountered during development and discuss the solutions we implemented in K-foundry. Additionally, we will share our vision for the future, outlining the roadmap for V1 and what users can expect as the project evolves. Join us as we explore the journey of K-foundry and its potential impact on AI/ML workloads in a cloud-native ecosystem.

NFD: Simplifying Cluster Administration through Automated Node Labels, Taints, and Annotations

Join lead contributors for an in-depth update on the Kubernetes SIGs project Node Feature Discovery (NFD) – an indispensable add-on for managing labels and other node properties. Gain valuable insights into the project and its impact on Kubernetes cluster management, exploring practical applications, hidden capabilities, and future prospects, enriched by our experiences on heterogeneous clusters.

Delve into the built-in feature detection functions of NFD and explore its diverse customization capabilities, designed to meet a broad spectrum of specific node management needs. Our presentation will showcase the latest developments in NFD, offering a comprehensive view of practical usage scenarios. Witness how NFD enables GPUs and other specialized hardware in Kubernetes and empowers users to integrate confidential computing technologies into their clusters.

Ultimately, NFD delivers cluster admins an automated, secure, and reliable solution for both node feature discovery and labeling.
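As a concrete illustration of how NFD's labels are consumed: the label key below follows NFD's standard `feature.node.kubernetes.io/` convention for CPU features, while the pod itself is a made-up example.

```yaml
# Example consumer of NFD-generated labels: schedule a pod only on
# nodes where NFD has detected AVX-512 support. The label key follows
# NFD's feature.node.kubernetes.io/ convention; the pod is illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: avx512-workload
spec:
  nodeSelector:
    feature.node.kubernetes.io/cpu-cpuid.AVX512F: "true"
  containers:
    - name: app
      image: registry.example.com/simd-app:latest
```

Because NFD keeps these labels up to date automatically, workloads can target hardware capabilities without administrators hand-labeling nodes.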

WG Serving: Accelerating AI/ML Inference Workloads on Kubernetes

The emergence of Generative AI (GenAI) has introduced new challenges and demands in AI/ML inference, necessitating advanced solutions for efficient serving infrastructures. The recently created Kubernetes Working Group Serving (WG Serving) is dedicated to enhancing serving workloads on K8s, especially hardware-accelerated AI/ML inference. The group prioritizes compute-intensive inference scenarios using specialized accelerators, while its improvements also benefit other serving workloads such as web services and stateful databases.

This session will dive into WG Serving's initiatives and workstreams. We will spotlight discussions and advancements in each workstream. We are also actively looking for feedback and partnerships with model server authors and other practitioners who want to leverage the power of K8s for their serving workloads. Join us to gain insight into our work and learn how to contribute to advancing AI/ML inference on K8s.
