Session
Advanced GPU-Orchestrated Workflows and HPC Integrations on K8s for Distributed AI/ML at Scale
As AI/ML workloads continue to scale in complexity, developers and platform engineers are pushing Kubernetes beyond typical MLOps boundaries.
This talk dives into strategies for orchestrating GPU-accelerated training and inference across large-scale clusters, integrating HPC principles, operator-based scheduling, and novel debugging workflows.
Attendees will learn how to implement fine-grained GPU partitioning, harness ephemeral containers to probe and adjust multi-node training in real time, and adopt eBPF-driven instrumentation for low-overhead, kernel-level performance insights. We'll explore cutting-edge scheduling optimizations, such as reinforcement-learning approaches and HPC-inspired batch-queuing orchestration on Kubernetes, that dynamically respond to heterogeneous job demands.
Real-world case studies will highlight HPC integration scenarios (RDMA, GPUDirect) for data-parallel workloads and complex training frameworks such as Horovod, Ray, and Spark on Kubernetes.
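To make the fine-grained GPU partitioning mentioned above concrete, here is a minimal sketch of a pod requesting a single MIG slice via the NVIDIA device plugin. The resource name and image are illustrative assumptions; the actual MIG profile names depend on how your cluster's GPUs are partitioned.

```yaml
# Sketch: a pod consuming one 1g.5gb MIG slice of an A100.
# Assumes the NVIDIA device plugin is deployed with a MIG strategy
# that exposes per-profile resource names; adjust to your cluster.
apiVersion: v1
kind: Pod
metadata:
  name: mig-training-worker
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3  # illustrative image tag
    command: ["python", "train.py"]          # hypothetical entrypoint
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1             # one MIG slice, not a full GPU
```

Requesting a MIG profile rather than a whole `nvidia.com/gpu` lets several training or inference pods share one physical accelerator with hardware-enforced isolation.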

Brandon Kang
Principal Solutions Architect, Akamai Technologies
Kubernetes, Cloud Native, Open Source
Seoul, South Korea