Speaker

Zhixin Huo

Alibaba Cloud Intelligence, Senior Software Engineer

Beijing, China

A cloud computing engineer on the Alibaba Cloud Elastic Computing team, working on containerization, observability, and application optimization for GPU and RDMA workloads on Kubernetes, with experience building and maintaining large-scale intelligent computing clusters.

Area of Expertise

  • Information & Communications Technology
  • Real Estate & Architecture
  • Travel & Tourism

Topics

  • Kubernetes
  • Container Technology
  • Cloud Native & Kubernetes
  • Alibaba Cloud
  • AI Container

AI Profiling: A new online fine-grained observability solution for AI workloads on Kubernetes

LLM training and inference are expanding resource demand and footprint in AI Kubernetes clusters, while task failures and performance regressions surge. Beyond fleet-level monitoring, fine-grained, workload-centric observability and operator-level tuning (e.g., CUDA/Torch ops) are required. A container-native AI profiling capability for Kubernetes, built on dynamic injection and eBPF, provides zero-instrumentation, zero-disruption, online, dynamically switchable profiling with ultra-low overhead, capturing end-to-end call chains and communication paths across the stack. Multi-dimensional telemetry—CPU, CUDA kernels, Torch Profiler, system calls, CPython, RDMA networking—is correlated to surface bottlenecks and interference. The approach enables targeted diagnosis and remediation of production issues, illustrated by an LLM inference case. Deployment variants support runc and Kata Containers for multi-tenant, security-hardened, performance-critical clusters.
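
To make the zero-instrumentation mechanism concrete, below is a minimal sketch (not the production implementation) of attaching an eBPF uprobe to cudaLaunchKernel in the CUDA runtime of an already-running process, counting kernel launches without touching the workload. It assumes the bcc toolkit and root privileges; the library path and target PID are illustrative.

```python
from time import sleep
from bcc import BPF

PROG = r"""
BPF_HASH(launches, u32, u64);

int on_cuda_launch(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    launches.increment(pid);   // count one kernel launch for this process
    return 0;
}
"""

# Illustrative values: in a real cluster the library path and PID would be
# resolved from the target container's mount and PID namespaces.
TARGET_PID = 12345
b = BPF(text=PROG)
b.attach_uprobe(name="/usr/local/cuda/lib64/libcudart.so",
                sym="cudaLaunchKernel",
                fn_name="on_cuda_launch",
                pid=TARGET_PID)

sleep(10)  # sampling window; detaching afterwards leaves the workload untouched
for pid, count in b["launches"].items():
    print(f"pid {pid.value}: {count.value} kernel launches in 10s")
```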

Strengthening Resilience and Cost-Effectiveness in LLM Training Through Tackling Disruptions

In large language model (LLM) training, time and computational costs are high, making resilience crucial. Fault recovery relies on frequent checkpoints, but traditional methods face a conflict between high time and expense costs and the risk of losing results when checkpoint frequency is reduced. Preemptible resources offer cost advantages but carry the risk of reclamation, and inefficient resource switching limits cost optimization.
To tackle these challenges, this talk addresses both training interruptions and resource supply disruptions. We will explore the elastic fault tolerance and recovery mechanisms in LLM training and how to enhance the flexibility of resource switching. Key points include:
1. Efficient Fault Recovery: Ensures rapid recovery of training tasks when faults or resource interruptions occur (a checkpoint-and-resume sketch follows this list).
2. Elastic Architecture: Reduces interruptions via dynamic resource adjustments and seamless transitions.
3. Cost Optimization: Flexibly substitutes cost-effective resources based on resource supply conditions.
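
As background for point 1, here is a minimal checkpoint-and-resume sketch of the pattern such recovery mechanisms build on; the model, optimizer, interval, and shared-storage path are placeholders, not the talk's actual implementation.

```python
import os
import torch

CKPT = "/mnt/shared/ckpt.pt"  # hypothetical shared-storage path

def save_checkpoint(step, model, optimizer):
    # Write to a temp file and rename atomically, so a node reclaimed
    # mid-write never leaves a torn checkpoint behind.
    tmp = CKPT + ".tmp"
    torch.save({"step": step,
                "model": model.state_dict(),
                "optim": optimizer.state_dict()}, tmp)
    os.replace(tmp, CKPT)

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT):
        return 0                      # fresh start
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"] + 1          # resume after the last completed step

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start = load_checkpoint(model, optimizer)

for step in range(start, 1000):
    loss = model(torch.randn(8, 16)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:  # the interval trades checkpoint cost against lost work
        save_checkpoint(step, model, optimizer)
```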

Profiling with eBPF and Dynamic Process Injection: Optimize AI Workload Performance

As the scale of AI workloads rapidly increases, traditional GPU monitoring solutions provide only limited observability, and effective full-stack analysis and optimization of AI workload performance remains a significant challenge.

In this presentation, we will introduce how to leverage eBPF technology and dynamic process injection to implement a non-intrusive, low-overhead online profiling solution for AI workloads. We will demonstrate how this profiling mechanism captures data from AI frameworks, GPU kernel functions, GPU library calls, system calls, and CPU contexts, presenting it as a visual timeline. Based on these visualizations, we can conduct fine-grained analyses of bottlenecks and potential issues, such as slow training iterations, GPU data transfer delays, NCCL hangs, and CPU overloads. Additionally, we will share practical results from applying this profiling solution to AI workloads to enhance training and inference performance.
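
To show what such a timeline can look like on disk, here is an illustrative sketch that writes events in the Chrome trace event format, which chrome://tracing and Perfetto render as stacked lanes on a single time axis. The events below are fabricated for demonstration; a real collector would feed eBPF and CUDA samples into the same structure.

```python
import json

# ts/dur are in microseconds; pid/tid group events into rows in the viewer.
events = [
    {"name": "cudaMemcpyAsync", "cat": "cuda", "ph": "X",
     "ts": 0, "dur": 450, "pid": 1, "tid": 1},
    {"name": "gemm_kernel", "cat": "gpu", "ph": "X",
     "ts": 460, "dur": 1200, "pid": 1, "tid": 2},
    {"name": "read() syscall", "cat": "os", "ph": "X",
     "ts": 100, "dur": 80, "pid": 1, "tid": 3},
]

with open("timeline.json", "w") as f:
    json.dump({"traceEvents": events}, f)
# Opening timeline.json in chrome://tracing shows the per-layer lanes aligned
# on one axis, which is what makes cross-stack gaps and overlaps visible.
```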

Maximizing ML Efficiency: Advanced Scheduling Strategies and Elastic Training

Nowadays, AI resource costs are rising, making it challenging to reduce overall expenses and improve resource utilization in AI workload clusters. Kubernetes and Job-Supervisor offer advanced scheduling strategies that can help address this issue. In clusters with diverse resource types, a ResourcePolicy can prioritize resources for AI workloads, enhancing control over scheduling. For stateful tasks, we provide robustness against disruptions like task preemption or GPU failures by notifying AI workloads in advance, allowing them to save checkpoints and prevent data loss. We also offer ElasticQuota capabilities for tenants to manage resource usage and preemption more finely. For greater flexibility and robustness, combining these strategies with elastic training capabilities minimizes application framework intrusion, enabling seamless switching of resource usage and achieving higher resource utilization. We will present a best practice aimed at enhancing cluster resource efficiency.
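
As one concrete picture of the advance-notification idea (a generic sketch, not the Job-Supervisor API): Kubernetes delivers SIGTERM when a pod is evicted and waits terminationGracePeriodSeconds before sending SIGKILL, so a workload-side handler can flush a checkpoint inside that window.

```python
import signal
import time

stopping = False

def on_sigterm(signum, frame):
    global stopping
    stopping = True   # ask the training loop to stop at an iteration boundary

signal.signal(signal.SIGTERM, on_sigterm)

step = 0
while not stopping and step < 1_000_000:
    time.sleep(0.01)  # stands in for one training iteration
    step += 1

print(f"stopping at step {step}; flushing checkpoint before eviction")
# save_checkpoint(step, model, optimizer)  # hypothetical; see the earlier sketch
```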

Efficient online profiling of AI workloads on Kubernetes based on eBPF and dynamic process injection

Nowadays, most AI workloads run in Kubernetes, and effectively analyzing and optimizing the performance of large-scale AI workloads remains a significant challenge. We utilize eBPF technology and dynamic process injection techniques to monitor the Pods running AI workloads. This approach enables online AI profiling that is transparent, non-intrusive, on-demand, and low-overhead for business Pods. We capture data from various aspects such as AI frameworks, GPU kernel functions, GPU library calls, system calls, and CPU utilization, ultimately summarizing and presenting it as a rendered timeline. This allows for fine-grained analysis of bottlenecks and potential issues, such as slow training, GPU data transfer delays, or NCCL hangs. Additionally, we support profiling for various types of GPU resources to ensure versatility. The online AI profiling solution ultimately determines the optimization direction for GPUs and AI tasks within the Kubernetes cluster.

Online Profiling and Analysing GPU and AI jobs with eBPF and Kubernetes

Nowadays, most AI workloads run in Kubernetes. While mature GPU scheduling and monitoring solutions exist, it is still a challenge to analyze and optimize the performance of AI workloads efficiently. At Alibaba Cloud, we use eBPF technology to trace running AI pods and construct a unified timeline across multiple metric dimensions, including CPU calls, OS system calls, GPU driver library calls, CUDA calls, and PyTorch and Python method invocations, achieving comprehensive real-time observation of AI tasks. We utilize a CUDA interception technique to integrate GPU kernel execution details into the overall timeline. A profiling pod can be attached to an AI job on demand without intrusion. As a result, it can perform fine-grained analyses of bottlenecks and potential issues like slow training iterations, GPU data transfer delays, NCCL hangs, or CPU overload. The online profiling solution ultimately identifies optimization directions for GPU and AI tasks in the Kubernetes cluster.
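
The interception shim itself lives in a preloaded native library, but the outcome it enables (GPU kernel details merged into one timeline with CPU activity) can be sketched with PyTorch's built-in profiler. The snippet below is an illustrative stand-in rather than the interception code, and assumes a CUDA-capable GPU.

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# Record CPU ops and CUDA kernel executions in a single session.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        y = model(x)
    torch.cuda.synchronize()

# Export one merged CPU+GPU timeline, viewable in chrome://tracing or Perfetto.
prof.export_chrome_trace("ai_job_timeline.json")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
```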

KCD Hangzhou + OpenInfra Days China 2025

November 2025 Hangzhou, China
