Speaker

Peter Pan

Cloud-Native Developer, Open Source Enthusiast

DaoCloud, R&D Engineering Lead

Shanghai, China


- DaoCloud Software Engineering VP
- Regular KubeCon Program Committee member: 2023 EU, 2024 HK, 2024 India, 2025 EU
- Regular KubeCon speaker: 2023 SH, 2024 EU, 2024 HK
- CNCF AI Working Group (wg-AI) member and CNAI white paper co-author
- Maintainer of CNCF projects: cloudtty, kubean, hwameistor
- GitHub ID: panpan0000

Cloud-native believer
LLM enthusiast
Multi-time KubeCon Program Committee member
Multi-time KubeCon speaker
CNCF project maintainer
CNCF AI Working Group member

Area of Expertise

  • Information & Communications Technology

Topics

  • Kubernetes

To Conquer 10x Workloads on AI Infrastructure

With GPUs in limited supply, it is a significant challenge to satisfy growing user demand and diverse business scenarios on the existing infrastructure.

Traditional approaches to AI infrastructure resource management struggle to address all the practical concerns: 10x tenants, 10x workload scenarios, and 10x models served simultaneously.

To optimize GPU utilization and build a solid foundation for AI innovation, this talk shares the following technical insights:

1) How to implement isolation and sharing with 10x resource over-commitment across many tenants/users, using `Kueue` and `scheduler-plugins`

2) How to prioritize 10x co-located workloads competing for resources and pack them densely, using `HAMi` and `DRA`

3) Facing 10x the number of models to serve, how to strike a balance between on-demand loading and latency SLOs using `Lingo`, and how `ollama-operator` streamlines serving multiple models

Please refer to `Additional resources` for further technical details.
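The tenant isolation and over-commitment idea in point 1 can be sketched as a Kueue configuration. This is a minimal illustrative example, not material from the talk: the names `shared-gpus` and `tenant-a` and the quota numbers are hypothetical; cohort borrowing is what lets idle GPU quota be over-committed across tenants.

```yaml
# Per-tenant GPU quota with borrowing inside a cohort (illustrative values).
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-gpu
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: tenant-a
spec:
  cohort: shared-gpus          # queues in the same cohort can lend/borrow quota
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-gpu
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8        # guaranteed share for this tenant
        borrowingLimit: 8      # may borrow up to 8 more GPUs when peers are idle
```

Each tenant then submits jobs through a namespaced LocalQueue pointing at its ClusterQueue, so fairness is enforced at admission time rather than at the node.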

Breaking Boundaries: TACC as a Unified Cloud-Native Infra for AI + HPC

Large AI models are driving significant investment in GPU clusters. Yet managing these clusters is hard: Slurm-based HPC setups lack management granularity and stability, while Kubernetes poses usability challenges for AI users.

This talk introduces TACC, an AI infrastructure management solution that combines the advantages of both Kubernetes and Slurm setups. It is joint work between computer systems researchers at HKUST and leading CNCF contributors at DaoCloud.

TACC has managed a large-scale cluster at HKUST, supporting over 500 active researchers since 2020. In this talk, we share our five-year journey with TACC, covering:

* [User Experience] A seamless UI for job submission and management, supporting both container and Slurm job formats on the same backbone
* [Resource Management] Multi-tenant allocation with configurable strategies, using CNCF HAMi and Kueue
* [Performance and Scalability] A robust distributed infrastructure with networked storage and RDMA, via CNCF Spiderpool, Fluid, and more
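As an illustration of the HAMi-based multi-tenant allocation above, a pod can request a fraction of a GPU via HAMi's extended resource names. This is a hedged sketch: the pod name, image, and exact limits are hypothetical, and the resource names (`nvidia.com/gpumem` in MB, `nvidia.com/gpucores` in percent) reflect HAMi's defaults, which are configurable per deployment.

```yaml
# Fractional GPU sharing: one virtual GPU, capped in memory and compute.
apiVersion: v1
kind: Pod
metadata:
  name: notebook
spec:
  containers:
  - name: jupyter
    image: jupyter/base-notebook
    resources:
      limits:
        nvidia.com/gpu: 1        # one virtual GPU slice
        nvidia.com/gpumem: 4096  # cap GPU memory at ~4 GiB (MB units)
        nvidia.com/gpucores: 30  # cap at ~30% of the GPU's compute
```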

Locking the Monster: Strategies to Isolate Resource Big Eaters

Kubernetes containers on the same node may compete for crucial resources such as CPU, memory, network, disk, kernel parameters, and GPUs.

We are not defenseless: Kubernetes QoS, quotas, and GC mechanisms can contain most potential problems.
But in other cases, pods may break through container isolation walls (consciously or unconsciously), becoming disruptive neighbors and causing performance degradation or even node failures. Examples include pods eating up shared kernel resources (pids, fs.inotify), network resources (tcp_max_tw_buckets), overconsumption, etc.

When it comes to AI/LLM workloads, GPU contention is another major issue, as is heavy pod I/O pressure (gradient aggregation, checkpoint saving, dataset loading).

This talk shares cases of resource-intensive pods and resource contention, then seeks mitigation solutions to minimize the impact of disruptive neighbors, improve resource utilization, and prevent node failures.
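One concrete mitigation for the shared-kernel-resource examples above (a sketch of one technique, not the talk's full solution): the kubelet can cap per-pod PID consumption, so a fork bomb or runaway worker in one pod cannot exhaust the node's shared PID space. The limit value here is an arbitrary illustration.

```yaml
# Kubelet configuration fragment capping process counts per pod.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 4096   # each pod may create at most 4096 processes/threads
```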

Sailing Kubernetes Operations with AI Power

Introducing two solutions that leverage AI to increase Kubernetes SRE efficiency:

- 1) Introduction and demo of the open source project `k8sgpt`: run health checks on workloads with `k8sgpt` and get remediation suggestions from an LLM such as OpenAI or LocalAI.

- (a) The theory of codified knowledge
- (b) A deep dive into LLM explanations and sensitive-data protection

- 2) Translate Kubernetes operations from human-language interaction to lower the SRE barrier, letting ChatGPT be the "translator": a new kind of user interface.

- (a) With a simple prompt, turn an operation into kubectl bash commands, as in the example project `kubectl-ai`
- (b) Solve or diagnose complex problems with `autoGPT`-like concepts through multiple iterations of Kubernetes operations
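The `k8sgpt` flow in 1) can be sketched in two commands. This is a hedged example: it assumes a reachable cluster and an OpenAI API key, the model name is illustrative, and flags may change between releases.

```shell
# Register an LLM backend, then scan pods and ask the LLM to explain
# each finding with remediation suggestions.
k8sgpt auth add --backend openai --model gpt-4o
k8sgpt analyze --filter Pod --explain
```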
