Session

Breaking Boundaries: TACC as a Unified Cloud-Native Infra for AI + HPC

Large AI models are driving significant investment in GPU clusters. Yet managing these clusters is hard: Slurm-based HPC setups lack management granularity and stability, while Kubernetes poses usability challenges for AI users.

This talk introduces TACC, an AI infra management solution that combines the advantages of both K8S and Slurm setups. It is joint work by computer systems researchers at HKUST and leading CNCF contributors at DaoCloud.

TACC manages a large-scale cluster at HKUST that has supported over 500 active researchers since 2020. In this talk, we share our five-year journey with TACC, covering:

* [User Experience] A seamless UI for job submission and management, supporting both container and Slurm job formats, all on the same backbone
* [Resource Management] Multi-tenant allocation with configurable strategies, using CNCF HAMi and Kueue
* [Performance and Scalability] A robust distributed infrastructure with networked storage and RDMA, via CNCF SpiderPool, Fluid...

Peter Pan

Cloud-Native Developer, Open Source Enthusiast

Shanghai, China
