From Bare Metal to Multi-Region AI Cloud: Building and Operating a 3,000-GPU Platform at FPT

Building production-grade AI infrastructure at scale requires more than provisioning GPUs; it demands a unified, automated platform spanning bare metal, VMs, and Kubernetes under a single tenant-aware fabric. This talk presents FPT AI Factory's engineering journey in operating one of Southeast Asia's largest GPU clouds: roughly 3,000 NVIDIA H100, H200, A100, and A10 accelerators across three regions in Vietnam and Japan.

The session covers four pillars: (1) automation and multi-tenancy on InfiniBand and RoCE fabrics, with PKey isolation and topology-aware scheduling; (2) a unified VPC model that bridges bare metal, VMs, and Kubernetes, using a customized OVN as the network gateway; (3) managed Kubernetes services with the NVIDIA GPU Operator, MIG slicing, and NCCL-tuned scheduling; and (4) production observability built on Prometheus, DCGM, and custom fabric exporters. Attendees will leave with actionable patterns drawn from running this platform in production.

Sang Tran Quoc

Deputy Director of Cloud Infrastructure Service Development Center - FPT Smart Cloud

Ho Chi Minh City, Vietnam

