Why We Replaced S3FS: Lessons from Building a Better Filesystem for AI Workloads on Kubernetes

Mounting S3-compatible storage via S3FS seems like an easy way to enable POSIX-like access in Kubernetes. But in real AI/ML workloads, such as training with PyTorch or TensorFlow, we hit major issues: crashes from incomplete writes, vanished checkpoints, inconsistent metadata, and unpredictable I/O latency.

This session shares our journey from debugging S3FS failures to deploying a scalable, POSIX-compliant file system that still leverages object storage. We’ll cover:

- Benchmarks comparing S3FS and a user-space distributed FS
- I/O traces showing metadata and small-file pain points
- Key design decisions for compatibility and performance
- Kubernetes CSI and Operator integration for scale
- Lessons from running it on 1,000+ node AI training clusters

Ideal for platform engineers, MLOps practitioners, and Kubernetes architects seeking reliable, scalable storage for data-heavy workloads.

Rui Su

Open-source advocate and co-founder of JuiceFS, a cloud-native distributed file system
