Why We Replaced S3FS: Lessons from Building a Better Filesystem for AI Workloads on Kubernetes
Mounting S3-compatible storage via S3FS seems like an easy way to enable POSIX-like access in Kubernetes. But in real AI/ML workloads, such as training with PyTorch or TensorFlow, we hit major issues: crashes from incomplete writes, vanished checkpoints, inconsistent metadata, and unpredictable I/O latency.
This session shares our journey from debugging S3FS failures to deploying a scalable, POSIX-compliant file system that still leverages object storage. We’ll cover:
- Benchmarks comparing S3FS and a user-space distributed FS
- I/O traces showing metadata and small-file pain points
- Key design decisions for compatibility and performance
- Kubernetes CSI and Operator integration for scale
- Lessons from running it on 1,000+ node AI training clusters
Ideal for platform engineers, MLOps engineers, and Kubernetes architects seeking reliable, scalable storage for data-heavy workloads.
Rui Su
Open-source advocate and co-founder of JuiceFS, a cloud-native distributed file system