Changing the Engine Mid-Flight: Zero-Downtime Ceph Upgrades
"Upgrade Ceph in our telco cloud with zero downtime." That is a mandatory requirement every 1–2 years to ensure security patches, bug fixes, and continued support from vendors and the community. We have a Ceph Cluster version 18.x.x with 10–50 nodes deployed on bare-metal running 5G Core workloads (AMF, UPF,...).
This talk covers a first-hand Reef → Squid upgrade with three real failure scenarios:
- Monitor quorum loss: root-cause analysis, the recovery sequence, and how to prevent quorum degradation during rolling daemon restarts
- OSD storms triggered by rebalancing that threatened cluster stability
- Incompatible client versions silently blocking the upgrade path
Beyond failure recovery, we'll share the upgrade sequencing strategy we developed, covering pre-flight checks and daemon upgrade ordering (MGR → MON → OSD → MDS/RGW).
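As a sketch of one such pre-flight check — confirming that every daemon already runs a single release before the rolling upgrade begins — one might parse the JSON emitted by `ceph versions --format json`. The sample output below is illustrative, not taken from our cluster:

```python
import json

# Illustrative output of `ceph versions --format json` (hypothetical sample):
# each section maps a full version string to a daemon count.
sample = """
{
  "mon": {"ceph version 18.2.4 reef (stable)": 3},
  "mgr": {"ceph version 18.2.4 reef (stable)": 2},
  "osd": {"ceph version 18.2.4 reef (stable)": 48},
  "overall": {"ceph version 18.2.4 reef (stable)": 53}
}
"""

def mixed_versions(versions_json: str) -> bool:
    """Return True if more than one Ceph version is running cluster-wide."""
    data = json.loads(versions_json)
    # The "overall" section has one key per distinct version string.
    return len(data.get("overall", {})) > 1

print(mixed_versions(sample))  # False: safe to begin the rolling upgrade
```

A check like this is cheap to run before and after each daemon class is upgraded, so a stalled or partially applied upgrade shows up as a mixed-version cluster rather than a surprise later.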
Attendees leave with a reusable pre-upgrade checklist and sequencing framework for bare-metal Ceph in high-stakes environments.