Session
Managing a Large Fleet of Kafka Clusters on Heterogeneous Clouds with Safety and Efficiency
Confluent Cloud hosts thousands of our customers’ Kafka, KSQL, Connector, and Schema Registry clusters on heterogeneous clouds. Managing such a large fleet poses some key challenges: we must perform all cluster lifecycle management ourselves to reduce customer toil; new product, security, and data governance features must ship at a regular, rapid cadence; and customers require zero downtime or interruption for all operations.
In this talk, we will discuss the fleet management tools we’ve created to manage clusters safely and efficiently, the key challenges we faced, and other observations we made along the way. Some takeaways include:
- Deploying all products (Kafka, KSQL, etc.) and infrastructure (networking, k8s, etc.) as individually updatable components
- Being able to pre-define rollout plans with canary support and having a Web UI portal to trigger, observe, and operate the rollout
- Configuring rich monitoring to validate clusters during rollouts
- Carefully orchestrating maintenance at the pod level, ensuring sufficient data replication and service availability
- Emitting ongoing progress events and notifications to end users
With this rich DevOps experience, operators can work on the entire fleet with confidence and efficiency, product teams can quickly ship features without impacting customer workloads, and customers can gain insight into maintenance management.
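To make the idea of a pre-defined rollout plan with canary support more concrete, here is a minimal Python sketch. It is not Confluent's tooling: the names `Phase`, `plan_waves`, `run_rollout`, and the `upgrade` and `is_healthy` callbacks are all hypothetical, and the halt-on-first-unhealthy-wave policy is an assumption made for illustration. The plan lists cumulative fleet percentages per phase; each wave is upgraded, validated, and the rollout stops if any cluster in the wave fails validation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Phase:
    """One step of a hypothetical rollout plan."""
    name: str
    percent: int  # cumulative share of the fleet upgraded once this phase completes


@dataclass
class RolloutResult:
    upgraded: List[str] = field(default_factory=list)
    halted_at: Optional[str] = None  # phase name where the rollout stopped, if any
    failed: List[str] = field(default_factory=list)


def plan_waves(clusters: List[str], phases: List[Phase]) -> List[List[str]]:
    """Split the fleet into one wave per phase using cumulative percentages."""
    waves, done, total = [], 0, len(clusters)
    for phase in phases:
        # Integer ceiling of percent * total / 100, never moving backwards.
        target = min(total, max(done, (phase.percent * total + 99) // 100))
        waves.append(clusters[done:target])
        done = target
    return waves


def run_rollout(
    clusters: List[str],
    phases: List[Phase],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> RolloutResult:
    """Upgrade wave by wave; halt if validation fails for any cluster in a wave."""
    result = RolloutResult()
    for phase, wave in zip(phases, plan_waves(clusters, phases)):
        for cluster in wave:
            upgrade(cluster)
            result.upgraded.append(cluster)
        # Validate the whole wave before widening the rollout.
        result.failed = [c for c in wave if not is_healthy(c)]
        if result.failed:
            result.halted_at = phase.name
            break
    return result


# Example plan: one canary wave, then half the fleet, then everything.
PLAN = [Phase("canary", 10), Phase("early", 50), Phase("full", 100)]
FLEET = [f"c{i}" for i in range(10)]
```

Running `run_rollout(FLEET, PLAN, upgrade, is_healthy)` with a health check that rejects one cluster in the second wave stops the rollout there, so the remaining half of the fleet is never touched. A real system would also need rollback, retries, and the pod-level replication checks mentioned in the list above, none of which this sketch attempts.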