Towards Predictable Tail Latency in Apache Kafka on Confluent Cloud

End-to-end tail latency is the 99th percentile of the round-trip time for a message through Apache Kafka. It is critical for the many users who depend on Apache Kafka for their real-time applications. Spikes in tail latency have noticeable ripple effects, such as application retries, throttling, unavailability, and general application instability, which lead to SLO violations across the stack. Guaranteeing predictable latency, let alone predictable tail latency, is challenging because of variability in cloud infrastructure and service-level guarantees across cloud providers, and because of limited observability across the full stack. The challenge is compounded by the unpredictability of the factors that degrade tail latency.

Embarking on the journey towards predictable tail latency in Apache Kafka required a bold strategy: recreating otherwise unpredictable scenarios by simulating faults; using in-depth tooling to trace the flow of a message through a broker, including the underlying infrastructure; diving deeper into Apache Kafka’s log layer to analyze its bottlenecks; and finally making the changes necessary to achieve predictable tail latency.

Tune in to learn how we demystified the filesystem, the page cache, the JVM, and eventually Apache Kafka itself to investigate and improve the stack for more predictable tail latency.

Alok Nikhil

Staff Software Engineer, Confluent
