Debugging Data Skew in Flink - and Teaching the Pipeline to Investigate Itself

Running large-scale streaming systems in production often reveals behaviors that never appear in small tests or synthetic benchmarks. In this talk, we share lessons learned from operating high-throughput Apache Flink pipelines, where subtle distribution mechanics created unexpected performance bottlenecks. Through real production incidents, we explore how data skew can emerge from different layers of the streaming stack: operator semantics, runtime key partitioning, and upstream source distribution.

These experiences also raise a broader operational question: how do you continuously detect and investigate such issues in complex streaming systems?

In the second part of the talk, we explore a forward-looking approach to operating Flink at scale: embedding agents directly into the streaming environment to continuously analyze metrics, detect anomalies, and trigger automated investigations using LLMs and tool-based workflows. This enables the streaming platform itself to act as an intelligent observability layer, helping operators diagnose performance issues faster and reason about system behavior as workloads evolve.

Together, these perspectives combine practical production lessons with a glimpse into the future of AI-assisted operations for large-scale streaming systems.

Devora Roth Goldshmidt

Senior Software Architect, NICE Actimize

Ramat HaSharon, Israel

