Session
From Topics to Tables: Cataloging Streaming Data in the Iceberg Ecosystem
Open table formats like Apache Iceberg have become the backbone of the lakehouse, but streaming data still arrives with fragmented metadata and governance: topics live in messaging systems, schemas live in registries, and tables live in catalogs—often stitched together by connectors and convention.
In this talk, we propose a Streaming Data Catalog pattern for the Iceberg ecosystem: a metadata layer that treats a continuous stream as a first-class catalog object—on par with tables—while maintaining an explicit stream↔table linkage (typically 1:1) so operational producers and analytical engines share a single, governed representation of the same data. We’ll cover the core metadata model (catalog / namespace / stream), the streaming-specific metadata that tables alone don’t capture (retention, offset-to-file mapping, consumer state, schema evolution contracts), and how the catalog becomes the “brain” that keeps protocol gateways stateless and interoperable.
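As a rough illustration of the metadata model described above, the sketch below shows what a catalog entry for a stream might hold. It is not the speaker's design or any existing Iceberg API: the class names (`StreamId`, `StreamEntry`), every field, and the helper methods are hypothetical, chosen only to make the catalog / namespace / stream hierarchy, the 1:1 stream-to-table link, and the streaming-specific metadata concrete.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class StreamId:
    """Hypothetical three-level identifier: catalog / namespace / stream."""
    catalog: str
    namespace: str
    stream: str


@dataclass
class StreamEntry:
    """Hypothetical catalog entry pairing one stream with one table."""
    ident: StreamId
    table: str                               # linked table identifier (1:1 assumed)
    retention_ms: Optional[int] = None       # None means "retain forever" in this sketch
    schema_compatibility: str = "BACKWARD"   # assumed schema evolution contract label
    # (first_offset, data_file) pairs, kept sorted by first_offset
    offset_index: List[Tuple[int, str]] = field(default_factory=list)
    # consumer group name -> last committed offset
    consumer_offsets: Dict[str, int] = field(default_factory=dict)

    def record_file(self, first_offset: int, data_file: str) -> None:
        """Register a data file whose first record has the given offset."""
        if self.offset_index and first_offset <= self.offset_index[-1][0]:
            raise ValueError("files must be registered in increasing offset order")
        self.offset_index.append((first_offset, data_file))

    def file_for_offset(self, offset: int) -> Optional[str]:
        """Return the file whose starting offset is the largest one <= offset.

        End offsets are not tracked here, so an offset past the last
        written record still resolves to the last registered file.
        """
        starts = [start for start, _ in self.offset_index]
        i = bisect_right(starts, offset) - 1
        return self.offset_index[i][1] if i >= 0 else None

    def commit(self, group: str, offset: int) -> None:
        """Store consumer state in the catalog rather than in a gateway."""
        self.consumer_offsets[group] = offset


entry = StreamEntry(
    ident=StreamId("prod", "orders", "order_events"),
    table="prod.orders.order_events",
    retention_ms=7 * 24 * 60 * 60 * 1000,
)
entry.record_file(0, "data/00000.parquet")
entry.record_file(100, "data/00001.parquet")
entry.commit("analytics", 42)
```

Because the offset-to-file mapping and consumer positions live in the entry rather than in the process serving a protocol, any gateway instance could resolve `entry.file_for_offset(42)` or resume a consumer group without holding local state, which is the property the abstract attributes to the catalog.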
We’ll also discuss practical interoperability: federating with existing schema registries, synchronizing with external metastores/data catalogs, and enabling multi-protocol ingestion (including lightweight HTTP/gRPC ingestion when deploying a full broker stack isn’t desirable). The goal is to spark a community conversation on standardizing streaming semantics around Iceberg tables—without reinventing the lakehouse.
David Kjerrumgaard
Committer on the Apache Pulsar Project | Published Author | International Speaker | Big Data Expert
Las Vegas, Nevada, United States