Session

Optimizing Analytic Workloads in Apple with Iceberg and Storage Partition Join

Apple has migrated large amounts of data from traditional big data warehouses into Apache Iceberg tables. Some of these tables grow continuously, with data added through streaming ingest as well as large snapshot jobs that execute row-level operations across an entire table. This session explores Iceberg optimizations, in particular the pivotal role of Apache Spark Storage-Partition Joins (SPJ) on Iceberg tables in our migrations. We will take a deep dive into how Spark SPJ can completely eliminate shuffle operations, which is essential to running our most resource-intensive jobs. We will explain the many Spark SPJ enhancements for Iceberg developed by Apple, covering when and how to enable them for different use cases. Finally, we will discuss our results and the areas where SPJ can be further improved in the community.

Szehon Ho

Software Engineer at Databricks
