Szehon Ho

Software Engineer at Databricks

Szehon is a software engineer at Databricks, where he focuses on Apache Spark. Previously, he worked on Apache Iceberg at Apple and on Apache Hive at Criteo and Cloudera. He is a committer and PMC member in the Apache Hive and Apache Iceberg projects.

Iceberg Geo Type: Transforming Geospatial Data Management at Scale

The Apache Iceberg community is introducing native geospatial type support, addressing key challenges in managing geospatial data at scale, including fragmented geospatial formats and inefficiencies in storing large spatial datasets. This talk will delve into the origins of the Iceberg geo type, its specification design, and its future goals. We will examine the impact on both communities: bringing standardized data sharing, transactions, time travel, and schema evolution to the geospatial community, and enabling optimized geospatial analytics and storage for Iceberg users. We will also present a live demonstration of the Iceberg geo data type with Apache Sedona and Apache Spark, showcasing how it simplifies and accelerates geospatial analytics workflows and queries. Finally, we will take an in-depth look at its current capabilities, outline the roadmap for future developments, and offer a perspective on its role in advancing geospatial data management in the industry.

Apache Iceberg's Best Secret: A Guide to Metadata Tables

Apache Iceberg’s rich metadata is its secret sauce, powering core features like time travel, query optimizations, and optimistic concurrency handling. But did you know that this metadata is accessible to all, via easy-to-use system tables? In this talk, we will walk through real-life examples of using metadata tables to get even more out of Iceberg. What is the last partition updated, and when? Why are there too many small files? What Iceberg maintenance procedures can give us better query performance? We can even start building more advanced systems, such as data audit and data quality. How many null values are being added per hour? What is the latency of data ingest over time? We will also cover metadata table performance tips and tricks, and ongoing improvements in the community. Whether you are already using Iceberg metadata tables or interested in getting started, attend this talk to learn how this under-utilized feature can help manage data tables more effectively than ever before.

Optimizing Analytic Workloads in Apple with Iceberg and Storage Partition Join

Apple has migrated large amounts of data stored in traditional big data warehouses into Apache Iceberg tables. Some of these tables grow continuously, with data added via streaming ingest as well as large snapshot jobs that execute row-level operations across an entire table. This session explores Iceberg optimizations and, in particular, the pivotal use of Apache Spark Storage-Partition Joins (SPJ) on Iceberg tables in our migrations. We will take a deep dive into how Spark SPJ can completely eliminate shuffle operations, which was essential to running our most resource-intensive jobs. We will explain the many Spark SPJ enhancements for Iceberg developed by Apple, going over when and how to enable them for different use cases. Finally, we will discuss our results and areas of enhancement for SPJ in the community.
