Speaker

Himadri Pal

Principal Software Engineer - Data and AI at Apple

Cupertino, California, United States

I'm a software engineer and architect with 20 years of experience in Data and AI. I'm also a contributor to Apache Iceberg, Apache Spark, Apache Arrow, Apache DataFusion, and Comet.

Area of Expertise

  • Information & Communications Technology

Topics

  • Big Data Machine Learning AI and Analytics
  • Big Data Compute
  • Big Data Analytics
  • Data Engineering
  • Data Science & AI
  • Data Warehousing

Anomaly detection at Apple for large-scale data using Apache Spark and Flink

Anomaly detection in time series data is crucial for identifying unusual patterns and trends, enabling better alerting and action when data deviates from normal. Most anomaly detection algorithms perform adequately on a single-node machine with public datasets, but do not scale well with the distributed processing frameworks used in modern big data environments. This talk will focus on how we scaled anomaly detection for large-scale datasets using Apache Spark and Flink for both batch and near real-time use cases. We will also discuss how we leveraged Apache Spark to parallelize and scale common anomaly detection algorithms, enabling support for large-scale data processing. We will highlight some of the challenges we faced and how we resolved them to make the framework useful for massive datasets with varying degrees of anomalies. Finally, we will demonstrate how our anomaly detection framework works in batch mode for petabytes of data and in streaming mode for hundreds of thousands of transactions per second.
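
Below is a minimal sketch of the general pattern the abstract describes: partition a large time-series dataset by a series key and run a single-node detector on each group in parallel with Spark. The column names, paths, window size, and the rolling z-score detector are illustrative assumptions, not the actual framework presented in this talk.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("anomaly-detection-sketch").getOrCreate()

def detect_anomalies(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs on a single executor for one series; any single-node algorithm
    # could be plugged in here in place of the rolling z-score.
    pdf = pdf.sort_values("event_time")
    rolling = pdf["value"].rolling(window=60, min_periods=10)
    z = (pdf["value"] - rolling.mean()) / rolling.std()
    pdf["is_anomaly"] = z.abs() > 3.0
    return pdf

metrics = spark.read.parquet("s3://bucket/metrics")  # hypothetical input path

result = (
    metrics
    .groupBy("series_id")  # one group per time series, processed in parallel
    .applyInPandas(
        detect_anomalies,
        schema="series_id string, event_time timestamp, "
               "value double, is_anomaly boolean",
    )
)
result.write.mode("overwrite").parquet("s3://bucket/anomalies")  # hypothetical output path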

Iceberg and Storage Partition Join

This session explores Iceberg optimizations and, in particular, the pivotal use of Apache Spark Storage-Partition Joins (SPJ) on Iceberg tables. We will take a deep dive into how Spark SPJ can completely eliminate shuffle operations, which is essential to running our most resource-intensive jobs. We will explain the many Spark SPJ enhancements for Iceberg developed by Apple, going over when and how to enable them for different use cases. Finally, we will discuss our results and areas of enhancement for SPJ in the community.
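
As background for the session, here is a minimal sketch of how storage-partitioned joins are typically enabled in open-source Spark (3.4 or later) over Iceberg tables. The exact configuration names and defaults vary by Spark version, and the Apple-specific SPJ enhancements mentioned above are not reflected here.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spj-sketch")
    # Core switch for storage-partitioned joins over DataSource V2 sources such as Iceberg.
    .config("spark.sql.sources.v2.bucketing.enabled", "true")
    # Allow SPJ even when partition values on the two sides do not line up exactly.
    .config("spark.sql.sources.v2.bucketing.pushPartValues.enabled", "true")
    # Do not require every clustering key to appear in the join condition.
    .config("spark.sql.requireAllClusterKeysForCoPartition", "false")
    # Partially cluster the smaller side to mitigate skew.
    .config("spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled", "true")
    .getOrCreate()
)

# With both Iceberg tables partitioned by a compatible transform on the join key
# (e.g. bucket(512, customer_id)), this join can avoid a shuffle entirely.
orders = spark.table("catalog.db.orders")        # hypothetical table names
customers = spark.table("catalog.db.customers")
joined = orders.join(customers, "customer_id")
joined.explain()  # check the plan for the absence of Exchange (shuffle) nodes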

Optimizing Analytic Workloads in Apple with Iceberg and Storage Partition Join

Apple has migrated large amounts of data stored in traditional big data warehouses into Apache Iceberg tables. Some of these tables grow continuously, with data added via streaming ingest as well as large snapshot jobs that execute row-level operations across an entire table. This session explores Iceberg optimizations and, in particular, the pivotal use of Apache Spark Storage-Partition Joins (SPJ) on Iceberg tables in our migrations. We will take a deep dive into how Spark SPJ can completely eliminate shuffle operations, which was essential to running our most resource-intensive jobs. We will explain the many Spark SPJ enhancements for Iceberg developed by Apple, going over when and how to enable them for different use cases. Finally, we will discuss our results and areas of enhancement for SPJ in the community.
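
To make the shuffle-free row-level operations mentioned above concrete, here is a hypothetical sketch in which the source and target Iceberg tables share the same bucket transform on the join key, so a MERGE can be planned as a storage-partitioned join under the settings shown in the previous sketch. The table names, schemas, and bucket count are invented for illustration.

from pyspark.sql import SparkSession

# Assumes a session configured with the SPJ settings from the previous sketch.
spark = SparkSession.builder.getOrCreate()

# Target and source tables use the same Iceberg partition transform on the join key.
spark.sql("""
    CREATE TABLE IF NOT EXISTS catalog.db.events (
        event_id BIGINT, customer_id BIGINT, payload STRING, event_time TIMESTAMP)
    USING iceberg
    PARTITIONED BY (bucket(512, customer_id))
""")
spark.sql("""
    CREATE TABLE IF NOT EXISTS catalog.db.event_updates (
        event_id BIGINT, customer_id BIGINT, payload STRING, event_time TIMESTAMP)
    USING iceberg
    PARTITIONED BY (bucket(512, customer_id))
""")

# A row-level operation across the entire target table; with compatible bucket
# transforms and SPJ enabled, matching buckets can be co-located instead of
# shuffling both sides of the join.
spark.sql("""
    MERGE INTO catalog.db.events t
    USING catalog.db.event_updates s
    ON t.customer_id = s.customer_id AND t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")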
