Speaker

Andrei Ionescu

Senior Software Engineer, Adobe

Bucharest, Romania

Andrei Ionescu is a Senior Software Engineer at Adobe on the Adobe Experience Platform Data Lake team, specialised in Big Data and distributed systems with Scala, Java, Spark, and Kafka.

At Adobe he mainly contributes to Ingestion and Data Lake projects, while in open source he contributes to Delta Lake, Hyperspace, Apache Iceberg and, more recently, Polars.

Area of Expertise

  • Information & Communications Technology
  • Media & Information
  • Region & Country

Topics

  • Big Data
  • Open Source Software
  • Apache Spark
  • Apache Iceberg
  • Data Lake
  • Apache Kafka
  • Data lake architecture
  • Data ingestion and ETL/ELT on a data lake
  • Analytics and Big Data
  • Open Data

Tracking and Triggering Pattern with Spark Stateful Streaming

Inside Adobe Experience Platform we repeatedly found the need to track actions happening at the control-plane level and act upon them at lower levels, such as the Data Lake and ingestion processes. Using Apache Spark Stateful Streaming, we built services that, based on rules and conditions, start processes such as compacting, consolidating, and cleaning data at the proper time, minimising processing time while keeping everything within the defined SLAs. The cost of operation is minimal: the services require almost no attention, are reliable, offer exactly-once execution through Spark Stateful Streaming, auto-scale by the way the pattern is architected, and remain highly resilient to failures of downstream dependencies. This talk presents a pattern that we have been running in production inside Adobe Experience Platform for the last 2-3 years, across multiple services on high-throughput ingestion flows, with no high-severity on-call interventions and minimal-to-no operational costs.
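The core of the tracking-and-triggering pattern can be sketched as the state-update function one would pass to Spark Stateful Streaming (e.g. `mapGroupsWithState`). The sketch below models it as a pure Scala function so the rule logic is visible in isolation; the types, thresholds, and rules (`ControlEvent`, 1 GiB, 1 hour SLA) are illustrative assumptions, not the actual Adobe Experience Platform implementation.

```scala
// A control-plane event tracked per dataset.
case class ControlEvent(datasetId: String, bytesWritten: Long, timestampMs: Long)

// Accumulated state for one dataset between triggers.
case class DatasetState(pendingBytes: Long, firstEventMs: Long)

sealed trait Action
case object NoOp extends Action
case class TriggerCompaction(datasetId: String) extends Action

object TrackAndTrigger {
  // Example rules: trigger once 1 GiB is pending, or once the oldest
  // un-compacted event exceeds the SLA window.
  val BytesThreshold: Long = 1L << 30
  val SlaMs: Long = 60 * 60 * 1000 // 1 hour

  // Pure state-update step: fold the event into the state, then decide
  // whether to fire an action and reset, or keep accumulating.
  def update(state: Option[DatasetState], event: ControlEvent): (Option[DatasetState], Action) = {
    val s = state match {
      case Some(prev) => DatasetState(prev.pendingBytes + event.bytesWritten, prev.firstEventMs)
      case None       => DatasetState(event.bytesWritten, event.timestampMs)
    }
    val overdue = event.timestampMs - s.firstEventMs >= SlaMs
    if (s.pendingBytes >= BytesThreshold || overdue)
      (None, TriggerCompaction(event.datasetId)) // reset state after triggering
    else
      (Some(s), NoOp)
  }
}
```

Keeping the decision logic pure like this is what makes the exactly-once and auto-scaling properties fall out of the streaming framework: Spark owns the state store and checkpointing, while the function itself stays trivially testable.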

Covering Indexes in the Data Lake with Hyperspace

At Adobe, we use the Iceberg table format inside the Adobe Experience Platform Data Lake. Although Iceberg offers file skipping when the data is properly laid out and queried, in many cases this is not enough, and queries over the data take a long time to complete. Just as high query latency in an RDBMS can be alleviated with additional indexing at the cost of some extra storage, the same pattern can be applied in data lakes. Hyperspace is an early-phase indexing subsystem for Apache Spark that lets users build indexes on their data, and together with Iceberg it can bring major improvements in query response time – up to 25 times faster in some cases. Hyperspace accommodates our two major data-flow use cases – stale datasets and fast-changing datasets – and ensures consistency when used.
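What a covering index buys you can be illustrated independently of Hyperspace's actual API: the index stores the indexed columns plus a set of "included" columns, so a query that touches only those columns is answered from the (small) index instead of scanning the (large) base table. The sketch below is a toy model of that indexed/included split; all names and data are hypothetical.

```scala
// A base-table row; `payload` stands in for the wide columns a scan would drag along.
case class Row(userId: String, country: String, revenue: Double, payload: String)

// Covering index keyed on `country`, including `userId` and `revenue`:
// queries needing only these columns never touch base rows.
case class CoveringIndex(entries: Map[String, Seq[(String, Double)]]) {
  def lookup(country: String): Seq[(String, Double)] =
    entries.getOrElse(country, Seq.empty)
}

object CoveringIndex {
  // Build the index by projecting each row down to the included columns,
  // grouped by the indexed key.
  def build(rows: Seq[Row]): CoveringIndex =
    CoveringIndex(rows.groupBy(_.country).map { case (c, rs) =>
      c -> rs.map(r => (r.userId, r.revenue))
    })
}
```

In Hyperspace proper, the analogous split is expressed when creating the index (indexed columns vs. included columns), and the optimizer rewrites qualifying Spark plans to read the index data instead of the base files.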

High-Frequency Small Files vs. Slow-Moving Datasets

Before implementing Apache Iceberg, we had a small-file problem at Adobe. In Adobe Experience Platform's (AEP) data lake, one of our internal solutions replicated small files to us at a very high frequency of 50K files per day for a single dataset. A streaming service we called Valve processed those requests in parallel, writing them to our data lake and asynchronously triggering a compaction process. This worked for some time but had two major drawbacks. First, if the compaction process was unable to keep up, queries on the data lake would suffer due to expensive file listings. Second, with our journey with Iceberg underway, we quickly realized that creating thousands of snapshots per day for a single dataset would not scale. We needed an upstream solution to consolidate data prior to writing to Iceberg.

In response, a service called Flux was created to solve the problem of small files being pushed into a slow-moving tabular dataset (aka Iceberg v1). In this presentation, we will review the design of Flux, its place in AEP's data lake, the challenges we faced in operationalizing it, and the final results.
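The upstream consolidation idea behind a service like Flux can be sketched as a per-dataset buffer that accumulates small incoming files and flushes them as one batch, so the table format sees one commit (one Iceberg snapshot) instead of thousands. The sketch below is a hypothetical simplification; the class name, thresholds, and flush policy are illustrative, not the real Flux design.

```scala
// An incoming small file to be consolidated before the Iceberg write.
case class SmallFile(path: String, sizeBytes: Long)

// Buffers small files and releases them as a single batch once either a
// file-count or byte-size threshold is reached; a real service would also
// flush on a time trigger to bound latency under the SLA.
class Consolidator(maxFiles: Int, maxBytes: Long) {
  private var buffer = Vector.empty[SmallFile]
  private var bytes = 0L

  // Returns Some(batch) when the buffer reaches a flush threshold,
  // None while it is still accumulating.
  def offer(f: SmallFile): Option[Seq[SmallFile]] = {
    buffer :+= f
    bytes += f.sizeBytes
    if (buffer.size >= maxFiles || bytes >= maxBytes) {
      val batch = buffer
      buffer = Vector.empty
      bytes = 0L
      Some(batch)
    } else None
  }
}
```

Each flushed batch would then be written as a single Iceberg commit, keeping the snapshot count proportional to batches rather than to the raw 50K files per day.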
