Covering Indexes in the Data Lake with Hyperspace

At Adobe, we use the Iceberg table format inside the Adobe Experience Platform Data Lake. Although Iceberg offers the capability for file skipping, when the data is properly laid out and used, in many cases this is not enough, and the queries executed over the data take a long time to complete. Similar to an RDBMS use case where high latency on queries can be alleviated with additional indexing at the cost of some extra storage, in data lakes the same pattern can be used. Hyperspace is an early phase indexing subsystem for Apache Spark that introduces the ability for users to build indexes on data, and together with Iceberg it can bring major improvements in query response time – up to 25 times faster in some cases. Hyperspace accommodates our two major data flow use cases – stale datasets and fast-changing datasets – and assures consistency when used.

Andrei Ionescu

Senior Software Engineer, Adobe

Bucharest, Romania

Actions

View Speaker Profile

Please note that Sessionize is not responsible for the accuracy or validity of the data provided by speakers. If you suspect this profile to be fake or spam, please let us know.

Session

Covering Indexes in the Data Lake with Hyperspace

Andrei Ionescu

Links

Actions