Ismaël Mejía

Senior Cloud Advocate

Nantes, France

Ismaël Mejía is a Senior Cloud Advocate at Microsoft working on the Azure Data team. He has more than a decade of experience architecting systems for startups and financial companies. He has worked on distributed data-intensive frameworks and is an active contributor to Apache Beam (Google Dataflow SDK) and Apache Avro, among many other open-source projects. He is also a member of the Apache Software Foundation.

Area of Expertise

Information & Communications Technology

Topics

Big Data
Data Integration
Scalability
Distributed Systems
Open Source Software
Software Engineering
Cloud Computing
Azure
AI
Agents

Apache XTable (incubating): Interoperability among Apache Iceberg, Hudi & Delta Lake

Apache Iceberg, Hudi & Delta Lake are leading open-source projects providing decoupled storage with advanced transaction & metadata layers, known as table formats, in cloud storage. When data is stored in a distributed file system, these formats offer similar features: a table abstraction over files that includes schema, commit history, partitions & column statistics. Choosing a table format is a critical decision for engineers & orgs, given the distinct features of each project tailored to different use cases.

Enter Apache XTable, a newly incubated Apache project, designed for omni-directional interoperability among table formats. XTable doesn't introduce a new format but enables translation of table format metadata, allowing data to be written in any chosen format & converted to target formats consumable by various compute engines. This talk will demonstrate how XTable simplifies selecting table formats & addresses the increasing need for interoperability in lakehouse architectures.

TPC-DS and Apache Beam - the time has come!

TPC-DS is the de facto SQL-based benchmark framework used to measure database systems and Big Data processing frameworks. Beam introduced an early TPC-DS implementation last year, but so far we have not used it to measure Beam's performance.

In this talk we will introduce TPC-DS and how it works in general. We will present the different ways of running the TPC-DS benchmark on Beam, via Beam SQL and the “classical” Beam Java SDK, the issues that we have found trying to run TPC-DS on Beam, and the current status of the project.

Also, we are going to discuss some issues related to Beam SQL, several performance optimisations, the challenges of fair benchmarking on distributed processing systems and how we expect to integrate TPC-DS with Beam’s CI tests to track regressions and improvements in the future.

Spark Runner (R)evolution

Apache Spark is one of the most popular analytics engines for large-scale data processing, and it supports multiple cluster types, e.g. Hadoop, Mesos and Kubernetes. For these reasons it is an important target for Apache Beam. Existing Spark users can take advantage of Beam's rich unified model and its nice APIs without requiring big infrastructure changes or sacrificing their operational knowledge.

In this talk we present the history of the Spark runner in Apache Beam. We will discuss in detail how the runner works, focusing on its current implementation, and cover some improvements and optimizations made in recent months.

This year has seen a renewed interest in the Spark runner due to two lines of work:
1. A new translation based on Spark's Structured Streaming API, which unifies Batch and Streaming in the runner code and lets users benefit from some Spark engine optimizations and the new continuous processing mode.
2. Support for different languages (e.g. Python) on the Spark runner by translating pipelines using Beam's new Portability APIs.
We will also present both efforts and encourage attendees to contribute to the ongoing Spark runner (r)evolution.

ApacheCon 2022

Chair for the Data Engineering Track.

Also presented:
* Dive into Avro: Everything a data engineer needs to know
* Growing your contributors base

October 2022
