Speaker

Ismaël Mejía

Senior Cloud Advocate

Nantes, France

Ismaël Mejía is a Senior Cloud Advocate at Microsoft working on the Azure Data team. He has more than a decade of experience architecting systems for startups and financial companies. He has worked on distributed data-intensive frameworks and is an active contributor to Apache Beam (Google Dataflow SDK), Apache Avro, and many other open-source projects. He is also a member of the Apache Software Foundation.

Area of Expertise

  • Information & Communications Technology

Topics

  • Big Data
  • Data Integration
  • Scalability
  • Distributed Systems
  • Open Source Software
  • Software Engineering
  • Cloud Computing
  • Azure

TPC-DS and Apache Beam - the time has come!

TPC-DS is the de facto SQL-based benchmark for measuring database systems and Big Data processing frameworks. Beam introduced an early TPC-DS implementation last year, but so far we have not used it to measure Beam's performance.

In this talk we will introduce TPC-DS and explain how it works in general. We will present the different ways of running the TPC-DS benchmark on Beam, via Beam SQL and via the “classical” Beam Java SDK, along with the issues we found while trying to run TPC-DS on Beam and the current status of the project.

We will also discuss some issues related to Beam SQL, several performance optimisations, the challenges of fair benchmarking on distributed processing systems, and how we expect to integrate TPC-DS with Beam’s CI tests to track regressions and improvements in the future.

Spark Runner (R)evolution

Apache Spark is one of the most popular analytics engines for large-scale data processing, and it supports multiple cluster types, e.g. Hadoop, Mesos and Kubernetes. For these reasons it is an important target for Apache Beam. Existing Spark users can take advantage of Beam's rich unified model and its nice APIs without requiring big infrastructure changes or sacrificing their operational knowledge.

In this talk we present the history of the Spark runner in Apache Beam and discuss in detail how the runner works, focusing on its current implementation and on the improvements and optimizations made in recent months.

This year has seen a renewed interest in the Spark runner due to two lines of work:
1. A new translation based on Spark's Structured Streaming API, which unifies batch and streaming in the runner code and lets users benefit from some Spark engine optimizations and the new continuous processing mode.
2. Support for different languages (e.g. Python) on the Spark runner by translating pipelines using Beam's new Portability APIs.
In this talk we will also present both efforts and encourage attendees to contribute to the ongoing Spark runner (r)evolution.

ApacheCon 2022

Chair for the Data Engineering Track.

Also presented:
* Dive into Avro: Everything a data engineer needs to know
* Growing your contributors base

October 2022
