Spark Runner (R)evolution

Apache Spark is one of the most popular analytics engines for large-scale data processing, and it supports multiple cluster managers such as Hadoop YARN, Mesos and Kubernetes. For these reasons it is an important target for Apache Beam. Existing Spark users can take advantage of Beam's rich unified model and its APIs without requiring big infrastructure changes or sacrificing their operational knowledge.

In this talk we present the history of the Spark runner in Apache Beam, discuss in detail how the runner works with a focus on its current implementation, and cover some improvements and optimizations made in recent months.

This year has seen a renewed interest in the Spark runner due to two lines of work:
1. A new translation based on Spark's Structured Streaming API, which unifies batch and streaming in the runner code and lets users benefit from some Spark engine optimizations and the new continuous processing mode.
2. Support for different languages (e.g. Python) on the Spark runner by translating pipelines using Beam's new Portability APIs.
In this talk we will also present both efforts and encourage attendees to contribute to the ongoing Spark runner (r)evolution.

Ismaël Mejía

Senior Cloud Advocate

Nantes, France
