SQL-on-Anything with distributed query engines

We continuously find new ways to generate and store more data. In the past, it was easier to separate online workloads, such as interactive database queries, from offline analytical workloads, such as Hadoop jobs that could run for minutes or hours. However, we increasingly find ourselves having to provide interactive access to large datasets, whether for research and analytics or to drive the actual application UI. Furthermore, we keep finding new places to store all this data. So how do we query data that’s spread across multiple SQL databases, Elasticsearch clusters, and S3 buckets, ideally with a nice familiar query language? This is where the family of tools known as SQL-on-Hadoop comes in.
In this talk, we’ll look at distributed query engines, using Apache Drill, Spark SQL, and Facebook’s Presto as our go-to examples. These are some of the most widely used engines in the industry today, as they provide the best available compromise between speed, convenience, and availability for interactive queries over large amounts of data. We’ll examine various use cases, trade-offs, and integration strategies for bringing together data from multiple sources. We’ll discuss how to store and manage data in columnar storage formats so that a bunch of files behaves like a database. And finally, we will dive into the architecture of various query engines, as well as their managed cloud service incarnations.

David Ostrovsky

Software Engineer at Meta

Netanya, Israel

