Speaker

Frank Munz

Dr Frank Munz - Databricks

Munich, Germany

Dr. Frank Munz works on large-scale data and AI at Databricks. He authored three computer science books, built up technical evangelism for Amazon Web Services in Germany, Austria, and Switzerland, and once upon a time worked as a data scientist with a group that won a Nobel Prize.

Frank realized his dream of speaking at top-notch conferences such as Devoxx, KubeCon, and JavaOne on every continent (except Antarctica, because it is too cold there). He holds a Ph.D. summa cum laude in Computer Science from TU Munich and enjoys skiing in the Alps, tapas in Spain, and exploring secret beaches in Southeast Asia.

Area of Expertise

  • Information & Communications Technology

Topics

  • Big Data
  • Big Data, Machine Learning, AI, and Analytics
  • Machine Learning & AI
  • Apache Kafka
  • Data Lakehouse
  • Data Engineering
  • Data Analytics
  • Cloud Computing
  • Data Science & AI
  • Software Architecture
  • Python

Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lakehouse

Unlike just a few years ago, the lakehouse architecture is today an established data platform embraced by all major cloud data companies such as AWS, Microsoft Azure, Google, Oracle, Snowflake, and Databricks.

This session kicks off with a technical, no-nonsense introduction to the lakehouse concept, dives deep into the lakehouse architecture and recaps how a data lakehouse is built from the ground up with streaming as a first-class citizen.

Then we focus on serverless for streaming use cases. Serverless concepts are well known to developers who trigger hundreds of thousands of AWS Lambda functions at negligible cost. The same concept, however, becomes even more interesting when applied to data platforms.

We have all heard about the principle "It runs best on PowerPoint", so I decided to skip slides here and bring a serverless demo instead:

A hands-on, fun, and interactive serverless streaming use case example where we ingest live events from hundreds of mobile devices (don't miss out: bring your phone and be part of it!). Based on this use case, I will critically explore how much of a modern lakehouse is serverless and how we implemented that at Databricks (spoiler alert: serverless is everywhere, from data pipelines and workflows to optimized Spark APIs and ML).

TL;DR benefits for data practitioners:

- Recap the OSS foundation of the lakehouse architecture and understand its appeal.

- Understand the benefits of leveraging a lakehouse for streaming and what's there beyond Spark Structured Streaming.

- Meat of the talk: the serverless lakehouse. I give you the tech bits beyond the hype. How does a serverless lakehouse differ from other serverless offerings?

- Live, hands-on, interactive demo exploring serverless data engineering end-to-end (see the sketch after this list). For each step we take a critical look, and I explain what it means for you, e.g., in saving costs and removing operational overhead.
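To give a feel for the kind of pipeline the demo builds on, here is a minimal PySpark Structured Streaming sketch of the Kafka-to-Delta ingestion pattern. The broker, topic, schema, and table names are placeholder assumptions, not the actual demo code:

```python
# Minimal sketch of streaming ingestion from Kafka into a Delta table.
# All names (broker, topic, paths, table) are placeholders, not the demo setup.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("serverless-demo-sketch").getOrCreate()

# Schema of the hypothetical mobile events
event_schema = StructType([
    StructField("device_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the event stream from Kafka and parse the JSON payload
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "mobile-events")                # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Continuously append the parsed events to a Delta table
query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/mobile-events")  # placeholder path
    .toTable("live_mobile_events")                                   # placeholder table
)
```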

Perspectives, Projects, and Products - Top 3 Insights after 10 Years of Lessons Learned from the Original Creators of Apache Spark

Join me for a concise yet insightful session about what's next in streaming, based on three years of experience working on products with the original creators of Apache Spark, Spark Streaming, and MLflow at Databricks as a Principal TMM. In this purely technical session, data engineers can expect to learn the following:

- What is a data lakehouse on a technical level beyond the hype, why should I care and use it for streaming, and how is streaming data in the lakehouse different?

- Apache Spark and Project Lightspeed: achieving sub-second latencies in stateless Spark Streaming and predictably low latencies with stateful queries. Learn about the current progress and understand how these advancements directly impact you.

- How many drag-and-drop tools have you seen designed for Spark streaming ETL, and why do they often lead to unmaintainable code? Learn why and how the original creator of Spark Streaming developed an SQL-based, serverless, declarative abstraction layer, drawing on a decade of his experience with Spark Streaming technology. This session provides a comprehensive technical overview and an in-depth, hands-on demo of Delta Live Tables and its SQL-based Kafka ingestion (a sketch of the equivalent Python pipeline follows this list).
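The talk demos the SQL flavor; as a rough illustration, here is a hedged sketch of a comparable declarative Kafka-ingestion pipeline using the Delta Live Tables Python API. The broker, topic, and table names are assumptions:

```python
# Hedged sketch of a declarative Kafka-ingestion pipeline with the Delta Live Tables
# Python API (the talk itself demos the SQL syntax). Broker and topic are placeholders.
# `spark` is the session provided implicitly by the DLT pipeline runtime.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events ingested from Kafka into the bronze layer")
def kafka_bronze():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
        .option("subscribe", "events")                      # placeholder topic
        .load()
    )

@dlt.table(comment="Decoded events with the Kafka payload cast to a string")
def kafka_silver():
    return (
        dlt.read_stream("kafka_bronze")
        .select(col("key").cast("string"), col("value").cast("string"), "timestamp")
    )
```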

Generative AI for Streaming Data Platforms - State of the Union

A classic data lakehouse is built on open-source table formats such as Delta.io, Iceberg, or Hudi and seamlessly integrates with big data platforms like Apache Spark and event buses like Apache Kafka. The popularity of the data lakehouse stems from its ability to combine the quality, speed, and simple SQL access of data warehouses with the cost-effectiveness, scalability, and support for unstructured data of data lakes. The success of the lakehouse OSS approach is driven by its low TCO and highlighted by its adoption by industry giants such as Amazon, Microsoft, Oracle, and Databricks.

With the advent of generative AI models and the potential of using techniques such as Retrieval-augmented generation (RAG) in combination with fine-tuning or pre-training custom LLMs, a new paradigm has emerged in 2023: AI-infused lakehouses. These platforms use generative AI for code generation, natural language queries, and semantic search, enhancing governance and automating documentation.

How do lakehouses, which are inherently capable of managing streaming data, adapt to the integration of new AI capabilities? Is AI in this context simply hype and marketing terminology, or is it a technology that – despite initial skepticism due to its catchy name (similar to terms like 'cloud computing', 'serverless', or 'lakehouses') – is already on its way to becoming widely adopted and transformative in the field?

Be surprised, join my lightning talk, and discover how AI capabilities can enhance real-time analytics and streamline ETL. Expect an interactive, hands-on, no-nonsense demonstration using Apache Kafka and the NY Taxi data set from Kaggle, concentrating on developer experience, operations, and governance.

From Zero to Hero: Sharing Huge Amounts of Streaming Data with Open Source Delta Sharing

This lightning talk is an introduction to Delta Sharing, a Linux Foundation open-source solution for sharing massive amounts of data in a cheap, secure, scalable, and *streaming* way.

Homegrown data-sharing solutions based on SFTP or APIs aren't scalable and saddle you with operational overhead. Off-the-shelf data-sharing solutions only work within specific sharing networks, promote vendor lock-in, and can be costly. Others don't support streaming data.

Delta Sharing reliably accesses data at the bandwidth of modern cloud object stores, such as S3, ADLS, or GCS.

Any client supporting pandas, Apache Spark™, or Python, as well as commercial clients such as Power BI, can connect to the sharing server. Clients always read the latest version of the data, which can also be partitioned to limit the amount of data transferred.
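To make the recipient side concrete, here is a hedged sketch using the open-source delta-sharing Python connector. The profile file name and the share/schema/table path are hypothetical values that a data provider would hand you:

```python
# Minimal sketch of a Delta Sharing recipient, using the open-source Python connector
# (pip install delta-sharing). Profile file and table path are hypothetical.
import delta_sharing

profile = "config.share"  # credentials file obtained from the data provider

# Discover the tables the provider has shared with us
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table straight into a pandas DataFrame;
# the path format is "<profile-file>#<share>.<schema>.<table>"
df = delta_sharing.load_as_pandas(f"{profile}#my_share.default.trips")
print(df.head())
```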

Learn what you need to know about data sharing in 2023 in this lightning talk.

Streaming Data into your Lakehouse

The last few years have taught us that cheap, virtually unlimited, and highly available cloud object storage alone doesn't make a solid enterprise data platform. Too many data lakes didn't live up to expectations and degenerated into sad data swamps.

With the Linux Foundation OSS project Delta Lake (https://github.com/delta-io), you can turn your data lake into the foundation of a data lakehouse that brings back ACID transactions, schema enforcement, upserts, efficient metadata handling, and time travel.
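As a small, hedged illustration of two of those capabilities, the sketch below upserts a batch of updates into a Delta table and then reads an older version back via time travel. The table path and column names are assumptions, and it presumes a Spark session with the Delta Lake extensions configured (e.g., on Databricks or via delta-spark):

```python
# Hedged sketch of two Delta Lake features mentioned above: upserts (MERGE) and time travel.
# Table path and columns are placeholders; assumes Delta Lake is configured for this session.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

path = "/tmp/delta/events"  # placeholder location of an existing Delta table
updates = spark.createDataFrame([(1, "updated"), (42, "new")], ["id", "status"])

# Upsert: update matching rows, insert the rest, all in one ACID transaction
(
    DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Time travel: read the table as it was at an earlier version
old = spark.read.format("delta").option("versionAsOf", 0).load(path)
old.show()
```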

In this session, we explore how a data lakehouse works with streaming, using Apache Kafka as an example.

This talk is for data architects who are not afraid of some code and for data engineers who love open source and cloud services.

Attendees of this talk will learn:

1. Lakehouse architecture 101, the honest tech bits

2. The data lakehouse and streaming data: what's there beyond Apache Spark™ Structured Streaming?

3. Why the lakehouse and Apache Kafka make a great couple and what concepts you should know to get them hitched with success.

4. Streaming data with declarative data pipelines: In a live demo, I will show data ingestion, cleansing, and transformation based on a simulation of the Data Donation Project (DDP, https://corona-datenspende.de/science/en) built on the lakehouse with Apache Kafka, Apache Spark™, and Delta Live Tables (a fully managed service).

DDP is a scientific IoT experiment to determine COVID outbreaks in Germany by detecting elevated heart rates correlated to infections. Half a million volunteers have already decided to donate their heart rate data from their fitness trackers.
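To give a flavor of the declarative cleansing step in that demo, here is a hedged Delta Live Tables sketch that drops implausible heart-rate readings with an expectation. The table and column names are assumptions for illustration, not the actual DDP pipeline:

```python
# Hedged sketch of declarative data cleansing with Delta Live Tables expectations.
# Source table and column names are placeholders, not the real DDP pipeline.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Heart-rate readings with implausible values dropped")
@dlt.expect_or_drop("plausible_heart_rate", "heart_rate BETWEEN 30 AND 220")
def heart_rate_clean():
    return (
        dlt.read_stream("heart_rate_raw")   # hypothetical upstream ingestion table
        .select("device_id", col("heart_rate").cast("int"), "recorded_at")
    )
```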
