Yaroslav Tkachenko

Software Engineer, Consultant, Advisor | Data Streaming Systems

Vancouver, Canada

Yaroslav Tkachenko is a Software Engineer, Consultant, and Advisor specializing in data streaming & data-intensive applications. Currently, Yaroslav is a Founder at Irontools, building tooling for data processing technologies and consulting companies in the data streaming space. Previously, Yaroslav was a tech lead at Shopify, Activision, and several startups.

Area of Expertise

Information & Communications Technology

Topics

Software Engineering
Data Engineering
Stream-processing
Software Architecture

Streaming SQL for Data Engineers: The Next Big Thing?

SQL is the lingua franca of data analysis, but should we use it more as data engineers?

Modern tools like dbt make it easier to express transformations in SQL, but streaming is more complicated than batch. Streaming pipelines usually require higher SLAs and many CI/CD and observability practices, so data engineers prefer to use familiar languages like Python, Java and Scala along with many useful frameworks and libraries. Can SQL replace that?

I was very skeptical when I first heard the idea of using SQL for writing somewhat complex stream-processing data applications a few years ago. How do you unit test it? How do you version it?

Over the years, Spark SQL streaming, Flink SQL, ksqlDB and similar tools have matured, and they now easily support complex stateful transformations. However, the developer experience is still questionable: it’s easy to write a SQL statement, but how do you maintain it over the years as a long-running application?

In this presentation, I hope to share the discoveries I made over the years in this area, as well as working practices and patterns I’ve seen.

Storing State Forever: Why It Can Be Good For Your Analytics

State is an essential part of modern streaming pipelines: it enables a variety of foundational capabilities like windowing, aggregation, enrichment, etc. But usually, the state is either transient, so we only keep it until the window is closed, or it's fairly small and doesn't grow much.

But what if we treat the state differently? The keyed state in Flink can be scaled vertically and horizontally, it's reliable and fault-tolerant... so is scaling a stateful Flink application that different from scaling any data store like Kafka or MySQL?

At Shopify, we worked on a massive analytical data pipeline that needed to support complex streaming joins and correctly handle arbitrarily late-arriving data. We came up with the idea of never clearing state and supporting joins this way. We built a successful proof of concept, ingested all historical transactional Shopify data and ended up storing more than 10 TB of Flink state. In the end, it allowed us to achieve 100% data correctness.
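
As a rough illustration of the idea of joining without ever clearing state, here is a minimal Python sketch. It is not the pipeline described above: plain dictionaries stand in for Flink keyed state, and the `ForeverJoin` class and its method names are hypothetical. It only shows why retaining every row lets an arbitrarily late record still find its match.

```python
from collections import defaultdict


class ForeverJoin:
    """Inner join of two keyed streams that never expires buffered rows.

    Hypothetical stand-in for keyed state: two in-memory dicts of lists.
    """

    def __init__(self):
        self._left = defaultdict(list)
        self._right = defaultdict(list)

    def on_left(self, key, row):
        # Remember the row forever, then emit matches seen so far on the right.
        self._left[key].append(row)
        return [(row, other) for other in self._right.get(key, [])]

    def on_right(self, key, row):
        # Symmetric: a late right-side row still joins with old left-side rows.
        self._right[key].append(row)
        return [(other, row) for other in self._left.get(key, [])]

    def state_size(self):
        # Total number of retained rows; grows without bound by design.
        return sum(len(rows) for rows in self._left.values()) + sum(
            len(rows) for rows in self._right.values()
        )
```

Because nothing is evicted, correctness no longer depends on how late a record arrives; the trade-off is that state grows with the full history, which is why scaling the state store becomes the main concern.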

It's Time To Stop Using Lambda Architecture

Lambda Architecture has been a common way to build data pipelines for a long time, despite difficulties in maintaining two complex systems. An alternative, Kappa Architecture, was proposed in 2014, but many companies are still reluctant to switch to Kappa. And there is a reason for that: even though Kappa generally provides a simpler design and similar or lower latency, there are a lot of practical challenges in areas like exactly-once delivery, late-arriving data, historical backfill and reprocessing.

In this talk, I want to show how you can solve those challenges by embracing Apache Kafka as a foundation of your data pipeline and leveraging modern stream-processing frameworks like Apache Kafka Streams and Apache Flink.

Dynamic Change Data Capture with Flink CDC and Consistent Hashing

Change Data Capture (CDC) is a popular technique for extracting data from databases in realtime. However, many CDC deployments are static: e.g. a single connector is configured to ingest data for one or several tables.

At Goldsky, we needed a way to configure CDC for a large Postgres database dynamically: the list of tables to ingest is driven by customer-facing features and is constantly changing.

We started using Flink CDC connectors built on top of the Debezium project, but we immediately faced many challenges caused mainly by the lack of incremental snapshotting.

But even after implementing incremental snapshotting ourselves, we still faced an issue around using replication slots in Postgres: we couldn't use a single connector to ingest all tables (it's just too much data), and we couldn't create a new connector for every new set of tables (we'd quickly run out of replication slots). So we needed to find a way to maintain a fixed number of replication slots for a dynamic list of tables.

In the end, we chose a consistent hashing algorithm to distribute the list of tables across multiple Flink jobs. The jobs also required some customizations to support the incremental snapshotting semantics from Flink CDC.
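
To make the table distribution concrete, here is a minimal consistent-hashing sketch in Python. It is an assumption-laden illustration, not the Goldsky implementation: the `HashRing` class, the `assign` helper, the slot names, and the choice of 64 virtual nodes per slot are all hypothetical. Each ring member represents one Flink job holding one replication slot, and each table name is mapped to the first ring point at or after its hash.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes (all names are illustrative)."""

    def __init__(self, slots, vnodes=64):
        # Place several points per slot on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{slot}#{i}"), slot) for slot in slots for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def slot_for(self, table: str) -> str:
        # Walk clockwise to the next ring point, wrapping around at the end.
        idx = bisect.bisect_right(self._points, _hash(table)) % len(self._points)
        return self._ring[idx][1]


def assign(tables, slots):
    """Group tables by the slot (i.e. the job) that should ingest them."""
    ring = HashRing(slots)
    assignment = {slot: [] for slot in slots}
    for table in tables:
        assignment[ring.slot_for(table)].append(table)
    return assignment
```

With a fixed set of slots, adding or removing a table never moves any other table. If a slot is later added, the only tables that move are those reassigned to the new slot, which keeps re-snapshotting to a minimum.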

We learned a lot about Debezium, Flink CDC and Postgres replication, and we're excited to share our learnings with the community!

Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming

The Activision Data team has been running a data pipeline for a variety of Activision games for many years. Historically we used a mix of micro-batch microservices coupled with classic Big Data tools like Hadoop and Hive for ETL. As a result, it could take up to 4-6 hours for data to be available to the end customers.

In the last few years, the adoption of data in the organization skyrocketed. We needed to de-legacy our data pipeline and provide near-realtime access to data in order to improve reporting, gather insights faster, and power web and mobile applications. I want to tell a story about heavily leveraging Kafka Streams and Kafka Connect to reduce the end-to-end latency to minutes, while at the same time making the pipeline easier and cheaper to run. We were able to successfully validate the new data pipeline by launching two massive games just 4 weeks apart.

Current 2023: The Next Generation of Kafka Summit

September 2023 San Jose, California, United States

Current 2022: The Next Generation of Kafka Summit

October 2022 Austin, Texas, United States

Flink Forward Global 2021

October 2021

Kafka Summit Americas 2021

September 2021

Kafka Summit 2020

August 2020 Austin, Texas, United States
