
Andrey Falko
Lyft, Staff Software Engineer
Oakland, California, United States
Andrey Falko is a Staff Software Engineer at Lyft, where he has been for more than a year. He is currently focused on building and scaling reliable PubSub systems for Lyft's Data Platform. Prior to Lyft, Andrey worked at Salesforce for nine years where he researched Kafka and Pulsar performance and reliability. While there, he also built an IaaS system, many CI/CD systems, a Zipkin service, and features for the Salesforce platform.
Fault Tree Analysis Applied to Apache Flink
As Flink's adoption grows, we find more developers asking our small Flink infrastructure team whether their application will meet specific reliability guarantees. For example, can Flink maintain a data-freshness guarantee of under 5 minutes?
This session dives into how an age-old reliability technique can be applied to guide Flink platform and application developers who want to tune and monitor their Flink-based solutions and avoid over-promising and under-delivering for their users.
We present a calculator and step-by-step guide that we came up with to show what can be tuned to improve Flink application reliability. Throughout the session, we visualize failure probabilities by growing a Fault Tree in order to systematically find strengths and weaknesses with Flink.
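The core of the calculator is standard fault tree arithmetic: basic-event failure probabilities are combined through OR and AND gates to estimate the probability of the top-level failure. Here is a minimal sketch of that arithmetic; the event names and probabilities are illustrative assumptions, not the numbers from the session.

```python
# Fault tree gate math, assuming independent basic events.

def or_gate(probabilities):
    """Top event occurs if ANY input event occurs."""
    p_none = 1.0
    for p in probabilities:
        p_none *= (1.0 - p)
    return 1.0 - p_none

def and_gate(probabilities):
    """Top event occurs only if ALL input events occur."""
    p_all = 1.0
    for p in probabilities:
        p_all *= p
    return p_all

# Hypothetical basic events that could break a 5-minute freshness guarantee:
p_checkpoint_timeout = 0.01   # illustrative probabilities only
p_kafka_unavailable = 0.002
p_task_manager_loss = 0.005

p_freshness_violation = or_gate([p_checkpoint_timeout,
                                 p_kafka_unavailable,
                                 p_task_manager_loss])
print(f"P(freshness violation) = {p_freshness_violation:.4f}")  # 0.0169
```

Growing the tree means wiring more such gates together: the OR gate shows which single weak component dominates the top-level probability, which is what makes the technique useful for tuning.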
Can Kafka Handle a Lyft Ride?
What does a Kafka administrator need to do when a user demands message delivery that is guaranteed, fast, and low cost? In this talk we walk through the architecture we created to deliver for such users, covering the alternatives we considered and the pros and cons of the approach we settled on.
In this talk, we dive into broker restart and failure scenarios and the work needed to prevent leader elections from slowing down incoming requests. We also need to take care of consumers, ensuring that they don’t process the same request twice. We plan to illustrate the architecture with a demo: simulated requests produced into Kafka clusters and processed by consumers while we aggressively cause failures on the clusters.
We hope the audience walks away with a deeper understanding of what it takes to build robust Kafka clients and how to tune them to accomplish stringent delivery guarantees.
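The client-side tuning the abstract alludes to usually comes down to idempotent, fully acknowledged produces plus consumer-side deduplication. The sketch below shows the flavor of settings and logic involved, using librdkafka-style configuration keys; the values and the `process_once` helper are illustrative assumptions, not the configuration presented in the talk.

```python
# Illustrative client settings for stringent delivery guarantees.
# Keys follow librdkafka / confluent-kafka naming; values are examples.
producer_config = {
    "bootstrap.servers": "kafka:9092",
    "enable.idempotence": True,   # broker dedups producer retries per partition
    "acks": "all",                # wait for all in-sync replicas
}

consumer_config = {
    "bootstrap.servers": "kafka:9092",
    "group.id": "ride-events",
    "enable.auto.commit": False,  # commit offsets only after processing succeeds
    "auto.offset.reset": "earliest",
}

def process_once(message_id, processed_ids):
    """Consumer-side dedup sketch: skip IDs that were already handled.

    `processed_ids` stands in for whatever durable store tracks
    processed requests; here it is just an in-memory set.
    """
    if message_id in processed_ids:
        return False  # duplicate delivery, ignore
    processed_ids.add(message_id)
    return True
```

With `acks=all` and idempotence the producer side avoids duplicates from retries, while manual offset commits plus a dedup check cover redelivery after consumer restarts.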
How to mutate your immutable log
Have you ever had your upstream producers write poisoned data that breaks your downstream consumers? Did Personally Identifiable Information (PII) land in a Kafka topic that wasn’t supposed to have it? Is your data pipeline under development, and you simply want to iterate quickly? Immutability is one of Kafka’s key and desirable features. However, when mistakes happen and you are paged at night, you sometimes wish there were an “easy button” to change the log.
This session first dives into some of the errors we have seen that caused prolonged outages. Recovering from them required late-night code changes on consumers or simply waiting things out.
The next part of the session proposes a topic versioning scheme that allows us to recover from the errors we describe. It segues into what it would take to build a control plane to manage the lifecycle of these versioned topics. We’ll cover the benefits and pitfalls of our proposed solution.