
Andrey Falko
Lyft, Staff Software Engineer
Oakland, California, United States
Andrey Falko is a Staff Software Engineer at Lyft, where he has been for more than a year. He is currently focused on building and scaling reliable PubSub systems for Lyft's Data Platform. Prior to Lyft, Andrey worked at Salesforce for nine years where he researched Kafka and Pulsar performance and reliability. While there, he also built an IaaS system, many CI/CD systems, a Zipkin service, and features for the Salesforce platform.
Fault Tree Analysis Applied to Apache Flink
As Flink's adoption grows, we find more developers asking our small Flink infrastructure team whether their application will meet specific reliability guarantees. For example, can Flink maintain a data-freshness guarantee of under 5 minutes?
This session dives into how an age-old reliability technique can be applied to guide Flink platform and application developers who want to tune and monitor their Flink-based solutions and avoid over-promising and under-delivering for their users.
We present a calculator and step-by-step guide that we came up with to show what can be tuned to improve Flink application reliability. Throughout the session, we visualize failure probabilities by growing a Fault Tree in order to systematically find strengths and weaknesses with Flink.
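The core of the calculator is standard fault tree arithmetic: basic-event failure probabilities are combined through OR and AND gates to estimate the probability of the top-level failure. Here is a minimal sketch of that arithmetic; the event names and probabilities are illustrative assumptions, not the numbers from the session.

```python
# Fault tree gate math, assuming independent basic events.

def or_gate(probabilities):
    """Top event occurs if ANY input event occurs."""
    p_none = 1.0
    for p in probabilities:
        p_none *= (1.0 - p)
    return 1.0 - p_none

def and_gate(probabilities):
    """Top event occurs only if ALL input events occur."""
    p_all = 1.0
    for p in probabilities:
        p_all *= p
    return p_all

# Hypothetical basic events that could break a 5-minute freshness guarantee:
p_checkpoint_timeout = 0.01   # illustrative probabilities only
p_kafka_unavailable = 0.002
p_task_manager_loss = 0.005

p_freshness_violation = or_gate([p_checkpoint_timeout,
                                 p_kafka_unavailable,
                                 p_task_manager_loss])
print(f"P(freshness violation) = {p_freshness_violation:.4f}")  # 0.0169
```

Growing the tree means wiring more such gates together: the OR gate shows which single weak component dominates the top-level probability, which is what makes the technique useful for tuning.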
Can Kafka Handle a Lyft Ride?
What does a Kafka administrator need to do when a user demands message delivery that is guaranteed, fast, and low cost? In this talk we walk through the architecture we created to deliver for such users, covering the alternatives we considered and the pros and cons of the approach we settled on.
In this talk, we dive into broker restart and failure scenarios and the work needed to prevent leader elections from slowing down incoming requests. We also need to take care of consumers, ensuring that they don’t process the same request twice. We plan to illustrate the architecture with a demo: simulated requests produced into Kafka clusters and processed by consumers while we aggressively cause failures on the clusters.
We hope the audience walks away with a deeper understanding of what it takes to build robust Kafka clients and how to tune them to accomplish stringent delivery guarantees.
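The client-side tuning the abstract alludes to usually comes down to idempotent, fully acknowledged produces plus consumer-side deduplication. The sketch below shows the flavor of settings and logic involved, using librdkafka-style configuration keys; the values and the `process_once` helper are illustrative assumptions, not the configuration presented in the talk.

```python
# Illustrative client settings for stringent delivery guarantees.
# Keys follow librdkafka / confluent-kafka naming; values are examples.
producer_config = {
    "bootstrap.servers": "kafka:9092",
    "enable.idempotence": True,   # broker dedups producer retries per partition
    "acks": "all",                # wait for all in-sync replicas
}

consumer_config = {
    "bootstrap.servers": "kafka:9092",
    "group.id": "ride-events",
    "enable.auto.commit": False,  # commit offsets only after processing succeeds
    "auto.offset.reset": "earliest",
}

def process_once(message_id, processed_ids):
    """Consumer-side dedup sketch: skip IDs that were already handled.

    `processed_ids` stands in for whatever durable store tracks
    processed requests; here it is just an in-memory set.
    """
    if message_id in processed_ids:
        return False  # duplicate delivery, ignore
    processed_ids.add(message_id)
    return True
```

With `acks=all` and idempotence the producer side avoids duplicates from retries, while manual offset commits plus a dedup check cover redelivery after consumer restarts.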
How to mutate your immutable log
Have you ever had your upstream producers write poisoned data that breaks your downstream consumers? Did Personally Identifiable Information (PII) land in a Kafka topic that wasn’t supposed to have it? Is your data pipeline under development, and you simply want to iterate quickly? Immutability is one of Kafka’s key and desirable features. However, when mistakes happen and you are paged at night, you sometimes wish there were an “easy button” to change the log.
This session first dives into some of the errors we have seen that caused prolonged outages. Recovering from them required late-night code changes on consumers or simply waiting things out.
The next part of the session proposes a topic versioning scheme that allows us to recover from the errors we describe. It segues into what it would take to build a control plane to manage the lifecycle of these versioned topics. We’ll cover the benefits and pitfalls of our proposed solution.