Yerachmiel Feltzman
Senior Big Data Engineer @ Tikal
Tel Aviv, Israel
6+ years of experience in data-centric positions, on both sides of the pipeline, with a demonstrated history of building distributed systems for data pipelines: from architecture design to implementation, through performance debugging, monitoring, and deployment.
The Apache Arrow revolution
Apache Arrow revolutionizes the way data solutions are built.
It has never been easier to interconnect different data tools while enjoying zero-copy reads for lightning-fast data access. Apache Arrow is powering an extensive list of open- and closed-source projects you might already be using: PySpark, Pandas, Polars, Dremio, Snowflake, Hugging Face, and the list goes on.
Let's take the time to understand what Apache Arrow is and why it's conquering the data world.
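As a small taste of that interoperability, here's a minimal sketch (column names and values are made up) that moves data from pandas to Polars through an Arrow table; pl.from_arrow reuses the Arrow buffers where it can, so the hand-off avoids copying:

```python
import pandas as pd
import polars as pl
import pyarrow as pa

# Build a small pandas DataFrame and convert it to an Arrow table.
df = pd.DataFrame({"user_id": [1, 2, 3], "score": [0.9, 0.7, 0.8]})
table = pa.Table.from_pandas(df)

# Hand the same Arrow data to Polars. pl.from_arrow reuses the Arrow
# buffers where it can, so the hand-off avoids copying the data.
pl_df = pl.from_arrow(table)
print(pl_df)
```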
Ooops... I deleted the whole production table. What did I learn from it?
First of all, I got some new gray hair on the spot. :)
After that, I took a deep breath and called my tech lead to discuss the issue. We ended up recovering in less than half a day, only to coin our motto: "Write your code as if you’re going to delete production".
Let me take you through this nightmare, which taught me 3 important lessons: 3 lessons that separate tech leads from the rest.
Data-led prediction with Spark and MLflow
Explore the capabilities of Apache Spark paired with MLflow, a comprehensive platform for managing the end-to-end machine learning lifecycle. Understand how combining these two open-source solutions enables the data-led machine-learning prediction architecture.
Throughout the discussion, we'll examine the different architectures for exposing ML models in production. We'll then focus on the data-led approach, demoing in practice how to implement it with Spark and MLflow.
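To give a flavor of the data-led approach, here's a minimal sketch assuming a model already registered in the MLflow Model Registry; the model URI, S3 paths, and app name are placeholders:

```python
import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-led-prediction").getOrCreate()

# Wrap a registered model as a Spark UDF. "models:/churn/Production" is a
# placeholder URI; point it at any model in your MLflow Model Registry.
predict = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn/Production")

# Score an entire feature table in batch, inside the Spark job itself.
features = spark.read.parquet("s3://my-bucket/features/")  # placeholder path
scored = features.withColumn("prediction", predict(*features.columns))
scored.write.mode("overwrite").parquet("s3://my-bucket/predictions/")
```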
*Target audience:* data engineers, data scientists, ML engineers, and backend developers who work with machine learning deployment and/or MLOps.
*Technical level:* intermediate. We will discuss model-exposure architectures and show implementation code.
*Duration:* flexible, from 30 minutes to 1 hour (the depth will be adapted to the available time).
The session is based on the real-life experience of bringing several ML models from the data scientists' hands to production, using Spark and MLflow.
I've also shared this experience in a series of three Medium articles:
1 - https://itnext.io/intro-to-mlops-model-life-cycle-from-a-data-engineers-eyes-b9347440fae4?source=friends_link&sk=e313e9855176ba85064408d8251fd50b
2 - https://medium.com/israeli-tech-radar/avoid-the-ml-dependencies-syncing-black-hole-2de061c1870e?source=friends_link&sk=8ad64bb408f9d172c422ddd528ac5a99
3 - https://medium.com/israeli-tech-radar/uncovering-mlflows-spark-udf-e46603971afa?source=friends_link&sk=12ad3474db4d64c789dab8171aa8de74
Action-Position data quality assessment framework
"I deleted 20k items from prod" - said David (fake name), the backend team leader. He mistakenly triggered a deletion data pipeline by a wrong configuration. "Yeah David, I've once deleted an entire table - don't worry, we will help you fix this."
How could have David avoided this by designing data quality gates for his pipeline?
What are the possible patterns he could have used?
Let's build together a practical framework to help us reason about and design data quality for our data pipeline.
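One possible gate, as a rough sketch (the function name and the 1% threshold are illustrative assumptions, not a prescription): a circuit breaker that refuses to run a deletion batch whose size looks suspicious.

```python
# The function name and the 1% threshold are illustrative assumptions.
MAX_DELETE_FRACTION = 0.01  # never drop more than 1% of a table in one run

def guard_deletion(rows_to_delete: int, table_row_count: int) -> None:
    """Abort the run when a batch would delete a suspicious share of a table."""
    fraction = rows_to_delete / max(table_row_count, 1)
    if fraction > MAX_DELETE_FRACTION:
        raise RuntimeError(
            f"Refusing to delete {rows_to_delete} rows "
            f"({fraction:.1%} of the table); check the pipeline configuration."
        )

try:
    guard_deletion(20_000, 100_000)  # a run like David's: 20% of the table
except RuntimeError as err:
    print(err)
```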
This session is based on a talk I gave to backend and data engineers at Tikal.
It also draws on an article I published that got attention in the data community, being featured in two publications and one podcast:
Data Engineering Weekly by Ananth P.: https://lnkd.in/d7HEekk4
Modern Data Stack: https://lnkd.in/dwHMTdjB
Data Eng Weekly radio: https://open.spotify.com/episode/29YPTJeeYMZqGDImGgVJba?si=jf-a3JBrSQiqm-PEwvLxFg
Lambda architecture is dead. Long live Lambda!
Streaming data pipelines are conquering the world. Jay Kreps's Kappa architecture, first made public in 2014 (!) in his "Questioning the Lambda Architecture" article, has become part of the mainstream.
In this talk, we'll let the Kappa and Lambda data pipeline architectures battle for a moment while we explain both, and finally propose a solution where they'll live in peace.
We'll also see a real-life example where we deployed the proposed architecture using Mongo, Kafka, Debezium, and Apache Spark.
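For a taste of that setup, here's a minimal sketch of a Spark Structured Streaming job reading Debezium change events from Kafka; the servers, topic, and paths are placeholders, and the job assumes the spark-sql-kafka connector is on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-pipeline").getOrCreate()

# Debezium publishes MongoDB change events to Kafka; the servers, topic,
# and paths below are placeholders.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "dbserver1.inventory.orders")
    .option("startingOffsets", "earliest")
    .load()
)

# Debezium payloads arrive as JSON in Kafka's value column.
changes = events.selectExpr("CAST(value AS STRING) AS change_json")

# One job replays history (earliest offsets) and keeps serving fresh events,
# which is what lets a single streaming path stand in for Lambda's two.
query = (
    changes.writeStream.format("parquet")
    .option("path", "/data/orders_changes")
    .option("checkpointLocation", "/checkpoints/orders")
    .start()
)
query.awaitTermination()
```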