Speaker

David Ostrovsky

Software Engineer at Meta

Netanya, Israel

At age 9, little David found an old book called "Electronic Computational Machines" at the library and, after reading it in a single weekend, decided that this was what he wanted to do with his life. Three years later he finally got to touch a computer for the first time and discovered that it was totally worth the wait. One thing led to another, and now he’s a software engineer at Meta. David has over 25 years of industry experience and is a speaker, trainer, blogger, and co-author of “Pro Couchbase Server”. He specializes in large-scale distributed system architecture.

Area of Expertise

  • Information & Communications Technology

Topics

  • cloud-native software architecture
  • Big Data
  • Architecture

Modern Data Architecture: Breaking the Big Data Monolith

Over the past decade most companies transitioned to microservice-based architectures, successfully applying domain-driven design techniques to their applications and services. But, while some engineers were decomposing monoliths into microservices, others worked hard to create the biggest monolith of them all: the data lake. Although this centralized data platform model works for use-cases with simpler domains, it falls short whenever we have to deal with a large number of data sources, multiple teams, diverse consumers, and changing data requirements. The latest iteration in distributed data architecture is known as a Data Mesh, and it solves many of the ills of previous centralized approaches.

In this talk we’ll transfer the lessons we learned from microservices to data architecture: how to apply domain-driven design and product-oriented thinking to data management, how to make our data products self-describing, discoverable, and secure, how to build tests to ensure data quality and completeness, and what technologies and techniques can help us achieve our vision of agile, distributed data architecture.

Analytics for not-so-big data with DuckDB

In the past decade the industry has seen hundreds of new databases. Most of these newcomers are operational databases, meant for online workloads and for serving as the primary datastore for applications. A handful of new databases are meant for analytical use-cases, mainly large-scale big data workloads. Which makes DuckDB an interesting exception, because it's built for workloads that are too big for traditional databases, but not so big that they justify complicated big data tools. It's a lightweight, open-source, analytical database for people with gigabytes or single terabytes of data, not companies with hundreds of terabytes and teams of data engineers.

In this session we'll take DuckDB out for a test drive with live demos and discussion of interesting use-cases. We'll see how to use it to quickly run analytical queries on data from multiple data sources. We'll look at how to use DuckDB to transform and manipulate diverse datasets, such as turning a bunch of raw CSV data in S3 into a set of tables in MySQL with a single command. We'll check out its embedded capabilities, by running the database directly inside a Python application. And finally, we'll build a quick-and-dirty Data Lake by using DuckDB, without any complicated big data tools.

I'm not affiliated with DuckDB in any way, I just think it's a cool technology that fills an interesting niche in the data ecosystem and more people should be aware of its potential.

SQL-on-Anything with distributed query engines

We continuously find new ways to generate and store more data. In the past it was easier to separate online workloads, such as interactive database queries, from offline analytical workloads, such as Hadoop jobs that could run for multiple minutes or hours. However, we increasingly find ourselves having to provide interactive access to large datasets, whether for research and analytics, or to drive the actual application UI. Furthermore, we keep finding new places to store all this data. So how do we query data that’s spread across multiple SQL databases, Elasticsearch clusters, and S3 buckets, ideally with a nice familiar query language? This is where the family of tools known as SQL-on-Hadoop comes in.
In this talk, we’ll look at distributed query engines, using Apache Drill, Spark SQL, and Facebook’s Presto as our go-to examples. These are some of the most widely used engines in the industry today, as they provide the best available compromise between speed, convenience, and availability for interactive queries over large amounts of data. We'll examine various use-cases, trade-offs, and integration strategies to bring together data from multiple sources. We’ll discuss how to store and manage data to make a bunch of files behave like a database using columnar storage formats. And finally, we will dive into the architecture of various query engines, as well as their managed cloud service incarnations.
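To make the "data spread across multiple sources" problem concrete, here is a deliberately toy, hand-rolled version of the kind of cross-source join that engines like Presto or Drill plan and execute for you, using only the Python standard library (the table, file names, and data are invented for illustration):

```python
import csv
import sqlite3

# Source 1: a relational database holding user records.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

# Source 2: a flat CSV file of click events (standing in for S3 objects).
with open("clicks.csv", "w", newline="") as f:
    csv.writer(f).writerows(
        [["user_id", "page"], [1, "/home"], [1, "/buy"], [2, "/home"]]
    )

# A federated query engine would plan this join across both sources from a
# single SQL statement; here we wire it up by hand to show what's involved.
users = {uid: name for uid, name in db.execute("SELECT id, name FROM users")}
with open("clicks.csv", newline="") as f:
    clicks = [(users[int(r["user_id"])], r["page"]) for r in csv.DictReader(f)]

print(clicks)  # [('Ada', '/home'), ('Ada', '/buy'), ('Grace', '/home')]
```

Multiply this by dozens of sources, schemas, and formats, and the appeal of a query engine that hides all of the plumbing behind one SQL dialect becomes obvious.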

Synchronous vs asynchronous microservice architecture

When we set out to build a microservice-based system we usually design our services to communicate with each other synchronously, often using a request-response pattern over HTTP or gRPC. This makes perfect sense in most cases, because synchronous code is easier to debug, easier to reason about, many common use-cases involve calling web APIs, and there are excellent tools available. However, as we continue to build our product, we eventually reach the edges of what’s possible with the original model. Maybe it’s because the call tree between the services grows too large and hard to maintain, maybe it’s because we add more and more work to be done in-line, which inflates latency. And maybe we discover that our use-case wasn’t a good fit for a synchronous system; it just worked well enough at small scale, but we should have been building an event-based architecture from the start. Whatever the case may be, we are now faced with the need to change the system architecture, which is difficult and expensive.

In this talk we’ll explore various use-cases for synchronous vs. asynchronous microservice architectures, their advantages and disadvantages, differences in scalability, availability, and observability, as well as how to identify when we should start planning a migration from one to the other. We’ll talk about asynchronous system patterns like CQRS, Event Sourcing, and Data Buses. And we’ll address how to gradually evolve towards an asynchronous model, because no sensible business is going to just stop everything for six months while we rewrite large chunks of the infrastructure.
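The decoupling that makes event-based architectures attractive can be sketched in a few lines. This is a toy in-process illustration of the publish/subscribe idea (the class and topic names are mine, not from the talk; real systems would use a broker like Kafka):

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Minimal in-process event bus: producers publish, consumers subscribe."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # The publisher doesn't know or care who consumes the event --
        # that decoupling is what lets us add consumers without touching
        # (or re-deploying) the producer.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
audit_log = []
bus.subscribe("order.placed", lambda e: audit_log.append(e["order_id"]))
bus.subscribe("order.placed", lambda e: print(f"shipping order {e['order_id']}"))
bus.publish("order.placed", {"order_id": 42})
```

Contrast this with the synchronous version, where the order service would have to call the audit and shipping services directly, in-line, and absorb their latency and failures.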

Understanding Big Data for Software Engineers

Data warehouses, data lakes, data lake-houses (yes, that's a thing!), data meshes, data marts, OLAPs, streams, SQL, NoSQL, and hundreds of technologies whose names sound like Pokemon. That is what greets a software developer who wants to understand the world of big data. But as companies become more data-driven and data-rich, we as software developers have to evolve. We go from data users, to power users, and ultimately to data practitioners: engineers who understand and can leverage the full power of modern data tools.

In this talk we'll dive head first into the lake of data! We're going to figure out what the different technologies and paradigms mean, what problems they solve, and how they fit together to create the data architecture that underlies most modern distributed software systems.

Modern Decentralized Data Architecture

Over the past decade most companies transitioned to decentralized, microservice-based architectures, successfully applying domain-driven design techniques to their applications and services. But, while some engineers were decomposing monoliths into microservices, others worked hard to create the biggest monolith of them all: the data lake. While the centralized data platform model works for use-cases with simpler domains, it falls short whenever we have to deal with a large number of data sources, multiple teams, diverse consumers, and changing data requirements.

In this talk we’ll transfer the lessons we learned from a decade of microservices to the latest iteration in decentralized data architecture: the Data Mesh. We'll learn how to apply domain-driven design and product-oriented thinking to data management, how to make our data products self-describing, discoverable, and secure, how to build tests to ensure data quality and completeness, and what technologies and techniques can help us achieve our vision of agile, distributed data architecture.

Levels and Expectations

What makes a software developer senior? What lies beyond senior? Do you have to become a manager to continue your career growth, or is there another way? As a junior developer, how do you take control of your destiny and make deliberate career progress? What can your manager do to help you grow as an engineer? And what are hiring managers evaluating when looking to fill a particular role?
While the job titles may differ across companies and countries, the essence of what we expect from engineers remains the same.

This talk brings together a software engineer and a manager, with close to 40 years of combined industry experience, to give you their perspectives on career growth for software engineers. We’ll try to define practical milestones to aim for on your career journey. And we’ll talk about how to set goals and expectations for ourselves, and enlist our managers and other engineers to help us level up.

Co-presenting with Josef Goldstein, where I give the software engineer perspective, while he gives the engineering manager perspective.

Observability in Distributed Systems

Tell me if this sounds familiar: you have a web service that calls another service, which sends a Kafka message to a third service, which writes something to a database. Except sometimes it doesn’t. Where did the message go? Did the client not send it? Or did Kafka eat it? You don’t know. You look in the logs, but there are so many logs! You try to reproduce the problem, but annoyingly everything works fine. What to do?

In this talk we’ll explore mechanisms for observing and debugging distributed systems, with an eye towards taking an existing codebase that lacks observability and evolving it over time. In particular, we’ll focus on distributed tracing tools that let us track transactions which span multiple services and execution contexts. We’ll discuss how tracing differs from logging and monitoring. How to instrument applications to emit trace data, how to collect and store it, how to visualize transactions, and how this benefits developers, devops, and the business itself. We’ll look at leveraging popular open source technologies, like the CNCF OpenTracing project, Jaeger, Zipkin, and the newly released Elastic APM OpenTracing bridge.
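The core idea behind distributed tracing can be illustrated without any tracing library at all: attach an ID to each request and stamp it onto everything the request touches. This is a standard-library-only sketch of that idea (names are mine; real systems would use OpenTracing-compatible instrumentation and propagate the ID across process boundaries in headers):

```python
import contextvars
import uuid

# A context variable carries the trace ID through the call stack (and across
# async tasks) without threading it through every function signature.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def start_trace() -> str:
    """Called at the edge of the system, when a request first arrives."""
    tid = uuid.uuid4().hex
    trace_id_var.set(tid)
    return tid

def log(message: str) -> str:
    # Every log line is stamped with the current trace ID, so lines emitted
    # by different services handling the same request can be correlated.
    line = f"[trace={trace_id_var.get()}] {message}"
    print(line)
    return line

tid = start_trace()
line = log("charging credit card")
assert tid in line
```

Tracing tools like Jaeger and Zipkin build on exactly this: propagated IDs plus timing data, collected centrally and visualized as a per-request transaction tree.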

A Brief History of Big Data

In this talk we’ll go back over 20 years to the dawn of the modern big data age and look at the engineering challenges, technological advancements, and paradigm shifts that brought us from the humble relational databases of two decades ago to the massively distributed cloud data architectures that we have today.

Build Stuff 2024 Lithuania Sessionize Event

November 2024 Vilnius, Lithuania

Build Stuff 2023 Lithuania Sessionize Event

November 2023 Vilnius, Lithuania

NDC Oslo 2023 Sessionize Event

May 2023 Oslo, Norway

Build Stuff 2022 Lithuania Sessionize Event

November 2022 Vilnius, Lithuania

Build Stuff 2021 Lithuania Sessionize Event

November 2021 Vilnius, Lithuania

NDC London 2021 Sessionize Event

January 2021 London, United Kingdom

Build Stuff 2020 Lithuania Sessionize Event

November 2020

NDC Oslo 2019 Sessionize Event

June 2019 Oslo, Norway

NDC Oslo 2018 Sessionize Event

June 2018
