Speaker

Hubert Dulay

Developer Advocate at StarTree

New York City, New York, United States

Hubert Dulay is an O'Reilly author of “Streaming Data Mesh” and “Streaming Databases” (early access). He is a veteran engineer with over 20 years of experience in big and fast data. Hubert has distilled that experience, gained while consulting for many financial institutions, healthcare organizations, and telecommunications companies, into simple solutions that solved hard data problems.

Area of Expertise

  • Information & Communications Technology

Topics

  • Data Science & AI
  • Machine Learning and AI
  • Data Streaming
  • Data Science
  • Stream Analytics
  • Realtime Analytics

How to Implement a Streaming Data Mesh

Data mesh is one of the most popular data platform architectures being explored today. This session will help you understand this self-service data platform in a streaming context, using Apache Flink SQL or a streaming database.

Today, data meshes are often implemented as batch pipelines. This session will show why a streaming data mesh is the better approach.

SQL is the universal language for data: it makes data accessible to a broad spectrum of data-related roles. Applying it to streams gives those same data personas the power of SQL over streaming data.

One of the hardest parts of building any data mesh is enabling domains that are new to data to build data products. These teams often lack the tooling, and sometimes the skills, of a data engineer. This session will introduce Flink SQL as an easy way to enable domains to be good tenants in a streaming data mesh.
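As an illustration of what a domain team's self-service data product might look like, here is a hedged sketch: a hypothetical Flink SQL job that reads a raw Kafka topic, aggregates it, and publishes the result as a new topic. All table names, topics, and connector options are invented for illustration, not taken from the session.

```python
# Illustrative Flink SQL a domain team might submit to publish a data product.
# Table names, topics, and connector options are hypothetical placeholders.
DATA_PRODUCT_DDL = """
CREATE TABLE orders_raw (
    order_id STRING,
    customer_id STRING,
    amount DECIMAL(10, 2),
    order_time TIMESTAMP(3),
    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'orders',
    'format' = 'json'
);

CREATE TABLE orders_by_customer (
    customer_id STRING,
    total_spend DECIMAL(10, 2),
    PRIMARY KEY (customer_id) NOT ENFORCED
) WITH (
    'connector' = 'upsert-kafka',
    'topic' = 'orders-by-customer',
    'key.format' = 'json',
    'value.format' = 'json'
);

INSERT INTO orders_by_customer
SELECT customer_id, SUM(amount) AS total_spend
FROM orders_raw
GROUP BY customer_id;
"""

# Split into individual statements, as a job submitter might.
statements = [s.strip() for s in DATA_PRODUCT_DDL.split(";") if s.strip()]
print(len(statements))  # 3 statements: source, sink, and the continuous query
```

The point of the sketch is the shape of the work: a domain expresses its data product as a source table, a sink table, and one continuous query, with no custom stream-processing code.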

Real-Time RAG with Apache Flink and Pinot

This session introduces the basics of vector indexes and vector databases and how to use them. We will walk through the steps of setting up a RAG (Retrieval-Augmented Generation) data pipeline into Apache Pinot (a real-time OLAP and vector store). We'll create embeddings from unstructured data using Flink, store them in Pinot, and then start asking questions about the data we just loaded.
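The embed-store-query loop described above can be sketched in miniature. This is a toy stand-in for the real pipeline: a bag-of-words counter replaces the embedding model, and a Python list replaces Flink and Pinot; all names and data are illustrative.

```python
import math
from collections import Counter

VOCAB = ["stream", "batch", "vector", "index", "pinot", "flink", "sql", "data"]

def embed(text: str) -> list[float]:
    """Toy embedding: term counts over a fixed vocabulary.
    A real pipeline would call an embedding model from a Flink job."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# "Store": embed documents, as the Flink -> Pinot leg would.
docs = [
    "flink sql turns a stream into a table",
    "pinot builds a vector index for fast search",
    "batch data arrives once a day",
]
store = [(d, embed(d)) for d in docs]

# "Retrieve": rank stored documents by similarity to the question.
question = embed("how does pinot index a vector")
best = max(store, key=lambda item: cosine(question, item[1]))
print(best[0])  # the Pinot document wins on term overlap
```

In a real RAG pipeline the retrieved document would then be fed to an LLM as context; here the retrieval step itself is the point.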

Technical details covered:
- Vector Indexes
- Apache Pinot
- Creating embeddings using Flink
- RAG
- Performing similarity searches

Takeaway:
How we can combine similarity search with real-time analytics.

Target audience:
This session is for anyone interested in RAG, AI, vector databases, and real-time processing.

The Streaming Plane

Zhamak Dehghani nicely described the architectural data planes. In the dynamic landscape of data management, the "data divide" highlights the crucial distinction between two essential components: the operational data plane and the analytical data plane. This distinction is particularly relevant in today's data-driven world, where organizations strive to extract maximum value from their data assets. Understanding the divide between these two planes is fundamental to devising effective strategies for managing, processing, and deriving insights from data.

The streaming plane connects the operational and analytical aspects of data processing. It captures and processes real-time data, allowing it to flow seamlessly into the analytical phase where it's stored, analyzed, and used for insights and decision-making. This bridge enables organizations to make quicker, data-driven decisions based on both real-time and historical data.
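The bridging idea can be sketched as a toy: each incoming event both updates a live aggregate (the operational, low-latency side) and lands in an append-only history (the analytical side). The class and field names below are illustrative, not part of any product.

```python
from collections import defaultdict

class StreamingPlane:
    """Toy bridge: each event updates a real-time view and lands in history."""

    def __init__(self):
        self.realtime = defaultdict(float)  # live aggregate for fast serving
        self.history = []                   # append-only log for analytics

    def ingest(self, event: dict) -> None:
        # One ingest feeds both sides of the data divide.
        self.realtime[event["user"]] += event["amount"]
        self.history.append(event)

    def current_total(self, user: str) -> float:
        return self.realtime[user]

plane = StreamingPlane()
for e in [{"user": "a", "amount": 10.0},
          {"user": "b", "amount": 5.0},
          {"user": "a", "amount": 2.5}]:
    plane.ingest(e)

print(plane.current_total("a"))  # 12.5 from the live view
print(len(plane.history))        # 3 events retained for historical analysis
```

Real systems put a streaming platform, connectors, and a real-time OLAP store where these two Python structures sit, but the dual-write-on-ingest shape is the same.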

Why should someone care:
The Streaming Plane empowers organizations to access and analyze a broader spectrum of data types, enabling better-informed decisions in real-time and over time.

By merging real-time and historical data, you gain a more comprehensive and nuanced view of your operations, customers, and markets. This leads to deeper insights, helping you uncover trends, anomalies, and opportunities that might otherwise go unnoticed.

Technical details covered:
- Data Mesh
- Real-Time OLAP (Apache Pinot)
- Streaming Databases
- Streaming Platforms
- Connectors

Takeaways:
Attendees will get a good understanding of what the streaming plane is and how to implement one using the technologies stated above.

Vector Databases (pg_vector) and Real-Time Analytics

This session introduces the basics of vector indexes and vector databases and how to use them. We will walk through setting up pg_vector, a vector extension for Postgres, and how to create embeddings from images. Then, we'll perform a similarity search on those images. We'll also cover the basics of distance algorithms and vector indexes.

Lastly, we'll go over how to use similarity search in real-time use cases and introduce Apache Pinot's new vector index feature.
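The distance algorithms mentioned above can be shown in plain Python. These are illustrative reimplementations of the math behind pgvector's three distance operators, not the extension's actual code.

```python
import math

# Plain-Python versions of the distance functions behind pgvector's
# operators (used in SQL as, e.g., ORDER BY embedding <=> '[1,2]' LIMIT 5).

def l2_distance(a, b):
    """Euclidean distance, the <-> operator."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def neg_inner_product(a, b):
    """Negative inner product, the <#> operator."""
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity), the <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

a, b = [1.0, 2.0], [2.0, 1.0]
print(l2_distance(a, b))        # sqrt(2), about 1.414
print(neg_inner_product(a, b))  # -4.0
print(cosine_distance(a, b))    # about 0.2
```

Which distance to use depends on the embedding model; cosine distance is the usual default when embeddings are not normalized to unit length.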

Technical details covered:
- Learn how to get started with pg_vector, the PostgreSQL vector extension.
- Learn how to create embeddings and perform similarity searches.

Takeaway:
How similarity search can be used in real time

Target audience:
Architects and developers interested in vector databases, real-time OLAPs, and stream processing.

Undoing Kleppmann: Putting the Database Back Together

“Turning the database inside-out” was Martin Kleppmann’s approach to better understanding stream processing at scale. It introduced materialized views on the stream, providing a database experience to streaming workloads. But there is another perspective on this idea: the database itself. From the database perspective, all the systems and features that support streaming workloads already exist in the database. So why not start streaming adoption from there?
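The materialized-view idea at the heart of this argument can be sketched in a few lines: a toy view maintained incrementally from a change stream rather than recomputed from scratch, roughly what incremental view maintenance does inside a streaming database. Everything here is illustrative.

```python
class IncrementalSumView:
    """Toy incremental view maintenance for a view like
    SELECT key, SUM(val) GROUP BY key, kept up to date from a change
    stream instead of being recomputed on every query."""

    def __init__(self):
        self.view = {}

    def apply(self, change):
        # Each change event adjusts only the affected group.
        op, key, val = change
        if op == "insert":
            self.view[key] = self.view.get(key, 0) + val
        elif op == "delete":
            self.view[key] = self.view.get(key, 0) - val
            if self.view[key] == 0:
                del self.view[key]

view = IncrementalSumView()
for change in [("insert", "clicks", 3),
               ("insert", "clicks", 2),
               ("insert", "views", 7),
               ("delete", "clicks", 2)]:
    view.apply(change)

print(view.view)  # {'clicks': 3, 'views': 7}
```

The deletion case is why IVM is harder than it looks: the view must be able to retract prior contributions, which is exactly what a database's write-ahead log of changes makes possible.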

**Why Should Someone Care?**
Streaming solutions face adoption challenges due to their inherent complexities. Real-time data, continuous streams, and distributed systems introduce intricacies that can be daunting for organizations. The need for handling events in sequence, managing out-of-order data, ensuring fault tolerance, and orchestrating complex workflows all contribute to the difficulty in adopting stream processing. Additionally, adapting existing infrastructures to accommodate the dynamic nature of streaming data often requires a significant shift in mindset and technology. Overcoming these hurdles demands a thorough understanding of the unique demands of streaming and a strategic approach to implementation.

**Technical Details Covered:**
- Materialized views
- IVM (incremental view maintenance)
- Write-ahead logs
- Streaming databases
- HTAP databases
- Consistency issues with stream processing

**Takeaways:**
- Attendees can expect to gain an understanding of the strategic importance of getting developers to adopt streaming and real-time analytics by providing a familiar database experience.

- A solution model for real-time analytical use cases.

- All databases are streaming databases.

Real-Time RAG with Apache Pinot

This session introduces the basics of vector indexes and vector databases and how to use them. We will walk through the steps of setting up a RAG (Retrieval-Augmented Generation) data pipeline into Apache Pinot (a real-time OLAP and vector store). We'll load documents from a website and start asking questions about that website.
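The document-loading step can be illustrated with a toy chunker: before embedding, a RAG loader typically splits each fetched page into overlapping windows so that context isn't cut off at chunk boundaries. The window sizes and helper below are invented for illustration, not taken from the session.

```python
def chunk_text(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into overlapping word windows, as a RAG loader might
    before embedding each chunk. The window sizes here are arbitrary."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        if window:
            chunks.append(" ".join(window))
        if start + size >= len(words):
            break
    return chunks

# Stand-in for a page fetched from the website: 20 placeholder words.
doc = " ".join(f"w{i}" for i in range(20))
chunks = chunk_text(doc, size=8, overlap=2)
print(len(chunks))  # 3 overlapping chunks cover all 20 words
```

Each chunk would then be embedded and written to the vector store, with the chunk text kept alongside the vector so retrieved matches can be handed to the LLM.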

Technical details covered:
- Vector Indexes
- Apache Pinot
- RAG
- Creating embeddings and performing similarity searches

Takeaway:
How similarity search can be used in real time

Target audience:
This session is for anyone interested in RAG, AI, vector databases, and real-time processing.

The Rise of Agentic AI in Real-Time Analytics

The concept of "agentic" AI refers to systems that can act autonomously and make decisions independently, marking a shift from passive to active AI applications. This post explores how agentic AI can enhance real-time analytics by allowing users to interact dynamically with data through various tools and agents, ultimately improving decision-making processes.

Takeaways:
- "Agentic" AI systems can operate autonomously, making decisions and taking actions similar to human agents.
- Traditional dashboards limit user interaction; agentic AI allows for more complex queries and insights beyond preset metrics.
- Hybrid search capabilities combine keyword-based and similarity searches for more accurate results.
- Identifying the right tools is crucial for developing effective agentic applications, enabling flexibility and adaptability to future needs.
- Leveraging AI and LLMs streamlines the process of data analysis, reducing the need for manual coding and enhancing efficiency.
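The hybrid-search takeaway above can be illustrated with a toy scorer that blends keyword overlap with vector similarity; the weighting scheme and hand-made embeddings are illustrative, not from any particular product.

```python
import math

def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms present in the document (keyword side)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def cosine_sim(a, b):
    """Cosine similarity (vector side)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query, doc, q_vec, d_vec, alpha=0.5):
    """Blend the two signals; alpha weights the keyword side."""
    return alpha * keyword_score(query, doc) + (1 - alpha) * cosine_sim(q_vec, d_vec)

# Two candidate docs with hand-made "embeddings" (illustrative only).
docs = [("reset your password in settings", [1.0, 0.0]),
        ("billing and invoices overview",   [0.0, 1.0])]
q, q_vec = "password reset", [0.9, 0.1]

ranked = sorted(docs, key=lambda d: hybrid_score(q, d[0], q_vec, d[1]),
                reverse=True)
print(ranked[0][0])  # the password doc wins on both signals
```

Blending the two signals catches cases where either alone fails: exact identifiers that embeddings blur together, and paraphrases that share no keywords with the query.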

Technologies:
- AI Agents
- LlamaIndex
- Vector Databases
- OLAP technologies
- RAG pattern

P99 CONF 2024 Sessionize Event

October 2024

AI Community Conference - Boston 2024 Sessionize Event

September 2024 Cambridge, Massachusetts, United States

SQL Saturday Syracuse 2024 Sessionize Event

September 2024 Syracuse, New York, United States

DataTune 2024 Sessionize Event

March 2024 Nashville, Tennessee, United States

SQL Saturday Atlanta 2024 - BI & Data Analytics Sessionize Event

February 2024 Alpharetta, Georgia, United States
