Most Active Speaker

Timothy Spann

Principal Developer Advocate for Data in Motion @ Cloudera

Princeton, New Jersey, United States

Tim Spann is the Principal Developer Advocate for Data in Motion @ Cloudera where he works with Apache Kafka, Apache Flink, Apache NiFi, Apache Iceberg, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, a Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a Senior Field Engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, DataWorks Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.

https://www.datainmotion.dev/p/about-me.html
https://dzone.com/users/297029/bunkertor.html

Awards

  • Most Active Speaker 2022

Area of Expertise

  • Information & Communications Technology
  • Media & Information

Topics

  • apache nifi
  • apache flink
  • apache kafka
  • minifi
  • iot
  • AI
  • IOT and Android Things
  • IoT
  • Industrial IoT
  • IIOT
  • Deep Learning
  • cloud
  • Cloud & Infrastructure
  • Cloud & DevOps
  • Cloud data lake use cases
  • Big Data
  • AWS Databases
  • BI on the data lake
  • All things data
  • Streaming Data Analytics
  • Data Streaming
  • Event Streaming
  • Streaming
  • AWS Cloud Containers Kubernetes Streaming Data Big Data HPC
  • Apache Pulsar
  • AWS Data
  • AI STUFF: Big Data Quantum computing & Machine Learning
  • All things data in Azure AWS GCP and on-premises
  • Analytics and Big Data
  • AWS Data & AI
  • Azure Data
  • Iot Edge
  • #IoT
  • Automotive IoT
  • IoT Edge AI
  • Database
  • Data Science
  • Azure Data Platform
  • Azure SQL Database
  • Databases
  • Data Science & AI
  • Azure Data Factory
  • Data Visualization
  • Azure Data & AI
  • Data Warehousing
  • Data Management
  • Database Administration
  • data engineering
  • Data Platform
  • Azure Data Lake
  • Microsoft Data Platform
  • Power BI Dataflows
  • streaming sql

Building a Full Lifecycle Streaming Data Pipeline

In this talk, we will delve into the process of building a full lifecycle streaming data pipeline using Apache Airflow, Apache Kafka, and Apache Iceberg. We will cover the key features and capabilities of each tool, and demonstrate how they can be integrated to create a robust and efficient pipeline for handling real-time streaming data.

By combining the power of Apache Kafka, Apache Airflow, Apache NiFi and Apache Iceberg, developers can build a full lifecycle streaming data pipeline that is capable of efficiently handling real-time data at scale. This talk will provide a comprehensive overview of how to utilize these tools to build a reliable and effective streaming data pipeline.

Building a Real-Time IoT Application with Apache Pulsar and Apache Pinot

We will walk step-by-step with live code and demos on how to build a real-time IoT application with Pinot + Pulsar.

First, we stream sensor data from an edge device monitoring location conditions to Pulsar via a Python application.

We have our Apache Pinot "realtime" table connected to Pulsar via the pinot-pulsar stream ingestion connector.

Our data streams into the table, and we visualize it with Superset.
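The first step above, publishing sensor readings to Pulsar from Python, might look roughly like the sketch below. It is not the code from the talk: the field names, the device id, the service URL and the topic name are all assumptions made for illustration. The Pulsar client is imported lazily so the payload builder can be tried without a broker.

```python
import json
import time
import uuid


def build_reading(temperature_c, humidity_pct, device_id="edge-device-1"):
    """Return one sensor reading as UTF-8 encoded JSON bytes.

    The field names and the default device id are hypothetical.
    """
    reading = {
        "uuid": str(uuid.uuid4()),
        "device_id": device_id,
        "ts": int(time.time() * 1000),  # epoch milliseconds
        "temperature_c": temperature_c,
        "humidity_pct": humidity_pct,
    }
    return json.dumps(reading).encode("utf-8")


def publish(readings,
            service_url="pulsar://localhost:6650",
            topic="persistent://public/default/thermalsensors"):
    """Send each (temperature, humidity) pair to a Pulsar topic.

    The URL and topic are placeholders; requires `pip install pulsar-client`.
    """
    import pulsar  # imported here so build_reading works without the client

    client = pulsar.Client(service_url)
    try:
        producer = client.create_producer(topic)
        for temperature_c, humidity_pct in readings:
            producer.send(build_reading(temperature_c, humidity_pct))
        producer.flush()
    finally:
        client.close()
```

With a local broker running, `publish([(21.5, 40.0), (21.7, 39.8)])` would send two JSON messages that a Pinot realtime table could then ingest.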

https://medium.com/@tspann/building-a-real-time-iot-application-with-apache-pulsar-and-apache-pinot-1e3baf8c1824

Source Code
https://github.com/tspannhw/pulsar-thermal-pinot

Reference
https://docs.pinot.apache.org/basics/data-import/pinot-stream-ingestion/apache-pulsar

https://dev.startree.ai/docs/pinot/recipes/pulsar

Sink Your Teeth into Streaming at Any Scale

Using the low-latency Apache Pulsar, we can build up millions of streams of concurrent data and join them in real time with Apache Flink. We need an ultra-low-latency database that can support these workloads to build next-generation IoT, financial and instant analytical transit applications.

By sinking data into ScyllaDB we enable amazingly fast applications that can grow to any size and join with existing data sources.

The next generation of apps is being built now, and you must choose the right low-latency, scalable platform for these massively data-intensive applications. ScyllaDB + Pulsar + Flink is that platform. Choose Open, Choose Fast, and Make the right choice.

Building Modern Data Streaming Apps

In my session, I will show you some best practices I have discovered over the last 7 years in building data streaming applications including IoT, CDC, Logs, and more.

In my modern approach, we utilize several open-source frameworks to maximize the best features of all. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Pulsar. From there we build streaming ETL with Apache Spark and enhance events with Pulsar Functions for ML and enrichment. We build continuous queries against our topics with Flink SQL. We will stream data into ScyllaDB.

We use the best streaming tools for the current applications with FLiPN and FLaNK. https://www.flipn.app/

Deploying Machine Learning Models with Pulsar Functions

In this talk I will present a technique for deploying machine learning models to provide real-time predictions using Apache Pulsar Functions. In order to provide a prediction in real-time, the model usually receives a single data point from the caller, and is expected to provide an accurate prediction within a few milliseconds. 

Throughout this talk, I will demonstrate the steps required to deploy a fully trained ML model that predicts the delivery time for a food delivery service based upon real-time traffic information, the customer's location, and the restaurant that will be fulfilling the order.
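As a rough illustration of the idea, here is a minimal sketch of a Python function in the plain `process(input)` style that Pulsar Functions can run, scoring one order per message. The linear formula and its coefficients are hypothetical stand-ins for a trained model, and the input field names (`order_id`, `distance_km`, `traffic_index`) are assumptions, not the schema used in the talk.

```python
import json

# Hypothetical coefficients standing in for a trained model.
BASE_MINUTES = 10.0
MINUTES_PER_KM = 2.5
TRAFFIC_WEIGHT = 8.0


def predict_minutes(distance_km, traffic_index):
    """Toy linear estimate of delivery time in minutes."""
    return (BASE_MINUTES
            + MINUTES_PER_KM * distance_km
            + TRAFFIC_WEIGHT * traffic_index)


def process(input):
    """Score a single JSON-encoded order and return a JSON-encoded prediction.

    Each call receives exactly one data point, matching the
    one-message-in, one-prediction-out pattern described above.
    """
    order = json.loads(input)
    minutes = predict_minutes(order["distance_km"], order["traffic_index"])
    return json.dumps({
        "order_id": order["order_id"],
        "eta_minutes": round(minutes, 1),
    })
```

In a real deployment the toy formula would be replaced by a model loaded once at start-up, so that each call only pays for inference.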

Architecting Your First Event Driven Serverless Streaming Applications

Once you have built a topic in Apache Pulsar, you will quickly see the need to build event-driven applications. This can require a lot of decisions on what framework to use, where to run it, how to deploy it, and how to manage these applications.

I will walk you step by step through building Pulsar Functions, which are the easy way to design, test, develop, integrate, deploy, monitor, and manage serverless streaming applications in Java and Python.

Together we will build a full application as an Apache Pulsar function and enjoy the power of running it in the cloud for IoT events and add any routing, transformation, or machine learning that we need to accomplish our business requirements.

Build ML Enhanced Event Streaming Applications with Java Microservices

In this talk we will walk through how to build event streaming applications as functions running with cloud native messaging via Apache Pulsar, at near-infinite scale in any cloud, Docker or Kubernetes. We will show you how to deploy ML functions to transform real-time data for IoT, streaming analytics and many other use cases. After this talk you will be able to build Java microservices with ease and deploy them anywhere utilizing the open source unified streaming and messaging platform, Apache Pulsar. Finally, we will show you how to add dashboards with WebSockets, no-code data sinks, integration with Apache NiFi data pipelines, SQL reports with Apache Spark and continuous ETL with Apache Flink. I have built many of these applications for many organizations as part of the FLiPN Stack. Let's build next generation applications today, regardless of whether your data is REST APIs, sensors, logs, NoSQL sources, events or database tables.

https://github.com/tspannhw?tab=repositories&q=FLiP&type=source

Building FLiPN Stack Edge AI Applications

Introducing the FLiPN stack which combines Apache Flink, Apache NiFi, Apache Pulsar and other Apache tools to build fast applications for IoT, AI, rapid ingest with Java, C#, Python or Golang.

FLiPN provides a quick set of tools to build applications at any scale for any streaming and IoT use cases.

Apache Pulsar enables Java applications to communicate asynchronously at any scale, geo-replicate and interact with non JVM applications. Pulsar also acts as a function mesh to run Java functions as a FaaS triggered by Events. All of this is open source and includes an integrated Schema Registry with support for JSON, Avro, Text and ProtoBuf schemas.

Tools
Java, Golang, Python, C#, Apache Flink, Apache Pulsar, Apache NiFi, MiNiFi, Apache MXNet, DJL.AI

References
https://streamnative.io/blog/engineering/2021-11-17-building-edge-applications-with-apache-pulsar/
https://www.datainmotion.dev/2019/08/rapid-iot-development-with-cloudera.html
https://www.datainmotion.dev/2019/09/powering-edge-ai-for-sensor-reading.html
https://www.datainmotion.dev/2019/05/dataworks-summit-dc-2019-report.html
https://www.datainmotion.dev/2019/03/using-raspberry-pi-3b-with-apache-nifi.html
https://www.datainmotion.dev/2021/11/producing-and-consuming-pulsar-messages.html

Apache Pulsar Development 101 with Python

In this session I will get you started with real-time cloud native streaming programming with Python.

We will start off with a gentle introduction to Apache Pulsar and setting up your first easy standalone cluster. We will then show you how to produce and consume messages with Pulsar using several different Python libraries, including the Python client, WebSockets, MQTT and even Kafka.

After this session you will be building real-time streaming and messaging applications with Python.

Ingesting Data at Scale into Elasticsearch with Apache Pulsar

One of the best things about Elasticsearch is its ability to handle large amounts of data and serve this data with sub-millisecond latency, which makes it an ideal platform to run analytics workloads. But like any purpose-built database, there are always trade-offs to consider. In Elasticsearch's case, the challenge is how to load the data continuously and at scale. A way to solve this problem is by using a buffer layer that can store and forward events to Elasticsearch. Apache Pulsar provides a great option for implementing this layer.

This talk will explain how Pulsar can implement data ingestion, validation, aggregation, and storage and push this data to Elasticsearch using the sink connector. It will provide the necessary knowledge for you to ingest any type of data, such as logs, sensor data, and streaming events, into Elasticsearch for analytics and visualization.

FLiP Into Apache Pulsar Apps with MongoDB

In this session, I will introduce you to the world of Apache Pulsar and how to build real-time messaging and streaming applications with a variety of OSS libraries, schemas, languages, frameworks and tools against MongoDB. We will show you all the options from MQTT, WebSockets, Java, Golang, Python, NodeJS, Apache NiFi, Kafka on Pulsar, Pulsar protocol and more. You will FLiP your lid on how much you learn in a short time. I will send out instructions on the few steps you need to get an environment ready to start building awesome apps. We'll also show you how to quickly deploy an app to a production cloud cluster with StreamNative.

Utilizing Apache Kafka, Apache NiFi and MiNiFi for EdgeAI IoT at Scale

A hands-on deep dive on using Apache Kafka, Kafka Streams, Apache NiFi + Edge Flow Manager + MiNiFi Agents with Apache MXNet, OpenVINO, TensorFlow Lite, and other deep learning libraries on the actual edge devices, including Raspberry Pi with Movidius 2, Google Coral TPU and NVIDIA Jetson Nano. We run deep learning models on the edge devices, send images, and capture real-time GPS and sensor data. Our low-code IoT applications provide easy edge routing, transformation, data acquisition and alerting before we decide what data to stream in real time to our data space. These edge applications classify images and sensor readings in real time at the edge and then send deep learning results to Kafka Streams and Apache NiFi for transformation, parsing, enrichment, querying, filtering and merging data to various Apache data stores including Apache Kudu and Apache HBase.

https://www.datainmotion.dev/2019/08/updating-machine-learning-models-at.html

Using Apache NiFi with Apache Pulsar for Fast Data On-Ramp

As the Pulsar community grows, more and more connectors will be added. To enhance the availability of sources and sinks and to make use of the greater Apache streaming community, joining forces between Apache NiFi and Apache Pulsar is a perfect fit. Apache NiFi also adds the benefits of ELT, ETL, data crunching, transformation, validation and batch data processing. Once data is ready to be an event, NiFi can launch it into Pulsar at light speed.

I will walk through how to get started, some use cases and demos and answer questions.

Hail Hydrate! From Stream to Lake with Pulsar and Friends

A cloud data lake that is empty is not useful to anyone.

How can you quickly, scalably and reliably fill your cloud data lake with diverse sources of data you already have and new ones you never imagined you needed? Utilizing open source tools from Apache, the FLiP stack enables any data engineer, programmer or analyst to build reusable modules with low or no code. FLiP utilizes Apache NiFi, Apache Pulsar, Apache Flink and MiNiFi agents to load CDC, logs, REST, XML, images, PDFs, documents, text, semistructured data, unstructured data, structured data and a hundred data sources you could never dream of streaming before.

I will teach you how to fish in the deep end of the lake and return as a data engineering hero. Let's hope everyone is ready to go from 0 to Petabyte hero.

FLiP Stack for Cloud Data Lakes

Utilizing an all Apache stack for Rapid Data Lake Population and querying utilizing Apache Flink, Apache Pulsar and Apache NiFi.

We can quickly stream data to and from any datalake, data lake house, lakehouse, database or any datamart regardless of cloud or size. FLiP allows for Java and Python developers to build scalable solutions that span messaging and streaming in cloud native fashion with full monitoring.

Apache Pulsar with MQTT for Edge Computing

Today we will span from edge to any and all clouds to support data collection, real-time streaming, sensor ingest, edge computing, IoT use cases and edge AI. Apache Pulsar allows us to build computing at the edge and produce and consume messages at scale in any IoT, hybrid or cloud environment. Apache Pulsar supports MoP which allows for MQTT protocol to be used for high speed messaging.

We will teach you to quickly build scalable open source streaming applications, whether you are running in containers, pods, edge devices, VMs, on-premises servers, moving vehicles or any cloud.

Continuous SQL with Kafka and Flink

In this talk, I will walk through how to set up and run continuous SQL queries against Kafka topics utilizing Apache Flink. We will walk through creating Kafka topics, defining schemas and publishing data.

We will then cover consuming Kafka data, joining Kafka topics and inserting new events into Kafka topics as they arrive. This basic overview will show hands-on techniques, tips and examples of how to do this.
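To make the idea concrete, the sketch below renders the kind of Flink SQL statements such a setup typically involves: a table definition backed by a Kafka topic, and a continuous `INSERT INTO ... SELECT` between two such tables. It is an illustration only. The helper names, table names, columns, broker address and consumer group are assumptions, and the `WITH` options follow the Flink Kafka SQL connector as commonly documented; check them against your Flink version.

```python
def kafka_table_ddl(table, topic, columns,
                    bootstrap="localhost:9092",
                    group_id="flink-sql-demo"):
    """Render a Flink SQL CREATE TABLE statement backed by a Kafka topic.

    `columns` is a list of (name, sql_type) pairs. The broker address and
    consumer group defaults are placeholders.
    """
    cols = ",\n".join(f"  `{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE `{table}` (\n{cols}\n) WITH (\n"
        "  'connector' = 'kafka',\n"
        f"  'topic' = '{topic}',\n"
        f"  'properties.bootstrap.servers' = '{bootstrap}',\n"
        f"  'properties.group.id' = '{group_id}',\n"
        "  'scan.startup.mode' = 'earliest-offset',\n"
        "  'format' = 'json'\n"
        ")"
    )


def continuous_insert(target, source, where):
    """Render a continuous query that copies matching events into another table."""
    return f"INSERT INTO `{target}` SELECT * FROM `{source}` WHERE {where}"


if __name__ == "__main__":
    schema = [("device_id", "STRING"), ("temperature_c", "DOUBLE")]
    print(kafka_table_ddl("sensors", "sensors", schema))
    print(kafka_table_ddl("alerts", "alerts", schema))
    print(continuous_insert("alerts", "sensors", "temperature_c > 30"))
```

The printed statements can be pasted into a Flink SQL client session. Once submitted, the insert runs as a long-lived job and writes each matching event to the target topic as it arrives.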

Apache NiFi 101: Introduction and Best Practices

https://www.datainmotion.dev/2020/06/no-more-spaghetti-flows.html
https://github.com/tspannhw/EverythingApacheNiFi
https://www.datainmotion.dev/2020/12/basic-understanding-of-cloudera-flow.html
https://www.datainmotion.dev/2020/10/top-25-use-cases-of-cloudera-flow.html

In this talk, we will walk step by step through Apache NiFi from the first load to first application. I will include slides, articles and examples to take away as a Quick Start to utilizing Apache NiFi in your real-time dataflows. I will help you get up and running locally on your laptop, Docker or in CDP Public Cloud.

I will cover:
Terminology
Flow Files
Version Control
Repositories
Basic Record Processing
Provenance
Backpressure
Prioritizers
System Diagnostics
Processors
Process Groups
Scheduling and Cron
Bulletin Board
Relationships
Routing
Tasks
Networking
Basic Cluster Architecture
Listeners
Controller Services
Remote Ports
Handling Errors
Funnels

Real-Time Streaming in Any and All Clouds, Hybrid and Beyond

Description
Today, data is being generated from devices and containers living at the edge of networks, clouds and data centers. We need to run business logic, analytics and deep learning at scale and as events arrive.

Tools:
Apache Flink, Apache Kafka, Apache NiFi, MiNiFi, DJL.ai, Apache MXNet.

References:
https://www.datainmotion.dev/2019/11/introducing-mm-flank-apache-flink-stack.html
https://www.datainmotion.dev/2019/08/rapid-iot-development-with-cloudera.html
https://www.datainmotion.dev/2019/09/powering-edge-ai-for-sensor-reading.html
https://www.datainmotion.dev/2019/05/dataworks-summit-dc-2019-report.html
https://www.datainmotion.dev/2019/03/using-raspberry-pi-3b-with-apache-nifi.html

Source Code: https://github.com/tspannhw/MmFLaNK

Tags
AI + Machine Learning Databases Developer Tools Hybrid Integration Internet of Things

Real-Time Streaming in Azure

Today, data is being generated from devices and containers living at the edge of networks, clouds and data centers. We need to run business logic, analytics and deep learning at scale and as events arrive.

Tools:
Apache Flink, Apache Kafka, Apache NiFi, MiNiFi, DJL.ai, Apache MXNet.

References:
https://www.datainmotion.dev/2019/11/introducing-mm-flank-apache-flink-stack.html
https://www.datainmotion.dev/2019/08/rapid-iot-development-with-cloudera.html
https://www.datainmotion.dev/2019/09/powering-edge-ai-for-sensor-reading.html
https://www.datainmotion.dev/2019/05/dataworks-summit-dc-2019-report.html
https://www.datainmotion.dev/2019/03/using-raspberry-pi-3b-with-apache-nifi.html

Source Code: https://github.com/tspannhw/MmFLaNK

Pack Your Bags, We’re Going on a Data Journey!

This three-hour workshop is aimed at organizations that have embarked, or are about to embark, on their data journey and are looking for guidance on best practices, tools, and recommendations for navigating the full data science lifecycle from collection to visualization.

Participants will be exposed to a variety of speakers and data experts to illuminate the critical elements that go into making their data journey a success. The session will kick off with a keynote speaker that will provide an overview of the data journey, followed by a hands-on demonstration highlighting the various personas needed in a data team participating in this journey. The demo will also showcase some of the open-source tools used by experts in the field, while using datasets and use cases relevant to nonprofits. Finally, participants will rotate between breakout sessions to further explore each of these tools and personas, and to give them an opportunity to speak with data specialists who can help address their specific data questions and challenges.

Participants will leave this interactive workshop armed with a stronger understanding and a roadmap to embark on their data journey successfully. We will also be incorporating best practices and learnings from our successful workshop at NetHope 2019.

Cracking the Nut, Solving Edge AI with Apache Tools and Frameworks

Today, data is being generated from devices and containers living at the edge of networks, clouds and data centers. We need to run business logic, analytics and deep learning at the edge before we start our real-time streaming flows. Fortunately using the all Apache Mm FLaNK stack we can do this with ease! Streaming AI Powered Analytics From the Edge to the Data Center is now a simple use case. With MiNiFi we can ingest the data, do data checks, cleansing, run machine learning and deep learning models and route our data in real-time to Apache NiFi and/or Apache Kafka for further transformations and processing. Apache Flink will provide our advanced streaming capabilities fed real-time via Apache Kafka topics. Apache MXNet models will run both at the edge and in our data centers via Apache NiFi and MiNiFi. Our final data will be stored in Apache Kudu via Apache NiFi for final SQL analytics.

Tools:
Apache Flink, Apache Kafka, Apache NiFi, MiNiFi, DJL.ai, Apache MXNet, Apache Kudu, Apache Impala, Apache HDFS

References:
https://www.datainmotion.dev/2019/11/introducing-mm-flank-apache-flink-stack.html
https://www.datainmotion.dev/2019/08/rapid-iot-development-with-cloudera.html
https://www.datainmotion.dev/2019/09/powering-edge-ai-for-sensor-reading.html
https://www.datainmotion.dev/2019/05/dataworks-summit-dc-2019-report.html
https://www.datainmotion.dev/2019/03/using-raspberry-pi-3b-with-apache-nifi.html

Source Code: https://github.com/tspannhw/MmFLaNK

Using the Mm FLaNK Stack for Edge AI (Flink, NiFi, Kafka, Kudu)

Introducing the FLaNK stack which combines Apache Flink, Apache NiFi, Apache Kafka and Apache Kudu to build fast applications for IoT, AI, rapid ingest.

FLaNK provides a quick set of tools to build applications at any scale for any streaming and IoT use cases.

https://www.flankstack.dev/

Tools
Apache Flink, Apache Kafka, Apache NiFi, MiNiFi, Apache MXNet, Apache Kudu, Apache Impala, Apache HDFS

References
https://www.datainmotion.dev/2019/08/rapid-iot-development-with-cloudera.html
https://www.datainmotion.dev/2019/09/powering-edge-ai-for-sensor-reading.html
https://www.datainmotion.dev/2019/05/dataworks-summit-dc-2019-report.html
https://www.datainmotion.dev/2019/03/using-raspberry-pi-3b-with-apache-nifi.html

Big Data Fest by SoftServe

May 2023

Real-Time Analytics Summit 2023

April 2023 San Francisco, California, United States

Devnexus 2023

April 2023 Atlanta, Georgia, United States

ScyllaDB Summit 2023

February 2023

Pulsar Summit Asia 2022

November 2022

2022 All Day DevOps

November 2022

AI DevWorld 2022

October 2022 San Jose, California, United States

Data on Kubernetes Day @ Kubecon / CloudNativeCon NA 2022

October 2022 Detroit, Michigan, United States

Current 2022: The Next Generation of Kafka Summit

October 2022 Austin, Texas, United States

JConf.dev 2022

September 2022 Chicago, Illinois, United States

SQLBits 2022

March 2022 London, United Kingdom

Elastic Community Conference 2022

February 2022

Scylla Summit 2022

February 2022

DeveloperWeek 2022

February 2022 Oakland, California, United States

GDG DevFest UK & Ireland

January 2022 London, United Kingdom

DataMinutes #2

January 2022

Automation + DevOps Summit

November 2021 Nashville, Tennessee, United States

AI DevWorld 2021

October 2021

API World 2021

October 2021

Scenic City Summit 2021

September 2021

Apache Con Global

September 2021 New Orleans, Louisiana, United States

Music City Tech 2021

September 2021

WorldFestival 2021

August 2021

Apache Con Asia

FLaNK

August 2021 Tokyo, Japan

AI and IoT Bulgaria Summit 2021

June 2021 Sofia, Bulgaria

AI DevWorld 2020

October 2020 San Jose, California, United States

NetHope Global Summit 2020

October 2020 New York City, New York, United States

Flink Forward Global Virtual 2020

October 2020
