Allan Campopiano

Data Scientist @ Deepnote

Data Scientist and avid coffee drinker

Transformation is not transformation, is not transformation.

Many transformations are fine candidates for concretizing with dbt. But there are transformations that live in the data science world that are not well-suited for dbt—and probably for good reason. Consider the total set of all transformations, from mandatory pre-processing steps to sophisticated statistical transformations (e.g., converting data types versus computing robust measures of central tendency). The question quickly becomes: How do data teams decide which transformations to push down to dbt and which to leave up in the notebook?
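As a concrete illustration of the divide above, a robust measure of central tendency such as a 20% trimmed mean is awkward to express in SQL but trivial in Python, which is one reason such transformations tend to stay in the notebook. A minimal pure-Python sketch (the data are made up for illustration):

```python
def trimmed_mean(values, proportion=0.2):
    """Trimmed mean: drop the lowest and highest `proportion` of values,
    then average what remains -- a robust measure of central tendency."""
    xs = sorted(values)
    k = int(len(xs) * proportion)
    trimmed = xs[k:len(xs) - k] if k else xs
    return sum(trimmed) / len(trimmed)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # one extreme outlier

print(trimmed_mean(data))     # 5.5 -- the outlier is trimmed away
print(sum(data) / len(data))  # 14.5 -- the plain mean is pulled toward 100
```

The plain mean is dragged toward the outlier while the trimmed mean is not; encoding this kind of statistical logic as a dbt model is possible but rarely worth it.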

In this panel discussion, analytics engineers, data engineers, and data scientists discuss what transformation means to them, where and when transformation happens in their stack, and how to collaborate effectively between high- and low-level forms of transformation. The goal will be to surface the mental models of data transformation, from each perspective, in order to help data teams draw their own lines. There is no one-size-fits-all definition of transformation, and this discussion explores many branches of the topic.

Machine learning in the warehouse with Python and Snowpark

A machine learning pipeline contains many stages of data processing, including transformation, training, testing, and deployment.

As a data scientist moves through these stages, they will likely encounter a mishmash of different tools, languages, and environments (e.g., SQL, Python, CLIs, Docker, Kubernetes, notebooks, SQL editors, EC2, etc.). In addition, datasets are frequently moved from client to server and back as models are trained, deployed, and retrained.

With Snowflake's new Python-based library, Snowpark, much of this complexity can be reduced for basic ML pipelines—especially when combined with Deepnote functioning as the EDA and DWH control surface.

In this talk, we will examine an end-to-end ML pipeline that takes data through the requisite stages mentioned above (i.e., transformation to deployment) using nothing but Snowpark and Deepnote. In this setup, the warehouse functions like an in-memory Pandas DataFrame, with all database transformations, EDA, and ML code written in pure Python. We will demonstrate how to chain together lazily-computed database operations as well as how to deploy Python-based ML models directly to Snowflake's compute environment—all while (mostly) leaving the data in the warehouse.
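To give a flavor of the lazy chaining described above, here is a minimal sketch using the Snowpark Python API. It assumes the `snowflake-snowpark-python` package is installed and valid credentials are supplied; the table and column names are hypothetical, not from the talk itself:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col

# Connection parameters are placeholders; fill in real credentials.
session = Session.builder.configs({
    "account": "...", "user": "...", "password": "...",
    "warehouse": "...", "database": "...", "schema": "...",
}).create()

# Each method call below only builds a query plan -- nothing executes
# in Snowflake until an action such as .collect() or .to_pandas() runs.
features = (
    session.table("ORDERS")                     # hypothetical table
    .filter(col("AMOUNT") > 0)
    .group_by("CUSTOMER_ID")
    .agg(avg("AMOUNT").alias("AVG_AMOUNT"))
)

# Only the aggregated result leaves the warehouse.
train_df = features.to_pandas()
```

Because evaluation is deferred, the heavy lifting happens in Snowflake's compute, and only the small aggregated frame is pulled into the notebook for EDA or model training.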

The goal is to introduce data scientists to Snowpark so that, for some ML pipelines, they can stay in the notebook and seamlessly move from transformation to production—using the tools, libraries, and environments they already use.

Engineering structure with dbt + data science chaos with Deepnote

dbt brings engineering best practices into the world of data & analytics to help you build a solid foundation. Data science notebooks help you leverage that well-modelled data and create chaos in exploratory workflows. To make any data endeavor successful, you need both: tools that give you a strong foundation, coupled with tools that help you experiment and break things. In this talk, we will introduce Deepnote, a collaborative data science notebook, and show how you can use it with dbt to bring more collaboration into your team's data modelling and exploratory workflows.

**Talk outline**

1. The case for notebooks as a part of the modern data stack
1a. Forces that are in play that make the notebooks paradigm relevant today
1b. The need for collaborative data workflows and how notebooks can help
2. Introducing Deepnote
2a. Our mission - bringing data teams together to explore, analyze and present data from start to finish
2b. Quick demo (SQL cells as a first-class citizen, real-time collaboration, integrations, code intelligence that helps you write production-ready code)
3. Using dbt with Deepnote
3a. How teams can operationalize Deepnote with dbt to introduce more collaboration into their engineering + exploratory workflows
3b. The exact project is still a work in progress; we're defining it with our community. The foundational project can be found here: https://deepnote.com/project/Using-dbt-in-Deepnote-EVTMFcUdQ1WKrzGECkKzLA/%2Fnotebook.ipynb/#00012-88c753c7-8114-45ca-b36c-727a2bb51989
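As a taste of what operationalizing dbt from a notebook can look like, here is a minimal sketch using the standard dbt CLI. It assumes dbt with the Snowflake adapter is installed and a dbt project with a configured profiles.yml already exists; the selector name is hypothetical and this is not necessarily the workflow the talk will demo:

```shell
# Install dbt with the Snowflake adapter
pip install dbt-snowflake

# From a notebook cell: rebuild the staging models, then run their tests
dbt run --select staging
dbt test --select staging

# Compile without executing -- useful for inspecting the generated SQL
dbt compile
```

Running these commands from notebook cells keeps the modelled foundation (dbt) and the exploratory layer (the notebook) in one place.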
