Machine learning in the warehouse with Python and Snowpark

A machine learning pipeline contains many stages of data processing, including transformation, training, testing, and deployment.

As a data scientist moves through these stages, they will likely encounter a mishmash of different tools, languages, and environments (e.g., SQL, Python, CLIs, Docker, Kubernetes, notebooks, SQL editors, and EC2). In addition, datasets are frequently moved from client to server and back as models are trained, deployed, and retrained.

With Snowflake's new Python-based library, Snowpark, much of this complexity can be reduced for basic ML pipelines, especially when combined with Deepnote functioning as the control surface for exploratory data analysis (EDA) and the data warehouse (DWH).

In this talk, we will examine an end-to-end ML pipeline that takes data through the stages mentioned above (i.e., transformation to deployment) using nothing but Snowpark and Deepnote. Used together, they let the warehouse function like an in-memory pandas DataFrame, with all database transformations, EDA, and ML code written in pure Python. We will demonstrate how to chain together lazily computed database operations, as well as how to deploy Python-based ML models directly to Snowflake's compute environment, all while (mostly) leaving the data in the warehouse.
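
To make "lazily computed" concrete, here is a minimal toy sketch of the idea, not Snowpark's actual API. The `LazyTable` class, the `trips` table, and its columns are all hypothetical, and an in-memory SQLite database stands in for the warehouse. Each chained call only records what should happen; a single SQL query is built and sent to the database when `collect` is finally called, so the data stays on the database side until then.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyTable:
    """Hypothetical stand-in for a warehouse-backed DataFrame (not Snowpark's API)."""

    name: str
    columns: tuple = ("*",)
    predicates: tuple = ()

    def filter(self, condition: str) -> "LazyTable":
        # Only records the predicate; nothing is executed yet.
        return LazyTable(self.name, self.columns, self.predicates + (condition,))

    def select(self, *cols: str) -> "LazyTable":
        # Only records the projection; nothing is executed yet.
        return LazyTable(self.name, cols, self.predicates)

    def to_sql(self) -> str:
        # Compile the recorded operations into one SQL statement.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        return sql

    def collect(self, execute):
        # The accumulated query is handed to the database only at this point.
        return execute(self.to_sql())


# An in-memory SQLite database plays the role of the warehouse (an assumption
# made purely so the sketch is runnable).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, distance_km REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("Prague", 3.2), ("Prague", 12.5), ("Brno", 7.1)],
)

# Chaining builds up a query description without touching the data.
query = LazyTable("trips").filter("distance_km > 5").select("city", "distance_km")
sql = query.to_sql()

# Execution happens once, inside the database, when results are requested.
rows = query.collect(lambda q: conn.execute(q).fetchall())
print(sql)
print(rows)
```

Snowpark DataFrames follow the same general pattern: transformations accumulate into a query, and an action such as `collect()` triggers execution in Snowflake. The string-based conditions above are a simplification for illustration only.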

The goal is to introduce data scientists to Snowpark so that, for some ML pipelines, they can stay in the notebook and seamlessly move from transformation to production—using the tools, libraries, and environments they already use.

Allan Campopiano

Data Scientist @ Deepnote
