Speaker

Niall Langley

Data Engineer / Platform Architect

Bristol, United Kingdom

Niall is an independent consultant specialising in data engineering & platform architecture. He has been working with the Microsoft Data Platform tools for over 13 years. These days he helps clients build robust, scalable data pipelines around Databricks and the Lakehouse architecture.

Niall is actively engaged in the data community, blogging occasionally and regularly speaking at user groups and conferences in the UK, including SQLBits in 2019 & 2020. He is a committee member for Data Relay, and has been a helper at SQLBits for many years.

Area of Expertise

  • Information & Communications Technology

Topics

  • Microsoft Azure
  • Microsoft Data Platform
  • Azure Data Platform
  • Azure Data Factory
  • SQL
  • Modern Data Warehouse
  • Data Warehousing
  • Business Intelligence
  • Azure Databricks
  • Data Lake
  • Lakehouse
  • Data Engineering
  • PySpark
  • SparkSQL
  • Spark
  • Data Streaming
  • Delta Lake
  • Delta Live Tables
  • Apache Airflow

Easy Data Pipelines with Databricks Delta Live Tables

Delta Live Tables is a new framework available in Databricks that aims to accelerate building data pipelines by providing out-of-the-box scheduling, dependency resolution, data validation and logging.

We'll cover the basics, and then get into the demos to show how we can:
- Set up a notebook to hold our code and queries
- Ingest quickly and easily into bronze tables using Auto Loader
- Create views and tables on top of the ingested data using SQL and/or Python to build our silver and gold layers
- Create a pipeline to run the notebook
- Run the pipeline as either a batch job, or as a continuous job for low latency updates
- Use APPLY CHANGES INTO to upsert changed data into a live table
- Apply data validation rules to our live table definition queries, and get detailed logging info on how many records caused problems on each execution (see the sketch after this list)
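
To give a flavour of the code, here is a minimal sketch of a DLT notebook in Python. The landing path, table names and the expectation are hypothetical, and `spark` is assumed to be provided by the DLT notebook context:

    import dlt
    from pyspark.sql.functions import col

    # Bronze: incrementally ingest raw JSON files with Auto Loader.
    @dlt.table(name="orders_bronze", comment="Raw orders ingested with Auto Loader")
    def orders_bronze():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/landing/orders/")          # hypothetical landing path
        )

    # Silver: the expectation drops bad rows and counts them in the pipeline event log.
    @dlt.table(name="orders_silver", comment="Cleaned orders")
    @dlt.expect_or_drop("valid_amount", "amount > 0")
    def orders_silver():
        return dlt.read_stream("orders_bronze").where(col("order_id").isNotNull())

The notebook does nothing on its own until it is attached to a DLT pipeline, which supplies the scheduling, dependency resolution and logging described above.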

By the end of the session you should have a good view of whether this can help you build out your next data project faster, and make it more reliable.

50 min talk

Introduction to Performance Tuning on Azure Databricks

More and more organisations are building data platforms in the cloud, often utilising Spark and tools like Databricks to build data engineering pipelines. These distributed computing tools can be incredibly powerful, processing huge datasets very quickly, but they can have a steep learning curve. Teams migrating older on-premises data warehouses to cloud solutions like the Lakehouse often rightly concentrate on getting good data over getting the best performance. But are you getting the most out of these shiny new tools?

In the cloud, data pipeline performance can have a big impact on monthly cost, making it much easier to justify spending time getting things running faster and more efficiently. This talk aims to show you the common pain points when working with Spark using Databricks, showing you where to look, what to look for, and what can be done to improve things.

We'll look at how the architecture of Spark is reflected in the Spark UI, and how to use the UI, along with query plans and the cluster metrics, to get a good understanding of whether you're wringing all the performance you can out of your cluster, or burning cash on excess compute. We'll cover a list of quick wins to improve performance, and then look at how to identify some common problems that hurt performance.
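
As a small taster, here is a minimal sketch of two of the quick checks we'll cover, assuming a Databricks notebook where `spark` is already defined and using a hypothetical table name:

    # Hypothetical table; the date filter should be pushed down to the Delta scan.
    df = spark.table("sales.orders").filter("order_date >= '2024-01-01'")

    # Print the physical plan - look for full scans, shuffles and sort-merge joins.
    df.explain(mode="formatted")

    # Adaptive Query Execution (on by default in recent runtimes).
    print(spark.conf.get("spark.sql.adaptive.enabled"))

    # Broadcast join threshold - dimension tables under this size avoid a shuffle join.
    print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))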

By the end of the session, you'll know how to check if your data pipelines are running well, and if the clusters you have are fit for the job. You'll hopefully have a few quick ways to improve performance, save some money, or both! You'll also know how to monitor performance after making changes, so you can check whether they made a difference - and if they did, earn some kudos with your team.

50 Minute session, can be adjusted for either Fabric or Synapse SQL Pools

What's New in Delta Lake - Deep Dive

The Delta Lake file format is the foundation of the Lakehouse. In the past few years, the Delta Lake project has been one of the most active in the Spark ecosystem, with lots of new features added. But what do those new features mean for your data platform, what opportunities do they open up, and what do you need to do to take advantage of them?

This session starts with a quick overview of Delta Lake to ensure attendees are at the same level, and then dives into the latest features, showing how they work, how to use them, and when they are useful. We'll cover (with a short sketch after the list):
- SQL merge improvements
- Using the Change Data Feed to read a stream of changes to a Delta table
- ‘CREATE TABLE LIKE’ syntax for creating empty tables from the schema of an existing table
- Shallow clones of tables for copying tables without copying data files
- Deletion vectors for better merge/update performance and GDPR compliance
- Table features, the metadata that describes the features a given Delta table supports
- File level statistics on specific columns to help with skipping files on read
- Delta Universal Format - allows the Delta table to be read as if it were an Iceberg table
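
For a flavour of the syntax, here is a minimal sketch of three of the features above, run from Python in a Databricks notebook. The table names are hypothetical, and a runtime recent enough to support these features is assumed:

    # CREATE TABLE LIKE: an empty table with the schema of an existing one.
    spark.sql("CREATE TABLE sales.orders_empty LIKE sales.orders")

    # Shallow clone: copies the table metadata but points at the existing data files.
    spark.sql("CREATE TABLE sales.orders_clone SHALLOW CLONE sales.orders")

    # Change Data Feed: read row-level changes as a stream
    # (requires delta.enableChangeDataFeed = true on the source table).
    changes = (
        spark.readStream.format("delta")
        .option("readChangeFeed", "true")
        .table("sales.orders")
    )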

By the end of the session attendees will have a better understanding of the great new features available, and how they can be used to improve their data platform.

Turbo Charge your Lakehouse with Spark Streaming on Azure Databricks

Streaming is one of the buzzwords used when talking about the Lakehouse. It promises to give us real-time analytics by enabling a continual flow of data into our analytics platforms. It's being used to power real-time processes as diverse as fraud detection, recommendation engines, stock trading, GPS tracking and social media feeds. However, for data engineers used to working with batch jobs this can be a big paradigm shift.

In this session we take a look at Spark Structured Streaming (with a short sketch after the list):
- How it is architected
- What it can ingest
- How it handles state and late arriving data
- What latency and performance to expect
- Stateless vs stateful joins
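
As a reference point, here is a minimal sketch of a structured streaming job in Python. The paths and table name are hypothetical, and `spark` is assumed to come from the Databricks notebook:

    from pyspark.sql.functions import col

    stream = (
        spark.readStream.format("cloudFiles")       # incremental file ingestion with Auto Loader
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/events/")               # hypothetical landing path
        .where(col("event_type").isNotNull())
    )

    (
        stream.writeStream.format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/events/")  # tracks progress and state
        .trigger(availableNow=True)    # process what's available then stop; remove for continuous mode
        .toTable("events_bronze")
    )

Swapping the trigger is essentially the difference between a scheduled batch job and a continuously running, low latency pipeline.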

At the end of the session you'll have a good idea of what the hype around streaming actually means for your pipelines: can you improve latency and resilience, or reduce costs, by implementing streaming pipelines?

50 Minute Session

Better ETL with Managed Airflow in ADF

Building complex data workflows using Azure Data Factory can get a little clunky - as your orchestration needs get more complex you hit limitations like not being able to nest loops or conditionals, running simple Python, Bash or PowerShell scripts is difficult, and costs can grow quickly as you are charged per task execution. Recently another option became available: Managed Airflow in ADF.

Apache Airflow is a code-centric open-source platform for developing, scheduling and monitoring batch-based data workflows, built using the Python language data engineers know and love. But until Managed Airflow, getting it working in Azure was a complex task for customers more used to PaaS services such as ADF, Databricks and Fabric. It is also an important ETL orchestrator on AWS and GCP, so cross-cloud compatibility becomes simpler to achieve.

In this session we'll look at what Airflow is, how it's different from ADF, and what advantages Managed Airflow in ADF gives us. We'll talk about the idea of a DAG for building the workflow, and then work through some demos to show just how easy it is to use Python to write Airflow DAGs (like the sketch below) and import them into the Managed Airflow environment as pipelines. We'll then dive into the excellent monitoring UI and find out just how easy it is to trigger a pipeline, view it to see the dependencies between tasks, and monitor runs.
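
To make that concrete, here is a minimal sketch of an Airflow DAG in Python. The DAG name, tasks and schedule are hypothetical, and an Airflow 2.4+ environment is assumed; the file is then imported into the Managed Airflow environment as described above:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator

    def transform_and_load():
        print("transforming and loading...")   # placeholder for real logic

    with DAG(
        dag_id="daily_sales_load",             # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",                     # Airflow 2.4+; older versions use schedule_interval
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        load = PythonOperator(task_id="transform_and_load", python_callable=transform_and_load)

        extract >> load   # the DAG edge: extract must finish before transform_and_load runs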

By the end of the session attendees will have a good understanding of what Airflow is, when to use it, and how it fits into the Azure Data Platform.

50 min talk

SQLBits 2024 - General Sessions Sessionize Event

March 2024 Farnborough, United Kingdom

Data Relay 2022 Sessionize Event

October 2022
