
Niall Langley
Data Engineer / Platform Architect
Bristol, United Kingdom
Niall is an independent consultant specialising in data engineering and platform architecture. He has worked with the Microsoft Data Platform tools for over 13 years; these days he helps clients build robust, scalable data pipelines around Databricks and the Lakehouse architecture.
Niall is actively engaged in the data community, blogging occasionally and regularly speaking at user groups and conferences in the UK, including SQLBits in 2019 and 2020. He is a committee member for Data Relay and has been a helper at SQLBits for many years.
Easy Data Pipelines with Databricks Delta Live Tables
Delta Live Tables is a new framework available in Databricks that aims to accelerate building data pipelines by providing out-of-the-box scheduling, dependency resolution, data validation and logging.
We'll cover the basics, then get into the demos to show how we can (a minimal code sketch follows the list):
- Set up a notebook to hold our code and queries
- Ingest quickly and easily into bronze tables using Auto Loader
- Create views and tables on top of the ingested data using SQL and/or Python to build our silver and gold layers
- Create a pipeline to run the notebook
- See how we can run the pipeline as either a batch job, or as a continuous job for low latency updates
- Use APPLY CHANGES INTO to upsert changed data into a live table
- Apply data validation rules to our live table definition queries, and get detailed logging info on how many records caused problems on each execution.
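To make the moving parts concrete, here's a minimal sketch of the kind of pipeline the demos build. The table names, landing path and columns are hypothetical placeholders, not the exact demo code:

```python
import dlt
from pyspark.sql.functions import col

# Bronze: incrementally ingest files with Auto Loader (the cloudFiles source).
# The landing path and JSON format are hypothetical.
@dlt.table(comment="Raw orders landed from cloud storage")
def raw_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/orders")
    )

# Silver: a data validation rule; rows failing it are dropped and
# counted in the pipeline's event log.
@dlt.table(comment="Validated orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def clean_orders():
    return dlt.read_stream("raw_orders")

# Upsert changed rows into a live table (the Python equivalent of
# APPLY CHANGES INTO), keyed on order_id and ordered by updated_at.
dlt.create_streaming_table("orders_current")
dlt.apply_changes(
    target="orders_current",
    source="clean_orders",
    keys=["order_id"],
    sequence_by=col("updated_at"),
)
```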
By the end of the session you should have a good view of whether this can help you build out your next data project faster, and make it more reliable.
50 min talk
Introduction to Performance Tuning on Azure Databricks
More and more organisations are building data platforms in the cloud, often using Spark and tools like Databricks to build data engineering pipelines. These distributed computing tools can be incredibly powerful, processing huge datasets remarkably quickly, but they have a steep learning curve. Teams migrating older on-premises data warehouses to cloud solutions like the Lakehouse rightly concentrate on getting good data over getting the best performance. But are you getting the most out of these shiny new tools?
In the cloud, data pipeline performance can have a big impact on monthly cost, making it much easier to justify spending time getting things running faster and more efficiently. This talk shows you the common pain points when working with Spark on Databricks: where to look, what to look for, and what can be done to improve things.
We'll look at how the architecture of Spark is reflected in the Spark UI, and how to use the UI, along with query plans and cluster metrics, to get a good understanding of whether you're wringing all the performance you can out of your cluster or burning cash on excess compute. We'll cover a list of quick wins to improve performance, then look at how to identify some common problems that hurt performance.
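As a taste of the approach, two of the simplest checks look something like this sketch (the table name is a hypothetical example):

```python
# Print the physical plan: Exchange operators mark shuffles, the usual
# first suspect when a job is slow.
df = spark.read.table("sales").groupBy("region").count()
df.explain(mode="formatted")

# Adaptive Query Execution is a common quick win; it's on by default in
# recent Databricks runtimes, but worth confirming on older clusters.
print(spark.conf.get("spark.sql.adaptive.enabled"))
```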
By the end of the session, you'll know how to check whether your data pipelines are running well, and whether the clusters you have are fit for the job. You'll hopefully have a few quick ways to improve performance, save some money, or both! You'll also know how to monitor performance after making changes, so you can check whether they made a difference and, if they did, earn some kudos with your team.
50 minute session; can be adjusted for either Fabric or Synapse SQL Pools
Streaming Jobs with Azure Databricks 101
Streaming is one of the buzzwords used when talking about the Lakehouse. It promises real-time analytics by enabling a continual flow of data into our analytics platforms. It's used to power real-time processes as diverse as fraud detection, recommendation engines, stock trading, GPS tracking and social media feeds. However, for data engineers used to working with batch jobs, this can be a big paradigm shift.
In this session we take a look at Spark Structured Streaming (sketched in minimal code after the list):
- When and why to use it
- How it works
- Aggregating data
- Joining streams
- Late arriving data
- Latency and performance
- Running streaming pipelines
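To ground the discussion, a minimal Structured Streaming job looks something like this sketch. It uses the built-in rate source so it runs anywhere; the window size and watermark are illustrative values:

```python
from pyspark.sql.functions import window, col

# The rate source emits (timestamp, value) rows - handy for demos.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# The watermark bounds how long we wait for late-arriving data before a
# window's aggregate is finalised.
counts = (
    events.withWatermark("timestamp", "30 seconds")
    .groupBy(window(col("timestamp"), "1 minute"))
    .count()
)

# Start a micro-batch query, writing to an in-memory sink for inspection.
query = (
    counts.writeStream.outputMode("append")
    .format("memory")
    .queryName("windowed_counts")
    .start()
)
```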
At the end of the session you'll know when and why to use Spark streaming, and what gotchas to look out for as you start your journey with streaming pipelines.
50 Minute Session
5 Tips and Tricks to Make You a Wizard in Your Terminal
These days there is a CLI (command line interface) tool for everything: Git, Azure, Databricks and SQL Server, to name a few. They're really important for your DevOps pipelines, but using them in PowerShell or the terminal can seem daunting if it's not familiar territory.
With a few simple commands and a bit of knowledge, we can take command of the terminal and lift our developer productivity to a new level. In this talk we'll cover 5 things that will help you become a wizard of the terminal, whether you are using Windows, Linux or Mac, PowerShell or Bash.
Come along and learn some neat ways to tame your terminal!
First public delivery
SQLBits 2024 - General Sessions
Data Relay 2022
