Simon Whiteley
Data Platform MVP. Databricks Beacon. Cloud Architect, Nerd
London, United Kingdom
Simon is Director of Engineering at Advancing Analytics Ltd, a Microsoft Data Platform MVP and a Databricks Beacon. A seasoned solution architect and technical lead with well over a decade of Microsoft analytics experience, he spends an inordinate amount of time running the Advancing Spark YouTube series. A deep techie with a focus on emerging cloud technologies and applying "big data" thinking to traditional analytics problems, Simon also has a passion for bringing it back to the high level and making sense of the bigger picture. When not tinkering with tech, he is a death-dodging London cyclist, a sampler of craft beers, an avid chef and a generally nerdy person.
Building a Lakehouse on the Microsoft Intelligent Data Platform
The Data Lakehouse is an emerging architecture driven by the Spark community, but it has a real place within the Microsoft ecosystem. So where do you start? What makes it different? How does it work within Synapse Analytics specifically?
This session aims to give you that context. We'll look at how Spark-based engines work and how we can use them within Synapse Analytics. We'll dig into Delta, the underlying file format that enables the Lakehouse, and take a tour of how the Synapse compute engines interact with it. Finally, we'll draw out our whole data architecture, understanding how the Lakehouse serves our whole data community.
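For a flavour of that interaction, here is a minimal PySpark sketch of writing and reading Delta from a Synapse Spark notebook - the storage account, containers and folder names are purely illustrative:

```python
# Minimal sketch: writing and reading the Delta format from a Synapse Spark pool.
# The abfss paths below are hypothetical examples, not a real environment.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided in a Synapse notebook

# Land some raw CSV data and write it to the lake as a Delta table
df = spark.read.option("header", "true").csv("abfss://raw@mydatalake.dfs.core.windows.net/sales/")
df.write.format("delta").mode("overwrite").save("abfss://curated@mydatalake.dfs.core.windows.net/sales_delta/")

# Any Delta-aware Synapse engine can now read the same files back
sales = spark.read.format("delta").load("abfss://curated@mydatalake.dfs.core.windows.net/sales_delta/")
sales.show(5)
```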
Building The Next Delta Lakehouse
We've been building data lakes for years, and we've seen how the Delta Lake file format brought stability & governance to the data lake, but with the Databricks Delta Engine it's grown into an analytics powerhouse.
This session dives into those new features, showing how data engineering has been transformed and simplified. We'll see how to apply new technologies to old techniques, how best to get your data to your users, and work through how to apply Delta to your own lake-based scenarios.
Delta Live Tables - The Databricks ETL Framework
There is a lot of complexity in building an engineering framework - when should it run? How are dependencies managed? How does it track data quality & telemetry over time? Databricks have released Delta Live Tables to tackle exactly this - DLT is a prebuilt framework that lets you describe sets of tables, in either SQL or Python, and then builds out the rest for you.
In this session, we will run through the core components of DLT before building out a sample pipeline, complete with data quality measurement, inter-table dependencies and post-run logging. We will look briefly at some more complex topics, such as managing incremental updates and real-time datasets, before considering the downsides of a black-box solution.
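To give a sense of how little framework code is left to write, here is a minimal sketch of a two-table DLT pipeline in Python - the table names, paths and quality rules are hypothetical, but the dlt decorators are the real API:

```python
# Minimal Delta Live Tables sketch: two tables, with dependencies and
# expectations declared rather than hand-coded. Paths/columns are illustrative.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders landed from the lake")
def orders_bronze():
    return spark.read.format("json").load("/mnt/landing/orders/")  # assumed mount point

@dlt.table(comment="Cleaned orders with basic quality rules applied")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect("recent_order", "order_date >= '2020-01-01'")
def orders_silver():
    # DLT infers that this table depends on orders_bronze from the dlt.read call
    return dlt.read("orders_bronze").withColumn("loaded_at", F.current_timestamp())
```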
Lessons in Lakehouse Automation
The term Data Lakehouse is still new to many, but the technology has now reached a level of maturity and sophistication that makes it more accessible than ever before. But where do you start with building a Data Lakehouse? How can you achieve the same level of maturity that we have with relational data warehouses? How do you avoid reinventing the wheel?
In this session, we're going to take you on the journey that data platforms have taken over the past few years. We'll look at the evolution of lakehouse architectures alongside the new techniques for code automation & metadata management that they unlock. We'll talk about some real-world problem scenarios and how you can model them within a reference architecture.
Bringing Data Lakes to your Purview
Data Lakes are tricky beasts, always followed by eye-rolling jokes about "data swamps" - but how DO you keep your lake under control? Way back in the day we had Azure Data Catalog, which did a decent job of cataloguing relational databases but was utterly rubbish at anything else. With Azure Purview we have a second shot, a chance to perform true Data Governance over lake-based platforms.
This session focusses on that use case specifically, taking the core elements of Azure Purview and scanning lake data, creating resource sets, plugging in Hive metastores and creating that lake catalog we've always dreamed of, all in 20 minutes or less!
feedback link: https://sqlb.it/?7057
Python Pipeline Primer: Data Engineering with Azure Databricks
Azure Databricks brings a Platform-as-a-Service offering of Apache Spark, which allows for blazing-fast data processing, interactive querying and the hosting of machine learning models, all in one place! But most of the buzz is around what it means for Data Science & AI - what about the humble data engineer who wants to harness that in-memory processing power within their ETL pipelines? How does it fit into the Modern Data Warehouse? What does data preparation look like in this new world?
This session will run through the best practices of implementing Azure Databricks as your data ingestion, transformation and curation tool of choice. We will:
• Introduce the Azure Databricks service
• Introduce Python and why it is the language of choice for Data Engineering on Databricks
• Discuss the various hosting & compute options available
• Demonstrate a sample data processing task
• Compare and contrast against alternative approaches using SSIS, U-SQL and HDInsight
• Demonstrate how to manage and orchestrate your processing pipelines
• Review the wider architectures and additional extension patterns
The session is aimed at Data Engineers & BI Professionals seeking to put the Azure Databricks technology in the right context and learn how to use the service. We will not be covering the Python programming language in detail.
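As a taste of the kind of task we'll demonstrate, a minimal PySpark ingest-transform-curate step might look like the sketch below - the mount points and column names are purely illustrative, not the session's exact demo:

```python
# Minimal sketch of an ingestion & curation task in a Databricks notebook.
# Paths and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # supplied for you in a Databricks notebook

# Ingest raw CSV files from the lake
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("/mnt/raw/sales/"))

# Basic transformation & cleansing
curated = (raw
           .dropDuplicates(["order_id"])
           .withColumn("order_date", F.to_date("order_date"))
           .withColumn("net_amount", F.col("gross_amount") - F.col("tax_amount")))

# Write out a curated, query-optimised copy
curated.write.mode("overwrite").parquet("/mnt/curated/sales/")
```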
Advanced Data Factory: Let the Data Flow!
Modern Azure Data Factories are lean, efficient data pipelines, fully parameterised, dynamic and hugely scalable - this session will show you how to achieve this and more!
Many people still miss the "good old days" of SSIS, and while it's now possible to run SSIS within Data Factory, that doesn't resolve the scalability and automation problems it faces. But never fear - the new ADF Data Flows provide a slick GUI interface over an incredibly scalable, cloud-first compute backend.
This session will run through the new features in ADFV2 and discuss how they can be used to streamline your factories, putting them in the context of real-world solutions. We will also look at the new compute options provided by ADF Data Flows, review how it interacts with Azure Databricks and set you up for truly cloud-native ETL!
Azure Databricks: Engineering Vs Data Science
Have you looked at Azure Databricks yet? No? Then you need to. Why, you ask? There are many reasons, but number one is simple: knowing how to use Apache Spark will earn you more money. Data Engineers and Data Scientists who know Apache Spark are in demand! This workshop is designed to introduce you to the skills required to do both.
In the morning we will introduce Azure Databricks, then discuss how to develop in-memory, elastic-scale data engineering pipelines. We will talk about shaping and cleaning data, the languages, notebooks, ways of working, design patterns and how to get the best performance. You will build an engineering pipeline with Python (or possibly some other stuff we are not allowed to tell you about yet). The engineering element will be delivered by UK MVP Simon Whiteley. Simon has been deploying engineering projects with Azure Databricks since it was announced and has real-world experience in multiple environments.
Then we will shift gears: we will take the data we moved and cleansed and apply distributed machine learning at scale. We will train a model, productionise it and then enrich our data with our newly predicted values. The data science element will be led by UK MVP Terry McCann. Terry holds an MSc in Data Science and has been working with Apache Spark for the last five years. He is dedicated to applying engineering practices to data science to make model development, training and scoring as easy and as automated as possible.
By the end of the day, you will understand how Azure Databricks supports both data engineering and data science, leveraging Apache Spark to deliver blisteringly fast data pipelines and distributed machine learning models. Bring your laptop, as this will be hands-on.
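For a flavour of the afternoon, a minimal Spark MLlib sketch of training and scoring a distributed model might look like this - the DataFrame, feature names and choice of model are assumptions rather than the exact workshop content:

```python
# Minimal sketch of distributed training and scoring with Spark MLlib.
# 'engineered' is assumed to be the cleansed DataFrame produced in the morning session;
# the feature/label columns and the model choice are illustrative only.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

train, test = engineered.randomSplit([0.8, 0.2], seed=42)

assembler = VectorAssembler(inputCols=["feature_a", "feature_b", "feature_c"],
                            outputCol="features")
model = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, model])
fitted = pipeline.fit(train)       # training is distributed across the cluster
scored = fitted.transform(test)    # enrich the data with newly predicted values
scored.select("prediction", "label").show(5)
```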
Pre-requisites
An understanding of data processing, either ETL or ELT, whether on-premises or in a big data environment. A basic level of machine learning knowledge would also be beneficial, but is not critical.
Laptop Required: Yes
Software: In the session we will be using Azure Databricks. We will have labs and demos that you can follow if you want to. If you do want to, then you will need the following:
• An Azure subscription
• Money on the Azure subscription
• Enough access on the subscription to make service principals
• Azure Storage Explorer
• PowerShell
Subscriptions: Azure
Advancing Databricks - Next Level ETL
Azure Databricks has been around for a while now, and Apache Spark for even longer. You've watched a couple of demos, got a brief overview of what Databricks does and you've got a rough idea of where it fits in… but where do you go from there?
This session is that next stop. We'll start by taking a deeper look inside the Spark engine, understanding what makes it tick and how it talks to data. We'll then break down some of the key features that come together to build the kind of data processing task that's changing how we think about ETL.
We'll be looking at:
• RDDs
• Schema Inference
• Metadata Management
• Parameterisation using Widgets
• Integration with ADF
If this is your first foray into Spark or Databricks, it'll be a bumpy ride!
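As a hint of where the parameterisation and ADF integration pieces land, here is a minimal sketch of a widget-driven notebook that a Data Factory pipeline could call - the widget names, paths and return value are hypothetical:

```python
# Minimal sketch: parameterising a Databricks notebook with widgets so ADF
# can drive it with different values. dbutils is built into Databricks notebooks.
dbutils.widgets.text("source_path", "/mnt/raw/customers/")  # hypothetical default
dbutils.widgets.text("load_date", "2024-01-01")

source_path = dbutils.widgets.get("source_path")
load_date = dbutils.widgets.get("load_date")

df = spark.read.parquet(source_path).filter(f"load_date = '{load_date}'")

# Returning a value lets the calling ADF pipeline pick up the result
dbutils.notebook.exit(str(df.count()))
```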
Value-Driven BI Development
You know the story - a new business application has been on-boarded and you've been told to get it into the warehouse. You build out a new star schema, work all weekend doing a production deployment and finally release it to the wider users. Of course, no one is using it, and you eventually find out why - whilst the model represents the real world... it has no actual business value.
There's a better way of working, partly driven by technology, partly by agile working practices. I'll take you through our preferred way of working, performing the absolute minimum up-front work before getting the model to the user, then productionising it later (if there's value!). Let me tell you about Value-Driven Development.
30-minute session without demos
Getting Started with Delta & The Lakehouse
Data Lakes have been around for an age, but they were often a niche, specialist thing. With huge advances in Parquet, the Delta format and lakehouse approaches, suddenly everyone wants to be lake-based... so how do you catch up?
This session runs through a quick recap of why lakes were so difficult, before digging into the Delta Lake format and all of the features it brings. We'll look at what Delta gives you through Spark, as well as through managed platforms such as Microsoft Fabric.
We'll spend some time looking at the more advanced features, seeing how we can achieve incremental merges, transactional consistency, temporal rollbacks and file optimisation, plus some deep and dirty performance tuning with partitioning, Z-ordering and V-ordering.
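To make a few of those features concrete, here is a minimal PySpark sketch - the table path, the updates_df DataFrame, the join key and the version number are all assumptions for illustration:

```python
# Minimal sketch of some advanced Delta features: merge (upsert), time travel
# and file optimisation. Paths, keys and 'updates_df' are hypothetical.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/mnt/curated/customers/")

# Incremental merge of newly arrived records into the existing table
(target.alias("t")
 .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Temporal rollback: read the table as it looked at an earlier version
previous = spark.read.format("delta").option("versionAsOf", 12).load("/mnt/curated/customers/")

# File optimisation and Z-ordering (Databricks / recent Delta Lake releases)
spark.sql("OPTIMIZE delta.`/mnt/curated/customers/` ZORDER BY (customer_id)")
```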
If you're planning, currently building, or looking after a Spark-based Data Lake and want to get to the next level of performance and functionality, this session is for you. Never heard of Parquet or Delta? You're about to learn a whole lot more!
Building an Azure Lakehouse in 60 minutes
It's the buzzword of the year - the "Data Lakehouse", that dream of a modern data platform that gives you all the functionality of a data warehouse with all of the benefits of a data lake, all in one box.
This action-packed session uses Azure Databricks as the core data transformation and analytics engine, augmenting it with Data Factory scheduling and Azure Synapse On-Demand as a serving layer, before presenting our data in Power BI.
It is VERY possible to build a lightweight, scalable analytics platform in a very short amount of time, and I'm going to show you how.
Databricks & Microsoft Fabric: It's Seamless
In the world of Azure, where everything was a separate cloud component, it was easy to mix and match tools and resources to build your own platform. But with Microsoft Fabric we have a single platform, so how do we bring our own tools to the party? How do we use all of the incredible engineering, machine learning and AI power inside Databricks, but woven seamlessly into our Fabric environment? How do we structure our Lakehouses? How do we think about security? Where does Unity Catalog come in? Lots of questions that need some thought!
In this session we'll run through the Databricks and Fabric platforms and compare the areas that overlap and where our various analytical personas would work. We'll then look at a reference architecture and go through the steps to bring the two together into a single approach, before contrasting some alternative patterns.