Simon Whiteley

Data Platform MVP. Databricks Beacon. Cloud Architect, Nerd

London, United Kingdom

Actions

CTO for Advancing Analytics Ltd, Microsoft Data Platform MVP and Databricks MVP. Simon is a seasoned solution architect & technical lead with well over a decade of Microsoft Analytics experience, who spends an inordinate amount of time running the Advancing Spark YouTube series. A deep techie with a focus on emerging cloud technologies and applying "big data" thinking to traditional analytics problems, Simon also has a passion for bringing it back to the high level and making sense of the bigger picture. When not tinkering with tech, Simon is a death-dodging London cyclist, a sampler of craft beers, an avid chef and a generally nerdy person.

Area of Expertise

Information & Communications Technology

Topics

Databricks
Data Engineering
Data Analytics
Big Data
Spark

What's wrong with the Medallion Architecture?

In recent years, companies have seen an explosion in adopting lakehouses - with every analytics developer suddenly rebranding themselves as a Lakehouse Export... but time and time again, we hear from organisations that they regret the layering of their lake, and once it's in, its difficult to change!

Maybe the zones don't quite fit what they were trying to achieve, or no one in the company understands what "silver" vs. "gold" actually means, maybe they had to go back and tack in new layers so we expand into "Diamond", "Platinum" and..."Tin"? We need to tackle a key question: Is the Medallion Architecture right for most businesses - and how should you interpret the advice?

In this session, we'll break down the different stages of data curation and talk about how it works in reality, calling on practical examples from many, many real-world implementations. We'll talk about schema evolution, data cleansing, record validation, and traditional data modelling techniques, layering them on top of our lakehouse zones so we truly understand what happens where.

This session is ideal for data architects, engineers, and analysts looking to design the best platform possible, backed by nearly a decade of Lakehouse development, not a few months on a public preview

The Spark & Delta Ask Me Anything

There's a whole lot going on in the world of Spark, with competing technologies, platforms and approaches making it really hard to keep up. Add to that the massive shift that Delta brought about and it's very tricky indeed!

The Advancing Analytics "Advancing Spark" Youtube channel has been tackling the nerdy lake-loving topics for several years now, and the team are here to answer your questions Live In Person!

We'll be collecting a list of questions from across social media beforehand, but also taking questions from the audience. If you have a burning question about Spark, Delta, Lakehouses or Databricks - this is the session for you!

Behind the Hype - Architecture Trends in Data

As an industry, us data folks love a buzzword. We see huge swings between centralisation and decentralisation, we debate ferociously between warehousing methodologies from decades past and yet we're fairly susceptible to the latest and greatest thing taking the industry by storm. So what's the latest buzz?

In this session, seasoned data engineer and youtube grumbler Simon Whiteley takes us on a journey through the current industry trends and buzzwords, carving through the hype to get at the underlying ideals. Which is going to last and which is a sales gimmick? Which bandwagon might actually take you in the right strategic direction? This session aims to answer these questions with a heady mix of option & optimism!

Building a Lakehouse on the Microsoft Intelligent Data Platform

The Data Lakehouse - it's an emerging architecture being driven by the spark community, but it has a real place within the Microsoft ecosystem. But where do you start? What makes it different? How does it work within Synapse Analytics specifically?

This session session aims to give you that context. We'll look at how spark-based engines work and how we can use them within Synapse Analytics. We'll dig into Delta, the underlying file format that enables the Lakehouse, and take a tour of how the Synapse compute engines interact with it. Finally, we'll draw out our whole data architecture, understanding how the Lakehouse serves our whole data community.

Building The Next Delta Lakehouse

We've been building data lakes for years, and we've seen how the Delta lake file format brought stability & governance to the data lake, but with the Databricks Delta Engine, it's grown into an analytics powerhouse.

This session dives into those new features, showing how data engineering has been transformed and simplified. We'll see how to apply new technologies to old techniques, how best to get your data to your users and work through how to apply delta to your own lake-based scenarios.

Delta Live Tables - The Databricks ETL Framework

There is a lot of complexity in building an engineering framework - When should it run? How are dependencies managed? How does it track data quality & telemetry over time? Databricks have released Delta Live Tables to tackle just this - DLT is a prebuilt framework that allows you to describe sets of tables, in either SQL or Python, then it will build out the rest for you.

In this session, we will run through the core components of DLT, before building out a sample pipeline, complete with data quality measurement, inter-table dependencies and post-run logging. We will look briefly at some more complex topics, managing incremental updates and real-time datasets, before looking at the downsides of a black-box solution.

Lessons in Lakehouse Automation

The term Data Lakehouse is still new to many, but the technology has now reached a level of maturity and sophistication that makes it more accessible than ever before. But where do you start with building a Data Lakehouse? How can you achieve the same level of maturity that we have with relational data warehouses? How do you avoid reinventing the wheel?

In this session, we're going to take you on the journey that data platforms have taken over the past few years. We'll look at the evolution of lakehouse architectures alongside the new techniques for code automation & metadata management that they unlock. We'll talk about some real-world problem scenarios and how you can model them within a reference architecture.

Bringing Data Lakes to your Purview

Data Lakes are a tricky beast, always followed with eyerolling jokes about "data swamps", but how DO you keep your lake under control? Way back in the day we had Azure Data Catalog, which did a decent job at cataloguing relational databases, but was utterly rubbish at anything else. With Azure Purview we have a second shot, a chance to perform true Data Governance over Lake-based platforms.

This session focusses on this use case specially, taking the core elements of Azure Purview and scanning lake data, creating resource sets, plugging in Hive metastores and creating that lake catalog we've always dreamed of, all in 20 mins or less!

feedback link: https://sqlb.it/?7057

Python Pipeline Primer: Data Engineering with Azure DataBricks

Azure DataBricks brings a Platform-as-a-Service offering of Apache Spark, which allows for blazing fast data processing, interactive querying and the hosting of machine learning models all in one place! But most of the buzz is around what it means for Data Science & AI - what about the humble data engineer who wants to harness the in-memory processing power within their ETL pipelines? How does it fit into the Modern Data Warehouse? What does data preparation look like in this new world?

This session will run through the best practices of implementing Azure DataBricks as your data ingestion, transformation and curation tool of choice. We will:

• Introduce the Azure DataBricks service
• Introduce Python and why it is the language of choice for Data Engineering on DataBricks
• Discuss the various hosting & compute options available
• Demonstrate a sample data processing task
• Compare and contrast against alternative approaches using SSIS, U-SQL and HDInsight
• Demonstrate how to manage and orchestrate your processing pipelines
• Review the wider architectures and additional extension patterns

The session is aimed at Data Engineers & BI Professionals seeking to put the Azure DataBricks technology in the right context and learn how to use the service. We will not be covering the python programming language in detail.

Advanced Data Factory: Let the Data Flow!

Modern Azure Data Factories are lean, efficient data pipelines, fully parameterised, dynamic and hugely scalable - this session will show you how to achieve this and more!

Many people still miss the "good old days" of SSIS, and it's now possible to run SSIS within Data Factory, but that doesn't resolve the scalability and automation problems it faces. But never fear, new ADF Data Flows provide a slick, GUI interface over an incredibly scalable, cloud-first compute backend.

This session will run through the new features in ADFV2 and discuss how they can be used to streamline your factories, putting them in the context of real-world solutions. We will also look at the new compute options provided by ADF Data Flows, review how it interacts with Azure Databricks and set you up for truly cloud-native ETL!

Azure Databricks: Engineering Vs Data Science

Have you looked at Azure DataBricks yet? No! Then you need to. Why you ask, there are many reasons. The number 1, knowing how to use apache Spark will earn you more money. It is that simple. Data Engineers and Data Scientists who know apache Spark are in-demand! This workshop is designed to introduce you to the skills required to do both.

In the morning we will introduce Azure DataBricks then discuss how to develop in-memory elastic scale data engineering pipelines. We will talk about shaping and cleaning data, the languages, notebooks, ways of working, design patterns and how to get the best performance. You will build an engineering pipeline with Python (Or possibly some other stuff we are not allowed to tell you about yet). The Engineering element will be delivered by UK MVP Simon Whiteley. Simon has been deploying engineering projects with Azure DataBricks since it was announced. He has real world experience in multiple environments.

Then we will shift gears, we will take the data we moved and cleansed and apply distributed machine learning at scale. We will train a model and productionise it. We will then enrich our data with our newly predicted values. The Data Science element will be led by UK MVP Terry McCann. Terry holds an MSc in Data Science and has been working with apache Spark for the last 5 years. He is dedicated to applying engineering practices to data science to make model development, training and scoring as easy an as automated as possible

By the end of the day, you will understand how Azure Databricks supports both data engineering and data science, levering apache Spark to deliver blisteringly fast data pipelines and distributed machine learning models. Bring your laptop as this will be hands on.

Pre-requisites
An understanding of ETL processing either ETL or ELT on either on-premises or in a big data environment. A basic level of Machine Learning would also be beneficial, but not critical.
Laptop Required:Yes

Software: In the session we will be using Azure Databricks. We will have labs and demos that you can follow if you want to. If you do want to then you will need the following: - An Azure Subscription - Money on the Azure Subscription - Enough access on the subscription to make service principals. - Azure Storage explorer- PowerShell
Subscriptions: Azure

Advancing Databricks - Next Level ETL

Azure Databricks has been around for a while now, and Apache Spark since 2012. You've watched a couple of demos, got a brief overview of what Databricks does and you've got a rough idea of where it fits in… but where do you go from there?

This session is that next stop. We'll start by taking a deeper look inside the spark engine, understanding what makes it tick and how it talks to data. We'll then break down some of the key features that come together to build the kind of data processing task that's changing how we think about ETL.

We'll be looking at:
• RDDs
• Schema Inference
• Metadata Management
• Parameterisation using Widgets
• Integration with ADF

If this is your first foray into Spark or Databricks, it'll be a bumpy ride!

Value-Driven BI Development

You know the story - a new business application has been on-boarded, you've been told to get it into the warehouse. You build out a new star schema, work all weekend doing a production deployment and finally, release it to the wider users. Of course, noone is using it and you eventually find out why - whilst the model represents the real world... it has no actual business value.

There's a better way of working, partly driven by technology, partly by agile working practices. I'll take you through our preferred way of working, performing the absolute minimum up-front work before getting the model to the user, then productionising it later (if there's value!). Let me tell you about Value-Driven Development.

30 minute session w/out demos

Getting Started with Delta & The Lakehouse

Data Lakes have been around for an age, but they were often a niche, specialist thing. With huge advances in parquet, the delta format and lakehouse approaches, it's suddenly everyone wants to be lake-based... so how do you catch up?

This session runs through a quick recap of why lakes were so difficult, before digging into the Delta Lake format and all of the features it brings. We'll look at what Delta gives you through spark, as well as through managed platforms such as Microsoft Fabric.

We'll spend some time looking at the more advanced features, how we can achieve incremental merges, transactional consistency, temporal rollbacks, file optimisation and some deep and dirty performance tuning with partitioning, Z-ordering and V-ordering

If you’re planning, currently building, or looking after a Data Lake with Spark currently and want to get to the next level of performance and functionality, this session is for you. Never heard of parquet or delta? You’re about to learn a whole lot more!

Building an Azure Lakehouse in 60 minutes

It's the buzzword of the year - the "Data Lakehouse", that novel dream of having a modern data platform that gives all the functionality of a data warehouse, but with all of the benefits of a data lake, all in one box.

This action packed session uses Azure Databricks as the core data transformation and analytics engine, augmenting it with Data Factory scheduling and Azure Synapse On-Demand as a serving layer before presenting our data in PowerBI.

It is VERY possible to build a lightweight, scalable analytics platform in a very short amount of time, and I'm going to show you how.

Simon Whiteley

Data Platform MVP. Databricks Beacon. Cloud Architect, Nerd

London, United Kingdom

Actions

Please note that Sessionize is not responsible for the accuracy or validity of the data provided by speakers. If you suspect this profile to be fake or spam, please let us know.

Speaker

Simon Whiteley

Actions

Links

Area of Expertise

Topics

Sessions

What's wrong with the Medallion Architecture?

The Spark & Delta Ask Me Anything

Behind the Hype - Architecture Trends in Data

Building a Lakehouse on the Microsoft Intelligent Data Platform

Building The Next Delta Lakehouse

Delta Live Tables - The Databricks ETL Framework

Lessons in Lakehouse Automation

Bringing Data Lakes to your Purview

Python Pipeline Primer: Data Engineering with Azure DataBricks

Advanced Data Factory: Let the Data Flow!

Azure Databricks: Engineering Vs Data Science

Advancing Databricks - Next Level ETL

Value-Driven BI Development

Getting Started with Delta & The Lakehouse

Building an Azure Lakehouse in 60 minutes

Simon Whiteley

Links

Actions