Data Quality in the Databricks Lakehouse
As modern enterprises adopt the Lakehouse architecture to unify data engineering, analytics, and AI, maintaining high data quality becomes a foundational requirement. In this session, we explore how Databricks enables scalable, automated data quality management using a combination of open-source and native tools.
First, we will uncover Databricks Labs DQX. DQX is a simple validation framework for assessing the data quality of PySpark DataFrames. Its main benefit is that it enables real-time quality validation during data processing, rather than the after-the-fact monitoring that Lakehouse Monitoring provides, which allows quicker identification and resolution of data quality problems. DQX can quarantine invalid records, so you can investigate data quality issues before the data is ever written to the target table.
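To make this concrete, here is a minimal sketch of defining checks and splitting valid from invalid rows with DQX's metadata-driven API. The table and column names are illustrative placeholders, and exact check-argument names may differ between DQX versions, so treat this as a sketch rather than a definitive recipe:

```python
from pyspark.sql import SparkSession
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

# On Databricks, `spark` is already available; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

# Hypothetical input table (name is a placeholder).
input_df = spark.read.table("raw.orders")

# Declarative checks: "error" criticality routes failing rows to quarantine,
# "warn" only flags them. Argument names follow the DQX README conventions.
checks = [
    {
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"col_name": "order_id"}},
    },
    {
        "criticality": "warn",
        "check": {
            "function": "is_not_null_and_not_empty",
            "arguments": {"col_name": "customer_email"},
        },
    },
]

dq_engine = DQEngine(WorkspaceClient())

# Validate in-flight: good rows and quarantined rows come back as separate DataFrames,
# before anything is written to the target table.
valid_df, quarantined_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)

valid_df.write.mode("append").saveAsTable("curated.orders")
quarantined_df.write.mode("append").saveAsTable("quarantine.orders")
```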
Then we will dive into Databricks Lakehouse Monitoring, a native feature that lets you monitor the statistical properties and quality of the data in all of the tables in your account. Monitoring your data provides quantitative measures that help you track and confirm the quality and consistency of your data over time.
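As an illustration, a monitor can be attached to a Unity Catalog table programmatically via the Databricks SDK. This is a minimal sketch assuming the SDK's quality-monitors API with snapshot profiling (one of several profile types); the table, schema, and directory names are placeholders:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorSnapshot

w = WorkspaceClient()

# Create a snapshot-profile monitor on a table.
# All names below are illustrative, not real resources.
w.quality_monitors.create(
    table_name="main.curated.orders",          # table to monitor
    assets_dir="/Workspace/Shared/monitoring", # where dashboard assets are stored
    output_schema_name="main.monitoring",      # schema for the metric tables
    snapshot=MonitorSnapshot(),                # profile the full table on each refresh
)
```

Once the monitor runs, Databricks writes profile and drift metric tables to the output schema, which you can query directly or explore on the auto-generated dashboard.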