Session

Data Linkage Options: MDM vs Splink (PySpark)

Linking multiple disconnected datasets is challenging. The problem arises when no common ID field is available to join the datasets, so less precise methods, such as fuzzy matching across multiple fields, must be used instead.

In this session we will provide an overview comparing options for linking data, including out-of-the-box options (e.g. MDM), code-heavy options (e.g. Spark) and other solutions in between. We will then introduce some of the theory used in probabilistic data linkage models and their differences from deterministic models. For our examples we will use Splink, an open-source Python package developed by the Ministry of Justice in the UK to implement fast probabilistic record linkage and deduplication at scale.
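To make the contrast between deterministic and probabilistic linkage concrete, here is a minimal sketch in plain Python of Fellegi-Sunter style match weights, the kind of theory that probabilistic linkage tools build on. It is not Splink's API. The field names, the m and u probabilities, and the prior are all hypothetical values chosen for illustration, and fields are compared by exact equality rather than fuzzy similarity to keep the example short.

```python
import math

# Hypothetical parameters (illustrative only, not estimated from real data).
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
PARAMS = {
    "first_name": {"m": 0.90, "u": 0.01},
    "surname": {"m": 0.95, "u": 0.005},
    "dob": {"m": 0.98, "u": 0.001},
}
PRIOR = 0.001  # assumed probability that a randomly chosen pair is a match


def deterministic_match(a, b):
    """Deterministic rule: link only if every field agrees exactly."""
    return all(a[field] == b[field] for field in PARAMS)


def match_probability(a, b):
    """Probabilistic rule: sum log2 Bayes factors, then convert to a probability."""
    weight = math.log2(PRIOR / (1 - PRIOR))  # start from the prior odds
    for field, p in PARAMS.items():
        if a[field] == b[field]:
            weight += math.log2(p["m"] / p["u"])  # agreement adds evidence
        else:
            weight += math.log2((1 - p["m"]) / (1 - p["u"]))  # disagreement subtracts it
    odds = 2 ** weight
    return odds / (1 + odds)


# Two records that differ only by a typo in the first name.
rec_a = {"first_name": "Jon", "surname": "Smith", "dob": "1980-01-01"}
rec_b = {"first_name": "John", "surname": "Smith", "dob": "1980-01-01"}
# A record that shares no field values with rec_a.
rec_c = {"first_name": "Mary", "surname": "Jones", "dob": "1975-06-30"}

print(deterministic_match(rec_a, rec_b))
print(match_probability(rec_a, rec_b))
print(match_probability(rec_a, rec_c))
```

Under these assumed parameters the deterministic rule rejects the typo pair outright, while the probabilistic score still treats it as a likely match because agreement on surname and date of birth outweighs the single disagreement.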

You might also want to check out our in-depth session on Splink, which you can find on YouTube here:
https://www.youtube.com/watch?v=1ijNR3V4v3w

Barney Lawrence

Consultant focusing on data analytics and engineering on the Microsoft platform

Chesterfield, United Kingdom
