Speaker

Emily Riederer

Senior Analytics Manager at Capital One

Chicago, Illinois, United States

Emily Riederer is a Senior Analytics Manager at Capital One where she leads a team dedicated to the development of sustainable data products (including datamarts, internal analysis toolkits, and self-service business intelligence capabilities) for business partners ranging from entry-level analysts to senior executives. She is particularly passionate about bringing open source tools and mindsets to industry and empowering communities of practice within organizations.

Emily is a contributing author to 97 Things Every Data Engineer Should Know (O’Reilly) and The R Markdown Cookbook (CRC Press), and frequently writes about all things data on her blog at https://emilyriederer.netlify.com/. In her spare time, she serves as an editor for rOpenSci and maintains the dbtplyr dbt package and the projmgr and convo R packages.

Area of Expertise

  • Information & Communications Technology
  • Media & Information

Topics

  • data engineering
  • Data Management
  • Data Analytics
  • womxn in machine learning and data science
  • women in data science
  • women in machine learning and data science
  • Data Science
  • R Programming
  • Reproducibility
  • Code Quality
  • dbt
  • SQL
  • python
  • Data Quality

Better data testing with the data (error) generating process

Statisticians often approach probabilistic modeling by first understanding the conceptual data generating process. However, when validating messy real-world data, the technical aspects of the data generating process are largely ignored.

In this talk, I will argue the case for developing more semantically meaningful and well-curated data tests by incorporating both conceptual and technical aspects of "how the data gets made".

To illustrate these concepts, we will explore the NYC subway rides open dataset to see how the simple act of reasoning about real-world events and their collection through ETL processes can help craft far more sensitive and expressive data quality checks. I will also illustrate instrumenting such checks based on new features in the dbt-utils package (pending approval of a PR that I recently authored).

Audience members should leave this talk with a clear framework in mind for ideating better tests for their own pipelines.

Prior work inspiring this talk comes from past blog posts on grouped data checks (https://www.emilyriederer.com/post/grouping-data-quality/), common causes of error in ETL pipelines (https://www.emilyriederer.com/post/data-error-gen/), and an in-review PR to dbt-utils (to be reviewed and, per initial communications with the dbt team, approved before this conference).
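
As a rough illustration of why the grouped data checks mentioned above can be more sensitive than table-wide ones, here is a minimal Python sketch. It is not the dbt-utils implementation; the station names, the `rides` field, and the 10% threshold are all hypothetical, chosen only to show a failure that a global check would miss.

```python
from collections import defaultdict


def null_rate(rows, field):
    """Share of rows where `field` is missing (None)."""
    if not rows:
        return 0.0
    return sum(r[field] is None for r in rows) / len(rows)


def grouped_null_rates(rows, field, group_key):
    """Compute the null rate of `field` separately for each value of `group_key`."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[group_key]].append(r)
    return {g: null_rate(rs, field) for g, rs in groups.items()}


# Hypothetical data: station "B" has stopped reporting ride counts entirely,
# but it contributes only a small share of all rows.
rows = (
    [{"station": "A", "rides": 100} for _ in range(95)]
    + [{"station": "B", "rides": None} for _ in range(5)]
)

THRESHOLD = 0.10  # assumed tolerance for missing values

# The table-wide null rate stays under the threshold, so a global check passes.
overall = null_rate(rows, "rides")

# Checking per station exposes the group that is completely missing.
by_station = grouped_null_rates(rows, "rides", "station")
failing = sorted(g for g, rate in by_station.items() if rate > THRESHOLD)
```

Here the table-wide check passes while the per-station check flags station "B", which is the kind of error a collection failure at a single source might produce.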

Operationalizing Column-Name Contracts with dbtplyr

Complex software systems make performance guarantees through documentation and unit tests, and they communicate these to users with conscientious interface design. However, published data tables exist in a gray area; they are static enough not to be considered a “service” or “software”, yet too raw to earn attentive user interface design. This ambiguity creates a disconnect between data producers and consumers and poses a risk for analytical correctness and reproducibility.

In this talk, I will explain how controlled vocabularies can be used to form contracts between data producers and data consumers. Explicitly embedding meaning in each component of variable names is a low-tech and low-friction approach which builds a shared understanding of how each field in the dataset is intended to work.

Doing so can ease the burden on data producers by facilitating automated data validation and metadata management. At the same time, data consumers benefit from a reduced cognitive load in remembering names, a deeper understanding of variable encoding, and opportunities to analyze the resulting dataset more efficiently.

After discussing the theory of controlled vocabulary column-naming and related workflows, I will illustrate these ideas with a demonstration of the {dbtplyr} dbt package which helps analytics engineers get the most value from controlled vocabularies by making it easier to effectively exploit column naming structures while coding.
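
To make the idea of a column-name contract concrete, here is a minimal Python sketch of how a controlled vocabulary could drive automated validation. It does not use or mirror the {dbtplyr} API; the prefixes (`id`, `ind`, `n`), their rules, and the sample columns are hypothetical and serve only as an illustration.

```python
# Hypothetical controlled vocabulary: each column-name prefix promises
# something about the values stored in that column.
VOCABULARY = {
    "id": lambda v: v is not None,                  # identifiers are never missing
    "ind": lambda v: v in (0, 1),                   # indicators are 0 or 1
    "n": lambda v: isinstance(v, int) and v >= 0,   # counts are non-negative integers
}


def prefix_of(column):
    """Return the first underscore-delimited component of a column name."""
    return column.split("_", 1)[0]


def unknown_columns(columns):
    """List columns whose prefix is not part of the vocabulary."""
    return [c for c in columns if prefix_of(c) not in VOCABULARY]


def contract_violations(rows):
    """Return (row index, column, value) for every value that breaks its prefix's rule."""
    violations = []
    for i, row in enumerate(rows):
        for column, value in row.items():
            check = VOCABULARY.get(prefix_of(column))
            if check is not None and not check(value):
                violations.append((i, column, value))
    return violations


# Hypothetical rows: the second one breaks both the indicator and the count rules.
rows = [
    {"id_station": "A01", "ind_weekend": 0, "n_rides": 120},
    {"id_station": "B02", "ind_weekend": 2, "n_rides": -5},
]

bad_names = unknown_columns(["id_station", "ind_weekend", "n_rides", "total"])
violations = contract_violations(rows)
```

Because the rules are keyed on name components rather than on individual columns, any new column that follows the vocabulary would pick up its checks without extra configuration.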
