DuckDQ: Data Quality Validation for Machine Learning Pipelines

Till Döhmen (RWTH Aachen University and Fraunhofer FIT)

Data quality validation plays an important role in ensuring the correct behaviour of production machine learning (ML) applications and services. Observing a lack of existing solutions for quality control in medium-sized production ML systems, we developed DuckDQ, a lightweight and efficient Python library for protecting machine learning pipelines from data errors. It integrates well with existing scikit-learn ML pipelines and requires neither a distributed computing environment nor ML platform infrastructure. DuckDQ's execution engine is built on top of DuckDB, the in-process OLAP database management system developed by the CWI Database Architectures group. The talk will give a brief introduction to the problem space and highlight some of the key design choices behind DuckDQ, including why DuckDB was a natural fit.
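To make the idea of declarative quality checks concrete, the sketch below shows what validating a batch of pipeline input against a few constraints can look like. This is a minimal illustration in plain Python, not DuckDQ's actual API: the `Constraint`, `is_complete`, `in_range`, and `validate` names are hypothetical, and a real engine like DuckDQ would push such checks down to DuckDB as SQL aggregations rather than loop in Python.

```python
# Hypothetical sketch of declarative data-quality constraints in the spirit
# of libraries like DuckDQ/deequ. All names here are illustrative only.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Constraint:
    name: str
    predicate: Callable[[Sequence[dict]], bool]

def is_complete(column: str) -> Constraint:
    # Fails if any row has a missing value in the given column.
    return Constraint(
        f"complete({column})",
        lambda rows: all(r.get(column) is not None for r in rows),
    )

def in_range(column: str, lo: float, hi: float) -> Constraint:
    # Fails if any non-null value falls outside [lo, hi].
    return Constraint(
        f"range({column})",
        lambda rows: all(
            lo <= r[column] <= hi for r in rows if r.get(column) is not None
        ),
    )

def validate(rows: Sequence[dict], constraints: Sequence[Constraint]) -> list[str]:
    """Return the names of failed constraints; an empty list means the batch passes."""
    return [c.name for c in constraints if not c.predicate(rows)]

rows = [{"age": 34, "income": 52000}, {"age": None, "income": 61000}]
failures = validate(rows, [is_complete("age"), in_range("income", 0, 1_000_000)])
print(failures)  # the null "age" triggers the completeness check
```

Running such checks before `fit` or `predict` lets a pipeline reject a corrupted batch early instead of silently producing degraded predictions.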

Till Döhmen is a PhD student at RWTH Aachen University and Fraunhofer FIT. His research interests lie at the intersection of data management and machine learning systems. Till obtained his Master's in Artificial Intelligence from the VU University and wrote his Master's thesis at the Database Architectures Group of CWI, before taking on various industry roles in the area of data science.