Machine Learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policy makers, scientists, and the media. ML applications are often brittle with respect to their input data, which leads to concerns about their correctness, reliability, and fairness. In this talk, I will describe mlinspect, a library that helps with tasks like diagnosing and mitigating technical bias that may arise during preprocessing steps in an ML pipeline. The key idea is to extract a directed acyclic graph representation of the dataflow from an ML pipeline, and to use this representation to automatically instrument the code with predefined inspections. These inspections rely on a lightweight annotation propagation approach that passes metadata, such as lineage information, from operator to operator. In contrast to existing work, mlinspect operates on the declarative abstractions of popular data science libraries, such as estimator/transformer pipelines, and does not require manual code instrumentation. I will describe the design and implementation of the mlinspect library and discuss its performance.
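To make this concrete, here is a minimal sketch of how a pipeline might be inspected, loosely following the usage shown in the mlinspect project's examples; the file name pipeline.py and the sensitive attribute "race" are illustrative placeholders, and exact class and method names should be verified against the library's documentation.

    from mlinspect import PipelineInspector
    from mlinspect.inspections import MaterializeFirstOutputRows
    from mlinspect.checks import NoBiasIntroducedFor

    # Run the unmodified pipeline file under inspection: mlinspect extracts a
    # DAG of the dataflow and instruments each operator automatically.
    inspector_result = PipelineInspector \
        .on_pipeline_from_py_file("pipeline.py") \
        .add_required_inspection(MaterializeFirstOutputRows(5)) \
        .add_check(NoBiasIntroducedFor(["race"])) \
        .execute()

    # Results: the extracted DAG, per-operator inspection annotations,
    # and the outcome of each check.
    dag = inspector_result.dag
    inspection_results = inspector_result.dag_node_to_inspection_results
    check_results = inspector_result.check_to_check_results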
I am a second-year PhD student at the University of Amsterdam, conducting research at the intersection of data management and machine learning. I obtained my master's degree in software engineering from TU Munich and wrote my master's thesis with Julia Stoyanovich from NYU and Sebastian Schelter from UvA. In the past, I interned at Amazon Research and Oracle Labs, and worked as a research assistant for the Database Group at TU Munich. In these roles, I worked on Deequ, PGX, and Umbra.