Provenance Research in Gray Systems Lab at Microsoft

Fotis Psallidas

Provenance encodes information that connects datasets, their generation workflows, and associated metadata (e.g., who executed a query and when). As such, provenance is instrumental for a wide range of enterprise applications, including governance, auditing, and observability. As provenance becomes more prevalent across enterprise applications, research and engineering need to work in tandem to define, develop, and optimize provenance functionality. To this end, at Microsoft’s Gray Systems Lab (GSL), we have identified provenance capture, provenance querying, and provenance-aware applications as key domains for provenance research and research engineering. In this talk, I will present selected projects we have been working on in GSL, along with key challenges and lessons learned in each research area. Regarding provenance capture, I will first present OneProvenance, an engine that efficiently captures dynamic, coarse-grained provenance from database logs; OneProvenance is currently in production in Microsoft Purview, where it supports dynamic, coarse-grained provenance extraction from Azure SQL. Furthermore, with the advent of machine learning and data science, provenance has also become important for enterprise-grade data science. In this direction, I will then present DSProvenance, an engine that can capture both static and dynamic provenance from data science pipelines. Regarding provenance querying, a main problem end users currently face is that programming interfaces on top of data catalogs are hard to use and lead to hard-to-optimize implementations. To address this, I will present our recent work on PurviewQL, a SQL-based frontend for reading and writing provenance and metadata on top of data catalogs.
Finally, to highlight the importance of provenance across application domains, I will provide a brief overview of provenance-aware projects we work on in GSL, including query optimization, job scheduling, semantic type inference, code synthesis, and data quality.

Fotis joined Microsoft as an RSDE in the Gray Systems Lab (GSL) in Jan. 2019, with a focus on the intersection of data management, provenance, instrumentation, data science, and programming languages. In 2019, he also received his Ph.D. degree in Computer Science from Columbia University, where he had earlier received his M.S. and M.Phil. degrees in 2014 and 2017, respectively. In 2011, Fotis received his B.S. degree with Honors from the Department of Informatics and Telecommunications (DIT) of the National and Kapodistrian University of Athens (NKUA). During the summers of 2014 and 2015, Fotis was an intern in the Data Management, Exploration, and Mining (DMX) group of Microsoft Research (MSR).