Data Science Through the Looking Glass and What We Found There

Bojan Karlas

The recent success of machine learning (ML) has led to an explosive growth of new systems and methods for supporting the ML development process. This process encompasses both data science projects for performing predictive analytics and the development of learned components that are part of broader software systems. However, this explosive growth poses a real challenge for system builders in their effort to keep their tools up to date and well integrated with the rest of the ML software stack. Much of this new development is driven by direct customer feedback and anecdotal evidence. While this is indeed a valuable source of information, we are usually able to survey only a handful of sources and are therefore likely to end up with a biased or incomplete picture. To support the next generation of ML development systems, we need insights that can only be drawn from a larger-scale empirical analysis of the methods and habits of ML practitioners. In this work, we set out to capture this panorama through a wide-angle lens by performing the largest analysis of data science projects to date. Specifically, we analyze: (a) over 6M Python notebooks publicly available on GitHub, (b) over 2M enterprise data science (DS) pipelines developed within Microsoft, and (c) the source code and metadata of over 900 releases of 12 important DS libraries. Our analysis ranges from coarse-grained statistical characterizations to fine-grained analyses of library imports and pipelines, including comparative studies across datasets and over time. In this talk, we will go over the findings gathered from this extensive study. Furthermore, we will cover key insights and takeaways that could inform the design decisions behind next-generation ML development systems.
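To give a concrete flavor of the notebook analysis, one common way to study library imports at scale is to parse each code cell of a notebook with Python's `ast` module and collect the top-level imported module names. The sketch below is purely illustrative, not the study's actual tooling; the `notebook_imports` helper and its exact behavior are assumptions:

```python
import ast
import json


def notebook_imports(ipynb_json: str) -> set[str]:
    """Collect top-level imported module names from a notebook's code cells."""
    nb = json.loads(ipynb_json)
    modules: set[str] = set()
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = "".join(cell.get("source", []))
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # skip cells that are not plain Python (e.g. shell magics)
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                # "import numpy as np" -> "numpy"
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                # "from pandas import DataFrame" -> "pandas"
                modules.add(node.module.split(".")[0])
    return modules
```

Aggregating such per-notebook import sets over millions of notebooks is one way to obtain the kind of coarse library-usage statistics the abstract mentions.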

Bojan recently completed his PhD in the Systems Group at ETH Zurich, advised by Prof. Ce Zhang. His research focuses on discovering systematic methods for managing the machine learning development process. Specifically, he works on data debugging: performing targeted data quality improvements in order to improve the quality of end-to-end machine learning pipelines. He has done internships at Microsoft, Oracle, and Logitech. Before ETH, he obtained his master’s degree at EPFL in Lausanne and his bachelor’s at the University of Belgrade. This coming fall, he will join the group of Prof. Kun-Hsing Yu at Harvard Medical School to work on novel methods for debugging ML workflows used in biomedical applications.