Upcoming talks

10th Seminar

July 2, 2021, 4:00PM-5:30PM (CET)

The 10th seminar of DSDSD will feature talks by
Arun Kumar (UCSD)
Till Döhmen (RWTH Aachen University and Fraunhofer FIT).

Multi-Query Optimizations for Deep Learning Systems

Arun Kumar (UCSD)

Deep learning (DL) is growing in popularity for many advanced data analytics applications in enterprise, Web, scientific, and other domains. Naturally, resource efficiency of DL systems and the productivity of their users are pressing challenges to democratizing DL. In this talk, I present a new technical direction from my research that tackles such challenges with a database-inspired lens: higher-level specification and multi-query optimization (MQO). By exploiting higher-level abstractions of DL usage already inherent in practice, I show how we can automatically restructure the underlying execution to improve resource efficiency, reduce runtimes and costs, and in turn, improve user productivity. To this end, we marry fundamental computational and mathematical properties of DL methods and stochastic gradient descent with careful data system design and implementation.

Our approach benefits both DL inference and training, as I illustrate with three recent systems: Vista, with MQO for CNN transfer learning; Krypton, with MQO for CNN inference; and Cerebro, with MQO for parallel DL model selection. All of our techniques are easily integrated with existing DL systems (e.g., TensorFlow and PyTorch) without affecting their internal code, making practical adoption easier. I will conclude by highlighting some of our ongoing and upcoming work on generalizing Cerebro to additional higher-level DL tasks and to more execution environments, such as cloud-native settings.

Arun Kumar is an Assistant Professor in the Department of Computer Science and Engineering and the Halicioglu Data Science Institute at the University of California, San Diego. He is a member of the Database Lab and Center for Networked Systems and an affiliate member of the AI Group. His primary research interests are in data management and systems for machine learning/artificial intelligence-based data analytics. Systems and ideas based on his research have been released as part of the Apache MADlib open-source library, shipped as part of products from Cloudera, IBM, Oracle, and Pivotal, and used internally by Facebook, Google, LogicBlox, Microsoft, and other companies. He is a recipient of two SIGMOD research paper awards, a SIGMOD Research Highlight Award, three distinguished reviewer awards from SIGMOD/VLDB, the PhD dissertation award from UW-Madison CS, the IEEE TCDE Rising Star Award, an NSF CAREER Award, a Hellman Fellowship, a UCSD oSTEM Faculty of the Year Award, and research award gifts from Amazon, Google, Oracle, and VMware.

DuckDQ: Data Quality Validation for Machine Learning Pipelines

Till Döhmen (RWTH Aachen University and Fraunhofer FIT)

Data quality validation plays an important role in ensuring the correct behaviour of production machine learning (ML) applications and services. Observing a lack of existing solutions for quality control in medium-sized production ML systems, we developed DuckDQ, a lightweight and efficient Python library for protecting machine learning pipelines from data errors. It integrates well with existing scikit-learn ML pipelines and does not require a distributed computing environment or ML platform infrastructure. DuckDQ’s execution engine is built on top of DuckDB, the in-process OLAP database management system developed by the CWI Database Architectures group. The talk will give a brief introduction to the problem domain and highlight some of the key design choices behind DuckDQ, including why DuckDB was a natural fit.

Till Döhmen is a PhD student at RWTH Aachen University and Fraunhofer FIT. His research interests lie at the intersection of data management and machine learning systems. Till obtained his Master's in Artificial Intelligence from VU University and wrote his Master's thesis at the Database Architectures Group of CWI before taking on different industry roles in the area of data science.