DSDSD - Past Talks

Past talks

Oct 04, 2024

Peaceful Sharing while Training Models

Pınar Tözün

Deep learning training is an expensive process that extensively uses GPUs. However, not all model training saturates the resources of a single GPU. This problem gets exacerbated with each new GPU generation offering more hardware resources. In this talk, we will first investigate methods to share GPU resources across model training jobs by collocating these jobs on the same GPU to improve hardware utilization. Then, we will explore work sharing opportunities in the data pipelines of model training, furthering the benefits of collocated training.

Pınar Tözün is an Associate Professor at IT University of Copenhagen. Before ITU, she was a research staff member at IBM Almaden Research Center. Prior to joining IBM, she received her PhD from EPFL. Her thesis received ACM SIGMOD Jim Gray Doctoral Dissertation Award Honorable Mention in 2016. Her research focuses on resource-aware machine learning, performance characterization of data-intensive systems, and scalability and efficiency of data-intensive systems on modern hardware.

Oct 04, 2024

Data Processing on heterogeneous hardware

Gustavo Alonso (ETH Zurich)

Computing platforms are evolving rapidly along many dimensions: processors, specialization, disaggregation, acceleration, smart memory and storage, etc. Many of these developments are being driven by data science but also arise from the need to make cloud computing more efficient. From a practical perspective, the result we see today is a deluge of possible configurations and deployment options, most of them too new to have a precise idea of their performance implications and lacking proper support in the form of tools and platforms that can manage the underlying diversity. The growing heterogeneity is opening up many opportunities but also raising significant challenges. In the talk I will describe the trend towards specialization at all layers of the architecture, the possibilities it opens up, and demonstrate with real examples how to take advantage of heterogeneous computing platforms. I will also discuss a system we are building for data processing considering heterogeneity both on the software as well as on the hardware side.

Gustavo Alonso is a professor in the Department of Computer Science of ETH Zurich where he is a member of the Systems Group (www.systems.ethz.ch) and the head of the Institute of Computing Platforms. He leads the AMD HACC (Heterogeneous Accelerated Compute Cluster) deployment at ETH (https://github.com/fpgasystems/hacc), a research facility with several hundred users worldwide that supports exploring data center hardware-software co-design. His research interests include data management, cloud computing architecture, and building systems on modern hardware. Gustavo holds degrees in telecommunication from the Madrid Technical University and an MS and PhD in Computer Science from UC Santa Barbara. Prior to joining ETH, he was a research scientist at IBM Almaden in San Jose, California. Gustavo has received 4 Test-of-Time Awards for his research in databases, software runtimes, middleware, and mobile computing. He is an ACM Fellow, an IEEE Fellow, a Distinguished Alumnus of the Department of Computer Science of UC Santa Barbara, and has received the Lifetime Achievements Award from the European Chapter of ACM SIGOPS (EuroSys).

Mar 22, 2024

Efficient CSV Parsing - On the Complexity of Simple Things

Pedro Holanda

In this talk, we will revisit different CSV parsing implementations in DuckDB and compare them with the current implementation. The bulk of the talk discusses the design and implementation decisions in DuckDB’s current CSV parser. In particular, we will examine the parallel algorithm, the CSV buffer manager, and the transitions of the CSV state machine. Disclaimer: This talk is not for the faint of heart; some very exotically built CSV files will be depicted.
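
To give a rough feel for what the transitions of a CSV state machine look like, here is a minimal sketch in Python. It is not DuckDB's parser: the state names, the `parse_csv_line` function, and the dialect it handles (comma delimiter, double-quote quoting, doubled quotes as escapes, one record per call) are assumptions chosen purely for illustration.

```python
from enum import Enum, auto


class State(Enum):
    FIELD_START = auto()      # at the beginning of a field
    UNQUOTED = auto()         # inside an unquoted field
    QUOTED = auto()           # inside a quoted field
    QUOTE_IN_QUOTED = auto()  # just saw a quote while inside a quoted field


def parse_csv_line(line: str, delimiter: str = ",", quote: str = '"') -> list[str]:
    """Split one CSV record into fields with an explicit state machine."""
    fields: list[str] = []
    current: list[str] = []
    state = State.FIELD_START

    for ch in line:
        if state is State.FIELD_START:
            if ch == quote:
                state = State.QUOTED
            elif ch == delimiter:
                fields.append("")  # empty field
            else:
                current.append(ch)
                state = State.UNQUOTED
        elif state is State.UNQUOTED:
            if ch == delimiter:
                fields.append("".join(current))
                current = []
                state = State.FIELD_START
            else:
                current.append(ch)
        elif state is State.QUOTED:
            if ch == quote:
                state = State.QUOTE_IN_QUOTED
            else:
                current.append(ch)  # delimiters are literal inside quotes
        else:  # State.QUOTE_IN_QUOTED
            if ch == quote:
                current.append(quote)  # doubled quote means a literal quote
                state = State.QUOTED
            elif ch == delimiter:
                fields.append("".join(current))  # the quote closed the field
                current = []
                state = State.FIELD_START
            else:
                raise ValueError(f"unexpected {ch!r} after closing quote")

    if state is State.QUOTED:
        raise ValueError("unterminated quoted field")
    fields.append("".join(current))
    return fields


print(parse_csv_line('a,"b,c",d'))
```

Even this toy version needs a dedicated state for "quote seen inside a quoted field", which hints at why exotic files make real parsers considerably more involved.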

Pedro is an early contributor to DuckDB and currently works as a software engineer at DuckDB Labs, focusing on core and integration aspects of DBMS technology. He completed his PhD at the Database Architectures group at CWI, researching Indexes for Interactive Data Analysis.

Feb 23, 2024

Towards LLM-augmented Database Systems

Carsten Binnig

Recent LLMs such as GPT-4-turbo can answer user queries over multi-modal data including tables and thus seem to be able to even replace the role of databases in decision-making in the future. However, LLMs have severe limitations since query answering with LLMs not only has problems such as hallucinations but also causes high performance overheads even for small data sets. In this talk, I suggest a different direction where we use database technology as a starting point and extend it with LLMs where needed for answering user queries over multi-modal data. This not only allows us to tackle problems such as the performance overheads of pure LLM-based approaches for multi-modal question-answering but also opens up other opportunities for database systems.

Carsten Binnig is a Full Professor in the Computer Science department at TU Darmstadt and a Visiting Researcher at the Google Systems Research Group. Carsten received his Ph.D. at the University of Heidelberg in 2008. Afterwards, he spent time as a postdoctoral researcher in the Systems Group at ETH Zurich and at SAP working on in-memory databases. Currently, his research focus is on the design of scalable data systems on modern hardware as well as machine learning for scalable data systems. His work has been awarded a Google Faculty Award, as well as multiple best paper and best demo awards.

Nov 24, 2023

C3 - Compressing Correlated Columns

Thomas Glas

Open file formats typically use a set of lightweight compression schemes to compress individual columns, taking advantage of data patterns found within the values of each column. However, by compressing columns separately, we do not consider correlations that may exist between columns that may allow us to compress more effectively. Real-world datasets exhibit many such column correlations, and we research how they can be exploited for compression. In this talk, we introduce C3 (Compressing Correlated Columns), a new compression framework which can exploit correlations between columns for compression. We designed C3 on top of typical lightweight compression infrastructure, and added six new multi-column compression schemes which exploit correlations. We designed our multi-column compression schemes based on correlations we found in real-world datasets, but new compression schemes exploiting other types of correlations can easily be added. C3 uses a sampling-based algorithm to choose the most effective scheme to compress each column. We evaluated the effectiveness of C3 on the Public BI benchmark, containing real-world datasets, and achieved around 20% higher compression ratios compared to using only typical single-column compression schemes.
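
The general idea of exploiting a correlation between two columns can be sketched as follows. This is a hypothetical illustration and not necessarily one of C3's six schemes: the column names, the data, and the `for_bits` helper are all assumptions. It compares the bits per value needed when a column is encoded on its own with frame-of-reference (FOR) against encoding it as a difference to a correlated column.

```python
def for_bits(values: list[int]) -> int:
    """Bits per value under frame-of-reference (FOR) encoding:
    store min(values) once, then each value as an offset from it."""
    span = max(values) - min(values)
    return max(1, span.bit_length())


# Hypothetical correlated columns (days since some epoch):
# an order usually ships within a few days of being placed.
order_day = [19000, 19050, 19320, 19400, 19755]
ship_day = [19002, 19051, 19325, 19403, 19756]

# Single-column scheme: encode ship_day by itself.
single_column_bits = for_bits(ship_day)

# Multi-column scheme: encode ship_day as a difference to order_day.
diffs = [s - o for s, o in zip(ship_day, order_day)]
multi_column_bits = for_bits(diffs)

# Decoding needs the reference column plus the stored differences.
decoded = [o + d for o, d in zip(order_day, diffs)]

print(single_column_bits, multi_column_bits, decoded == ship_day)
```

The difference column has a much smaller value range than the original, so it needs fewer bits per value, at the cost of having to read the reference column during decompression.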

Thomas Glas is pursuing his master’s degree in computer science at the Technical University of Munich. He joined the Database Architecture Group at CWI in May this year to write his master’s thesis on columnar data compression.

Nov 24, 2023

Lambda functions in the duck's nest

Tania Bogatsch

Many SQL databases do not focus on efficient LIST-type support. Scalar functions and aggregations on LIST values often require additional unnesting steps or loading normalized data. However, nested input formats such as JSON are widespread in analytics. Efficient operations directly on these input formats can leverage the potential of SQL engines while increasing the system’s ease of use. However, using this potential is not trivial, as the LIST type’s underlying storage format and operations have to synergize with the relational execution model. DuckDB is a high-performance relational database system for analytics. In this talk, I’ll showcase DuckDB’s internal design choices to support LISTs efficiently and highlight our support of Python-style list comprehension directly in the SQL dialect.

I studied Computer Science from 2016 to 2022 in Ilmenau, Germany. After my Bachelor’s, I got the opportunity for a four-month internship at the CWI in Amsterdam, where I worked on adaptive expression reordering in DuckDB. In 2022, after finishing my studies, I returned to Amsterdam to work for DuckDB Labs as a software engineer.

Sep 27, 2023

Query Processing on Heterogeneous Systems

Viktor Rosenfeld

Today’s computing systems are highly heterogeneous, both in terms of the hardware that they are built from and the software that they run. This heterogeneity is both a benefit and a challenge. On the one hand, specialized processors provide the performance necessary to process increasing amounts of data, and specialized software systems enable programmers with different needs and expertise to extract knowledge from data. On the other hand, the need to integrate heterogeneous hardware and software into a cohesive whole increases the complexity of our computing infrastructure.

In this talk, we investigate how heterogeneous hardware and software impact query processing, and develop tools to manage this heterogeneity. First, we present a survey of query processing systems that target both CPUs and GPUs. Second, we investigate how sensitive CPUs and GPUs are to operator implementation details and describe how a query processing system can adapt its low-level operator implementation to the processor it runs on. Third, we describe how to execute Java-based user-defined functions inside a query processing engine written in C++. Our investigation shows that we often face similar challenges when integrating heterogeneous hardware and software.

Viktor Rosenfeld is a PhD student at the Database Systems and Information Management group at Technische Universität Berlin, supervised by Prof. Volker Markl. His research interests include optimizing query execution on modern heterogeneous processors.

Jul 07, 2023

LingoDB - an open compilation and optimization framework for sustainable data processing

Michael Jungmair (TUM)

There is a pressing need to execute an increasingly diverse set of workloads efficiently on modern hardware. Wouldn’t it be nice if we could apply powerful cross-domain optimizations and easily adjust to the fast-moving target that modern hardware is today? Unfortunately, many high-performance engines struggle to embrace this exciting opportunity, often because they are not flexible or extensible enough. Their monolithic internal designs require a high effort in order to support additional workloads transparently or to adapt swiftly to the changes in the underlying hardware. With LingoDB we build an open source, compiling data processing engine to meet these needs and expectations. We break up previously monolithic components such as query plans, query optimizers, and query executors to construct an open and flexible engine that can not only execute complex SQL queries, but also user-defined algorithms and operators, and that makes cross-domain optimizations feasible. Through code generation and transparent optimizations such as auto-parallelization, LingoDB ensures that even user-defined operators are executed efficiently on modern hardware, while keeping the implementation effort low.

Michael is a second-year PhD student at the Technical University of Munich. Supervised by Jana Giceva, he is researching novel architectures for database engines using compiler technology.

Jul 07, 2023

ALP - Adaptive Lossless floating-Point Compression

Leonardo Kuffó (CWI)

In data science, floating-point data is more prominent than in traditional database scenarios. IEEE 754 doubles do not exactly represent most real values, introducing rounding errors in computations and [de]serialization to text. These rounding errors inhibit the use of existing lightweight compression schemes such as Delta and Frame Of Reference (FOR), but recently new schemes were proposed: Gorilla, Chimp, Chimp128, PseudoDecimals (PDE), Elf and Patas. However, their compression ratios are not better than those of general-purpose compressors such as zstd, while [de]compression is much slower than Delta and FOR. We propose and evaluate ALP, which significantly improves on these previous schemes in both speed and compression ratio. We created ALP after carefully studying the datasets used to evaluate the previous schemes. To obtain speed, ALP is designed to fit vectorized execution. This turned out to be key for also improving the compression ratio, as we found in-vector commonalities to create compression opportunities. ALP is an adaptive scheme that uses a strongly enhanced version of PseudoDecimals for doubles that originated as decimals, and otherwise uses vectorized compression of the front bits. Its high speeds stem from our implementation in scalar code that auto-vectorizes, and an efficient two-stage compression algorithm that first samples row-groups and then vectors.
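
The observation that many doubles originated as decimals can be illustrated with a small sketch. The code below is not ALP's algorithm, only a simplification under stated assumptions: `encode_decimal`, `decode_decimal`, and `max_exponent` are hypothetical names, and the search simply tries increasing powers of ten until the double survives a round trip through an integer.

```python
import math


def encode_decimal(x: float, max_exponent: int = 14):
    """Return (digits, exponent) with digits / 10**exponent == x, or None.

    Assumes x is finite. Values that do not round-trip exactly would have
    to be stored some other way (for example, as uncompressed exceptions).
    """
    for exponent in range(max_exponent + 1):
        digits = round(x * 10**exponent)
        if digits / 10**exponent == x:
            return digits, exponent
    return None


def decode_decimal(digits: int, exponent: int) -> float:
    """Invert encode_decimal: the division reproduces the original double."""
    return digits / 10**exponent


# Hypothetical column of prices that were entered as decimals.
prices = [8.0605, 8.07, 8.1, 9.0]
encoded = [encode_decimal(p) for p in prices]
print(encoded)

# A double with no short decimal representation is not encodable this way.
print(encode_decimal(math.pi))
```

Once values are represented as integers, integer schemes such as Delta and FOR become applicable again, which is the property the rounding errors of raw doubles take away.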

MSc student at VU Amsterdam & UvA. Currently doing research on data compression at CWI in the Database Architectures research group. Former researcher on opinion mining and social networks analysis at ESPOL University (Ecuador). Former data engineer intern at CERN and Amazon EU.

Jul 06, 2023

Accurate Summary-based Cardinality Estimation Through the Lens of Cardinality Estimation Graphs

Semih Salihoglu (UWaterloo)

This is a two-part talk. The main part of the talk discusses a class of cardinality estimation techniques we refer to as optimistic estimators that store statistics about input relations and small-size joins. These estimators use these statistics in formulas that make independence and uniformity assumptions to make an estimate for a query. We focus on complex join queries and observe that for many queries there are multiple formulas to make an estimate and no obvious choice for which formula to use nor any clear advice from prior literature that has implemented these estimators. We show that these estimators can be modeled as different heuristics that pick bottom-to-top paths in a new framework called cardinality estimation graphs (CEGs). Using the framework, we can describe a suite of possible heuristics to make estimates and empirically evaluate which heuristic performs better on several large query benchmarks. For example, we show that on acyclic queries picking “pessimistic” paths/formulas that produce larger estimates is generally more accurate. We then show that CEGs can also model the novel pessimistic estimators that use linear programs based on worst-case query output bounds. This is done by changing the edge weights in the CEG of optimistic estimators to maximum degree weights. We therefore present a very intuitive and arguably much simpler interpretation for pessimistic estimators than prior work on pessimistic estimators.
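
To illustrate why several formulas can exist for the same query, here is a simplified sketch for a chain join R1 ⋈ R2 ⋈ R3 ⋈ R4. It is not the CEG framework itself: the `chain_estimates` function, the statistics, and the restriction to left-to-right extensions are assumptions made for illustration. Each extension step multiplies the current estimate by |Rm..Rk| / |Rm..Rj|, which applies the uniformity and independence assumptions to a stored small join.

```python
def chain_estimates(stats: dict, n: int) -> list[float]:
    """Enumerate optimistic estimates for the chain join R1 ⋈ ... ⋈ Rn.

    stats maps (i, j) to the known cardinality of the sub-chain Ri..Rj.
    Every returned value corresponds to one path of extension steps.
    """
    results = []

    def extend(j: int, estimate: float) -> None:
        if j == n:
            results.append(estimate)
            return
        for (m, k), card in stats.items():
            # Grow coverage from R1..Rj to R1..Rk using the stored join
            # Rm..Rk, which overlaps the covered part on Rm..Rj.
            if 1 < m <= j < k and (m, j) in stats:
                extend(k, estimate * card / stats[(m, j)])

    for (i, j), card in stats.items():
        if i == 1:
            extend(j, card)
    return results


# Hypothetical statistics for joins of up to three relations.
stats = {
    (1, 1): 50, (2, 2): 100, (3, 3): 200, (4, 4): 300,
    (1, 2): 400, (2, 3): 500, (3, 4): 800,
    (1, 3): 2500, (2, 4): 4000,
}

estimates = chain_estimates(stats, n=4)
print(sorted(estimates))
print("smallest:", min(estimates), "largest:", max(estimates))
```

With these made-up numbers the formulas disagree by more than a factor of two. Taking the largest value corresponds to the kind of heuristic the abstract describes as picking the path that produces larger estimates.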

In the second and shorter part of the talk, I will briefly discuss the vision of the Kùzu graph database management system that is being developed in my group as a system that implements state-of-the-art query processing and storage techniques for managing large graph databases.

Semih Salihoğlu is an Associate Professor and a David R. Cheriton Faculty Fellow at University of Waterloo. His research focuses on developing systems for managing, querying, or doing analytics on graph-structured data. His main ongoing systems project is Kùzu, which is a new graph database management system that integrates novel storage, indexing and query processing techniques. He holds a PhD from Stanford University and is a recipient of the VLDB 2018 Best Paper, the VLDB 2022 Best Experiments and Analysis Paper, and a 2023 SIGMOD Research Highlights award.

Jun 23, 2023

Decoupling Compute and Storage for Stream Processing Systems - Benefits, Limitations, and Insights

Yingjun Wu

Stream processing is an essential part of modern data infrastructure, but building an efficient and scalable stream processing system can be challenging. Decoupling compute and storage architecture has become an effective way to address these challenges.

In this talk, we discuss the benefits and limitations of the decoupled compute and storage architecture in stream processing systems. We find that, while decoupling compute and storage can help achieve infinite scalability, this approach can lead to data consistency and high latency issues, especially when processing complex continuous queries that require managing extra-large internal states. We then present our solution to address the challenges by implementing a tiered storage mechanism. The tiered storage approach utilizes a combination of high-performance and low-cost storage tiers to minimize data movement between the compute and storage layers while maintaining efficient processing. By the end of the talk, we will present experimental results that demonstrate the balance between performance and cost-efficiency achieved by our proposed approach.

Yingjun Wu is the founder of RisingWave Labs (https://www.risingwave.com/), a database company developing RisingWave, a distributed SQL database for stream processing. Before running the company, Yingjun was a software engineer at the Redshift team, Amazon Web Services, and a researcher at the Database group, IBM Almaden Research Center. Yingjun received his PhD degree from National University of Singapore, and was a visiting PhD at Carnegie Mellon University. He has been working in the field of stream processing and database systems for over a decade.

Jun 02, 2023

Shredding deeply nested JSON, one vector at a time

Laurens Kuiper

JSON is a popular semi-structured data format. Despite being semi-structured, users often want to analyze it in a structured way, e.g., by analyzing JSON log files to find out what their users are doing. Analytical database systems would be the tool of choice for this, but these systems often cannot process semi-structured data or the nested data such as OBJECTs and ARRAYs found in JSON. DuckDB, however, supports efficient columnar STRUCT and LIST types and, therefore, supports the same nestedness as JSON. Since 0.7.0, DuckDB supports reading JSON files directly as if they were tables, with automatic schema detection. In this talk, I will explain how DuckDB reads JSON and transforms it into vectors for efficient analytics.

Laurens is a PhD Student at the Database Architectures group at CWI in Amsterdam. He is also a Software Developer at DuckDB Labs. His research interests include OLAP systems, specifically graceful performance degradation when data sizes are larger than memory.

Jun 02, 2023

Implementing InfluxDB IOx, "from scratch" using Apache Arrow, DataFusion, and Rust

Andrew Lamb

It is easier than ever to build new analytic database systems. The trend towards deconstructed databases, high performance interchange standards, and high quality open source components means that cutting edge performance and connectivity is possible without building everything from scratch in a tightly integrated database system. In this talk, we will describe some key technologies such as Apache Arrow, Parquet, DataFusion, and Arrow Flight, and describe how we use them in InfluxData’s new Database system, InfluxDB IOx https://www.influxdata.com/blog/influxdb-engine/.

Andrew Lamb is a Staff Engineer at InfluxData, working on InfluxDB IOx, and a member of the Apache Arrow PMC. His experience ranges from startups such as Vertica to large multinational corporations and distributed open source projects, and he has paid leadership dues as an architect and VP. He holds an SB and MEng from MIT in Electrical Engineering and Computer Science.

Apr 14, 2023

Database Schemas in the Wild - What Can We Learn from a Large Corpus of Relational Database Schemas?

Till Döhmen

Tabular data collections, such as GitTables, are important sources of real-world tabular data. They provide training data for table representation learning approaches that advance the state-of-the-art for problems like semantic annotation, data imputation, and automated error detection. However, such datasets are limited to individual tables and do not contain schema information about database constraints (uniqueness, not nulls, etc.) or relationships to other tables. As real-world database schemas are hard to come by - with the largest public repository of databases containing about 150 relational databases - there is a need in the community for a new dataset. Thus, we created GitSchemas, a large corpus of database schema information extracted from SQL scripts in public code repositories, containing highly accurate schema information for more than 150k schemas, 1M tables (including column names, data types, and database constraints), and almost 600k foreign key relationships. We believe that schema information alone (without data) at this scale will be suitable for benchmarking, and improving existing approaches to a variety of relevant data management problems, such as foreign key detection and constraint predictions, while also presenting an opportunity to learn more about how database systems are used in practice.

Till Döhmen is a PhD student at RWTH Aachen University, guest researcher at the UvA Intelligent Data Engineering Lab (INDE Lab), and research engineer at Hopsworks. His research interests lie at the intersection of data management and machine learning systems.

Apr 14, 2023

Provenance Research in Gray Systems Lab at Microsoft

Fotis Psallidas

Provenance encodes information that connects datasets, their generation workflows, and associated metadata (e.g., who executed a query and when). As such, provenance is instrumental for a wide range of enterprise applications, including governance, auditing, and observability. As provenance becomes more prevalent across enterprise applications, research and engineering need to work in tandem to define, develop, and optimize provenance functionality. To this end, at Microsoft’s Gray Systems Lab (GSL), we have identified provenance capture, provenance querying, and provenance-aware applications as key domains for provenance research and research engineering. In this talk, I will present selected projects we have been working on in GSL, along with key challenges and lessons learned, per research area. Regarding provenance capture, I will first present OneProvenance, an engine that captures dynamic, coarse-grained provenance from database logs efficiently and effectively; OneProvenance is currently in production in Microsoft Purview, supporting dynamic, coarse-grained provenance extraction from Azure SQL. Furthermore, with the advent of machine learning and data science, provenance has also become important in support of enterprise-grade data science. In this direction, I will then present DSProvenance, an engine that can capture both static and dynamic provenance from data science pipelines. Regarding provenance querying, a main problem end-users currently face is that programming interfaces on top of data catalogs are hard to use and lead to hard-to-optimize implementations. To this end, I will present our recent work on PurviewQL, a SQL-based frontend for reading and writing provenance and metadata on top of data catalogs.
Finally, to highlight the importance of provenance across application domains, I will provide a brief overview of provenance-aware projects we work on in GSL, including query optimization, job scheduling, semantic type inference, code synthesis, and data quality.

Fotis joined Microsoft as an RSDE in the Gray Systems Lab (GSL) in Jan. 2019, with a focus on the intersection of data management, provenance, instrumentation, data science, and programming languages. In 2019, he also received his Ph.D. degree in Computer Science from Columbia University. He further received M.S. and M.Phil. degrees from Columbia in 2014 and 2017, respectively. In 2011, Fotis received his B.S. degree with Honors from the Department of Informatics and Telecommunications (DIT) of the National and Kapodistrian University of Athens (NKUA). During the summers of 2014 and 2015, Fotis joined the Data Management, Exploration, and Mining (DMX) group of Microsoft Research (MSR) as an intern.

Mar 24, 2023

Stardog query optimiser - Join ordering and cardinality estimations for graph queries

Pavel Klinov

Stardog is a commercial knowledge graph platform at the heart of which lies a graph database. It manages graph data as RDF and natively implements SPARQL 1.1 graph query language. This talk will briefly present the general architecture of the query engine and then will delve deep into the internals of the query optimiser, particularly, graph statistics and cardinality estimations for graph patterns. It will also briefly discuss reliability of cardinality estimations for different kinds of graph patterns and how it relates to robust query execution.

Unlike some early SPARQL systems, Stardog is not built on top of a relational database. Nonetheless, the talk will highlight how it takes advantage of many foundational aspects of relational query optimisation, such as rewriting algebraic expressions, cost-based optimisation, planning joins, etc. At the same time, some aspects, such as the lack of a rigid schema (for example, foreign key constraints or column data types), present unique challenges for the query optimiser.

Pavel Klinov has led the query engine team at Stardog since 2011 (with a short academic break in 2012-2015 to work on an automated reasoning project at the University of Ulm). He has overseen Stardog’s query engine evolve from a very simple heuristic optimiser in 2011 to a sophisticated cost-based optimiser in 2023 where most inputs for the cost model come from cardinality estimations. His team is responsible for both query optimisation work and implementing new features for the query language, such as recursive path queries. Prior to joining Stardog he earned his PhD on performance of reasoning algorithms for probabilistic logic at the University of Manchester, UK.

Mar 24, 2023

Efficient detection of multivariate correlations in static and streaming data

Jens d’Hondt

Correlation analysis is an invaluable tool in many domains, for better understanding the data and extracting salient insights. Most works to date focus on detecting high pairwise correlations. A generalization of this problem with known applications but no known efficient solutions involves the discovery of strong multivariate correlations, i.e., finding sets of vectors (typically 3 to 5 vectors) that exhibit a strong dependence when considered together. In this presentation we propose algorithms for detecting multivariate correlations in static and streaming data. Our algorithms, which rely on novel theoretical results, support four different correlation measures, and allow for additional constraints. Our extensive experimental evaluation examines the properties of our solution and demonstrates that our algorithms outperform the state-of-the-art, typically by two orders of magnitude. Check out supporting material at: https://correlationdetective.com/

Jens d’Hondt is a PhD candidate at the Database group of the Eindhoven University of Technology, supervised by dr. Odysseas Papapetrou. He is currently leading the Correlation Detective project, which aims to build a generic system for multivariate similarity search on large datasets. The project started in September 2021, and has since led to a publication at VLDB’22 in Sydney.

Mar 14, 2023

Bring Your own Kernel! Constructing High-Performance Data Management Systems from Components

Holger Pirk

Data Management Systems increasingly abandon monolithic architectures in favor of compositions of off-the-shelf components. Storage layers like Parquet and Arrow are combined with kernels like Velox and RocksDB, and optimizers like Calcite. The interfaces between these components are, however, the same as 30 years ago: highly efficient but rigid. This rigidity obstructs the adoption of novel ideas and techniques such as hardware acceleration, adaptive processing, learned optimization or serverless execution in real-world systems.

To address this impasse, I propose a novel approach to database composition inspired by early compiler-construction research: partial query evaluation. Under this paradigm, components communicate using a unified representation for data, code, execution plans and any combination thereof. I present an implementation of the approach in a new system called BOSS and illustrate how BOSS achieves a fully composable design that is effective, elegant and virtually overhead-free.

Holger Pirk is Associate Professor/Senior Lecturer in Computing at Imperial College London. His research is at the intersection of data management, compilers and computer architecture, targeting new applications like visualization, games, IoT and AI as well as new platforms like compilers, GPUs or FPGAs, as well as hardware-conscious algorithms, new data processing paradigms, algebraic optimizations, cost models and code generation techniques. He spent his PhD years in the Database Architectures group at CWI in Amsterdam, resulting in a PhD from the University of Amsterdam in 2015. Before joining Imperial, he was a Postdoc at the Database group at MIT CSAIL.

Feb 24, 2023

Leveraging Generative AI for Data Processing

Immanuel Trummer

The year 2022 has been marked by several breakthrough results in the domain of generative AI, culminating in the rise of tools like ChatGPT, able to solve a variety of language-related tasks without specialized training. In this talk, I outline novel opportunities in the context of data management, enabled by these advances. I discuss several recent research projects, aimed at exploiting advanced language processing for tasks such as parsing a database manual to support automated tuning, or mining data for patterns, described in natural language. Finally, I discuss our recent and ongoing research, aimed at synthesizing code for SQL processing in general-purpose programming languages, while enabling customization via natural language commands.

Immanuel Trummer is an assistant professor of computer science at Cornell University. His research covers various aspects of large-scale data management with the goal of making data analysis more efficient and more user-friendly. His publications were selected for “Best of VLDB”, for the ACM SIGMOD Research Highlight Award, and for publication in CACM as CACM Research Highlight. He is a recipient of the Google Faculty Research Award and an alumnus of the German National Academic Foundation.

Feb 24, 2023

Towards Parameter-Efficient Automation of Data Wrangling Tasks with Prefix-Tuning

David Vos

Data wrangling tasks for data integration and cleaning arise in virtually every data-driven application scenario nowadays. Recent research indicated the astounding potential of Large Language Models (LLMs) for such tasks. The automation of data wrangling with LLMs poses additional challenges, however, as hand-tuning task- and data-specific prompts for LLMs requires high expertise and manual effort. On the other hand, finetuning a whole LLM is more amenable to automation, but incurs high storage costs, as a copy of the LLM has to be maintained. In this work, we explore the potential of a lightweight alternative to finetuning an LLM, which automatically learns a continuous prompt. This approach, called prefix-tuning, does not require updating the original LLM parameters, and can therefore re-use a single LLM instance across tasks. At the same time, it is amenable to automation, as continuous prompts can be automatically learned with standard techniques. We evaluate prefix-tuning on common data wrangling tasks for tabular data such as entity matching, error detection, and data imputation, with promising results. We find that in six out of ten cases, prefix-tuning is within 2.3% of the performance of finetuning, even though it leverages only 0.39% of the parameter updates required for finetuning the full model. These results highlight the potential of prefix-tuning as a parameter-efficient alternative to finetuning for data integration and data cleaning with LLMs.

As a recent MSc. graduate in Artificial Intelligence, I briefly worked as a visiting researcher at the INDELab. My passions include developing advanced data pipelines and conducting research in natural language processing and machine learning. Presently, I am seeking a new ML Engineering opportunity.

Jul 08, 2022

Data Structures on Computational Storage Drives

Ilaria Battiston

This talk will present an introduction to data management systems on Computational Storage Drives, focusing on hardware devices implementing on-the-fly transparent compression. After giving an overview of their functioning and properties, we explore some commonly-used and custom-designed data structures, to understand their related benefits and issues when subject to compression. We then present some ways to optimize and tailor them in order to fully exploit the capabilities of the underlying storage device.

Ilaria is a newly-appointed PhD student in the Database Architecture group at CWI, focusing on OLAP data management techniques.

Jul 08, 2022

Data Science through the Looking Glass and what we found there

Bojan Karlas

The recent success of machine learning (ML) has led to an explosive growth of new systems and methods for supporting the ML development process. This process entails both data science projects for performing predictive analytics and the development of learned components that are part of broader software systems. However, this explosive growth poses a real challenge for system builders in their effort to develop the most up-to-date tools that are well integrated with the rest of the ML software stack. A lot of new development is driven by direct customer feedback and anecdotal evidence. Even though this is indeed a valuable source of information, given that we are usually able to survey only a handful of sources, we are often likely to end up with a biased or incomplete picture. To support the next generation of ML development systems, we require certain insights that can only be drawn from a larger scale empirical analysis of the methods and habits of ML practitioners. In this work, we set out to capture this panorama through a wide-angle lens, by performing the largest analysis of data science projects to date. Specifically, we analyze: (a) over 6M Python notebooks publicly available on GitHub, (b) over 2M enterprise DS pipelines developed within Microsoft, and (c) the source code and metadata of over 900 releases from 12 important DS libraries. The analysis we perform ranges from coarse-grained statistical characterizations to analysis of library imports, pipelines, including comparative studies across datasets and time. In this talk, we will go over the findings that we were able to gather from this extensive study. Furthermore, we will cover some key insights and takeaways which could be useful in supporting the design decisions for building next generation ML development systems.

Bojan is a recently graduated PhD from the Systems Group of ETH Zurich advised by Prof. Ce Zhang. His research focuses on discovering systematic methods for managing the machine learning development process. Specifically, he focuses on data debugging, which is the process of performing targeted data quality improvements in order to improve the quality of end-to-end machine learning pipelines. He has done internships at Microsoft, Oracle and Logitech. Before ETH, he obtained his master’s degree at EPFL in Lausanne and his bachelor’s at the University of Belgrade. The coming fall, he will be joining the group of Prof. Kun-Hsing Yu at Harvard Medical School to work on novel methods for debugging ML workflows used in biomedical applications.

Jun 24, 2022

Data Management for Emerging Problems in Large Networks

Arijit Khan

Graphs are widely used in many application domains, including social networks, knowledge graphs, biological networks, software collaboration, geo‐spatial road networks, interactive gaming, among many others. One major challenge for graph querying and mining is that non‐professional users are not familiar with the complex schema and information descriptions. It becomes hard for users to formulate a query (e.g., SPARQL or exact subgraph pattern) that can be properly processed by the existing systems. As an example, Freebase that powers Google’s knowledge graph alone has over 22 million entities and 350 million relationships in about 5428 domains. Before users can query anything meaningful over this data, they are often overwhelmed by the daunting task of attempting to even digest and understand it. Without knowing the exact structure of the data and the semantics of the entity labels and their relationships, can we still query them and obtain the relevant results? In this talk, I shall give an overview of our user‐friendly, embedding‐based, scalable techniques for querying big graphs, including heterogeneous networks. I shall conclude by discussing our newest progress about solving emerging problems on uncertain graphs, graph mining, and machine learning on graphs.

Arijit Khan is an associate professor in the Department of Computer Science, Aalborg University, Denmark. He earned his PhD from the Department of Computer Science, University of California, Santa Barbara, USA, and did a post-doc in the Systems group at ETH Zurich, Switzerland. He has been an assistant professor in the School of Computer Science and Engineering, Nanyang Technological University, Singapore. Arijit is the recipient of the prestigious IBM PhD Fellowship in 2012-13. He has published more than 60 papers in premier databases and data mining conferences and journals including ACM SIGMOD, VLDB, IEEE TKDE, IEEE ICDE, SIAM SDM, USENIX ATC, EDBT, The Web Conference (WWW), ACM WSDM, and ACM CIKM. Arijit co-presented tutorials on emerging graph queries and big graph systems at IEEE ICDE 2012, and at VLDB (2017, 2015, and 2014). He served in the program committee of ACM KDD, ACM SIGMOD, VLDB, IEEE ICDE, IEEE ICDM, EDBT, ACM CIKM, and in the senior program committee of WWW. Arijit served as the co-chair of the Big-O(Q) workshop co-located with VLDB 2015 and wrote a book on uncertain graphs in Morgan & Claypool’s Synthesis Lectures on Data Management. He contributed invited chapters and articles on big graphs querying and mining in the ACM SIGMOD blog, Springer Handbook of Big Data Technologies, and in Springer Encyclopedia of Big Data Technologies. He was invited to give tutorials and talks across 10 countries, including in the National Institute of Informatics (NII) Shonan Meeting on “Graph Database Systems: Bridging Theory, Practice, and Engineering”, 2018, Japan, Asia Pacific Web and Web-Age Information Management Joint Conference on Web and Big Data (APWeb-WAIM 2017), International Conference on Management of Data (COMAD 2016), and in the Dagstuhl Seminar on graph algorithms and systems, 2014 and 2019, Schloss Dagstuhl - Leibniz Center for Informatics, Germany.
Dr Khan is serving as an associate editor of IEEE TKDE 2019-now, proceedings chair of EDBT 2020, and IEEE ICDE TKDE poster track co-chair 2023.

May 27, 2022

mlinspect - Lightweight Inspection of Native Machine Learning Pipelines

Stefan Grafberger

Machine Learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policy makers, scientists, and the media. ML applications are often brittle with respect to their input data, which leads to concerns about their correctness, reliability, and fairness. In this talk, I will describe mlinspect, a library that helps with tasks like diagnosing and mitigating technical bias that may arise during preprocessing steps in an ML pipeline. The key idea is to extract a directed acyclic graph representation of the dataflow from an ML pipeline, and to use this representation to automatically instrument the code with predefined inspections. These inspections are based on a lightweight annotation propagation approach to propagate metadata such as lineage information from operator to operator. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines and does not require manual code instrumentation. I will discuss the design and implementation of the mlinspect library, and discuss performance aspects.

I am a second-year PhD student at the University of Amsterdam, conducting research at the intersection of data management and machine learning. I obtained my master’s in software engineering from TU Munich and wrote my master’s thesis with Julia Stoyanovich from NYU and Sebastian Schelter from UvA. In the past, I interned at Amazon Research and Oracle Labs, and worked as a research assistant for the Database Group at TU Munich. At these places, I worked on Deequ, PGX, and Umbra.

May 27, 2022

Building machine learning systems for the era of data-centric AI

Ce Zhang

Recent advances in machine learning systems have made it much easier to train ML models given a training set. However, this does not mean that the job of an MLDev or MLOps engineer is any easier. As we sail past the era in which the major goal of ML platforms is to support the building of models, we might have to think about our next generation ML platforms as something that supports the iteration of data. This is a challenging task, which requires us to take a holistic view of data quality, data management, and machine learning altogether. In this talk, I will discuss some of our thoughts in this space, illustrated by several recent results that we obtained in data debugging and data cleaning for ML models to systematically enforce their quality and trustworthiness.

Ce is an Assistant Professor in Computer Science at ETH Zurich. The mission of his research is to make machine learning techniques widely accessible, while being cost-efficient and trustworthy, to everyone who wants to use them to make our world a better place. He believes in a system approach to enabling this goal, and his current research focuses on building next-generation machine learning platforms and systems that are data-centric, human-centric, and declaratively scalable. Before joining ETH, Ce finished his PhD at the University of Wisconsin-Madison and spent another year as a postdoctoral researcher at Stanford, both advised by Christopher Ré. His work has received recognitions such as the SIGMOD Best Paper Award, SIGMOD Research Highlight Award, Google Focused Research Award, and an ERC Starting Grant, and has been featured and reported by Science, Nature, the Communications of the ACM, and various media outlets such as Atlantic, WIRED, Quanta Magazine, etc.

May 13, 2022

Algorithms for Relational Knowledge Graphs

Martin Bravenboer

RelationalAI is the next-generation database system for new intelligent data applications based on relational knowledge graphs. RelationalAI complements the modern data stack by allowing data applications to be implemented relationally and declaratively, leveraging knowledge/semantics for reasoning, graph analytics, relational machine learning, and mathematical optimization workloads. RelationalAI as a relational and cloud native system fits naturally in the modern data stack, providing virtually infinite compute and storage capacity, versioning, and a fully managed system. RelationalAI supports the workload of data applications with an expressive relational language (called Rel), novel join algorithms and JIT compilation suitable for complex computational workloads, semantic optimization that leverages knowledge to optimize application logic, and incrementality of the entire system for both data (IVM) and code (live programming). The system utilizes immutable data structures, versioning, parallelism, distribution, out-of-core memory management to support state-of-the-art workload isolation and scalability for simple as well as complex business logic. In our experience, RelationalAI’s expressive, relational, and declarative language leads to a 10-100x reduction in code for complex business domains. Applications are developed faster, with superior quality by bringing non-technical domain experts into the process and by automating away complex programming tasks. We discuss the core innovations that underpin the RelationalAI system: an expressive relational language, worst-case optimal join algorithms, semantic optimization, just-in-time compilation, schema discovery and evolution, incrementality and immutability.

Martin Bravenboer is VP Engineering at RelationalAI where he leads the development of the RelationalAI system. Before RelationalAI, he was CTO at LogicBlox. As a postdoctoral researcher with Prof. Yannis Smaragdakis, he developed the Doop framework for declarative and precise points-to analysis that uses the LogicBlox system. Martin obtained his PhD at Utrecht University in the area of language design and compiler construction.

May 13, 2022

The LDBC Social Network Benchmark - Business Intelligence workload

Gábor Szárnyas

Graph data management techniques are employed in several domains such as finance and enterprise knowledge representation for evaluating graph pattern matching and path finding queries on large data sets. Supporting such queries efficiently yields a number of unique requirements, including the need for a concise query language and graph-aware query optimization techniques. The goal of the Linked Data Benchmark Council (LDBC) is to design standard benchmarks which capture representative categories of graph data management problems, making the performance of systems comparable and facilitating competition among vendors. This talk describes the Business Intelligence workload, a graph OLAP benchmark with global graph queries that use pattern matching, path finding, and aggregation operations. The workload is executed on a dynamic social network graph updated in daily batches of inserts and deletes. We discuss the design process of the benchmark and present its first stable version.

Gábor Szárnyas is a post-doctoral researcher. He obtained his PhD in software engineering in 2019, focusing on the intersection of object-oriented graph models and property graphs. He currently works on efficient graph processing techniques, including formulating graph algorithms in the language of linear algebra (GraphBLAS), implementing graph query engines (SQL/PGQ), and designing graph benchmarks. He serves on the steering committee of the Linked Data Benchmark Council.

Apr 29, 2022

Glidesort - Efficient In-Memory Adaptive Stable Sorting on Modern Hardware

Orson Peters

Sorting is one of the most common algorithms used in programming, and virtually every standard library contains a routine for it. Despite also being one of the oldest problems out there, surprisingly large improvements are still being found. Some of these are fundamental novelties, and others are optimizations matching the changing performance landscape in modern hardware.

In this talk we present Glidesort, a general purpose in-memory stable comparison sort. It is fully adaptive to both pre-sorted runs in the data similar to Timsort, and low-cardinality inputs similar to Pattern-defeating Quicksort, making it to our knowledge the first practical stable sorting algorithm fully adaptive in both measures. Glidesort achieves a 3x speedup over Rust’s standard library Timsort routine on sorting random 32-bit integers, with the speedup breaking the order of magnitude barrier for realistic low-cardinality distributions. It achieves this without the use of SIMD, processor-specific intrinsics or assumptions about the type being sorted: it is a fully generic sort taking an arbitrary comparison operator.

Using Glidesort as the motivating example, we discuss the principles of efficient stable in-memory partitioning and merging on modern hardware. Particular attention is paid to eliminating branches and interleaving independent parallel loops to efficiently use our modern deeply-pipelined superscalar processors. The lessons learned here are widely applicable to efficient data processing outside of sorting.

Orson Peters is a first-year PhD student at the Database Architecture group at CWI Amsterdam. His research interests are very broad, and span low-level optimization, compression, information theory, cryptography, (parallel) data structures, string processing and more. Sorting is a particular interest: he published pdqsort in 2015, which is now the default unstable sorting algorithm in Rust and Go. His alma mater is Leiden University, where he did his BSc and MSc in Computer Science, specializing in Artificial Intelligence.

Apr 29, 2022

Taking a Peek under the Hood of Snowflake's Metadata Management

Max Heimel

This talk provides an overview of Snowflake’s architecture that was designed to efficiently support complex analytical workloads in the cloud. Looking at the lifecycle of micro partitions, this talk explains pruning, zero-copy cloning, and instant time travel. Pruning is a technique to speed up query processing by filtering out unnecessary micro partitions during query compilation. Zero-copy cloning allows the creation of logical copies of the data without duplicating physical storage. Instant time travel enables the user to query data “as of” a time in the past, even if the current state of the data has changed. We also describe how we utilize cloud resources to automatically reorganize (“cluster”) micro partitions in the background in order to achieve consistent query performance without affecting running customer queries.

Max Heimel holds a PhD in Computer Science from the Database and Information Management Group at TU Berlin. He joined Snowflake in 2015 and is working as a Software Engineer in the areas of query execution and query optimization. Before joining Snowflake, Max worked at IBM and spent several internships at Google.

Mar 18, 2022

Parallel Grouped Aggregation in DuckDB

Hannes Mühleisen

Grouped aggregation is a core data analysis operation. It is particularly important for large-scale data analysis (“OLAP”) because it is useful for computing statistical summaries of huge tables. The main issue when computing grouping results is that the groups can occur in the input table in any order, making it difficult to efficiently match grouping keys. Of course, the input can be sorted, but this is computationally expensive. Building a hash table has a lower computational complexity than sorting and is therefore generally preferred, but this requires collision handling. How does parallelism work together with hash tables? In general, the answer is unfortunately: “Badly”. Hash tables are delicate structures that do not handle parallel modifications well. In this talk we present DuckDB’s highly optimized parallel aggregation capability for fast and scalable summarization. We discuss many of the optimizations in DuckDB’s hash aggregate implementation that allow it to efficiently scale to many groups, rows and threads. Finally, we discuss some ideas for future work.
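
To make the hash-table approach concrete, here is a minimal Python sketch of grouped aggregation. It is a generic illustration, not DuckDB's implementation: the function names and the strategy of giving each worker a private hash table that is merged afterwards are assumptions chosen to sidestep concurrent modification of a shared table.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def hash_aggregate(rows):
    """Sum values per grouping key in one pass, using a dict as the hash table."""
    sums = defaultdict(int)
    for key, value in rows:
        sums[key] += value
    return dict(sums)


def merge_partials(partials):
    """Combine per-worker partial sums into one final result."""
    merged = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)


def parallel_hash_aggregate(rows, num_chunks=4):
    """Hypothetical parallel variant: each chunk is aggregated into its own
    private hash table, so no table is ever modified by two workers at once."""
    chunk_size = max(1, (len(rows) + num_chunks - 1) // num_chunks)
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        partials = list(pool.map(hash_aggregate, chunks))
    return merge_partials(partials)


# Groups arrive in arbitrary order; no sorting is needed.
rows = [("b", 2), ("a", 1), ("b", 3), ("c", 5), ("a", 4)]
single = hash_aggregate(rows)
parallel = parallel_hash_aggregate(rows, num_chunks=2)
```

Both calls produce the same per-group sums. The merge step is the price paid for avoiding shared state, and it grows with the number of distinct groups.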

Hannes Mühleisen is a senior researcher at the Database Architectures group within the Centrum Wiskunde & Informatica (CWI). He is also the co-founder and CEO of DuckDB Labs. He received his PhD at the Freie Universität Berlin in 2013.

Mar 18, 2022

Learned DBMS Components 2.0 - From Workload-Driven to Zero-Shot Learning

Carsten Binnig

Database management systems (DBMSs) are the backbone for managing large volumes of data efficiently and thus play a central role in business and science today. For providing high performance, many of the most complex DBMS components such as query optimizers or schedulers involve solving non-trivial problems. To tackle such problems, very recent work has outlined a new direction of so-called learned DBMS components where core parts of DBMSs are being replaced by machine learning (ML) models, which has been shown to provide significant performance benefits. However, a major drawback of the current workload-driven learning approaches to enable learned DBMS components is that they not only cause very high overhead for training an ML model to replace a DBMS component, but that the overhead occurs repeatedly, which renders these approaches far from practical. Hence, in this talk we present our vision to tackle the high costs and inflexibility of workload-driven learning called Learned DBMS Components 2.0. First, we introduce data-driven learning where the idea is to learn the data distribution over a complex relational schema. In contrast to workload-driven learning, no large workload has to be executed on the database to gather training data. While data-driven learning has many applications such as cardinality estimation or approximate query processing, many DBMS tasks such as physical cost estimation cannot be supported. We thus propose a second technique called zero-shot learning which is a general paradigm for learned DBMS components. Here, the idea is to train models that generalize to unseen data sets out-of-the-box. The idea is to train a model that has observed a variety of workloads on different data sets and can thus generalize. Initial results on the task of physical cost estimation suggest the feasibility of this approach. Finally, we discuss further opportunities which are enabled by zero-shot learning.

Carsten Binnig is a Full Professor in the Computer Science department at TU Darmstadt and an Adjunct Associate Professor in the Computer Science department at Brown University. Carsten received his PhD at the University of Heidelberg in 2008. Afterwards, he spent time as a postdoctoral researcher in the Systems Group at ETH Zurich and at SAP working on in-memory databases. Currently, his research focus is on the design of scalable data management systems, databases and modern hardware as well as machine learning for scalable systems. His work has been awarded with a Google Faculty Award, as well as multiple best paper and best demo awards for his research.

Mar 04, 2022

Efficient collaborative analytics with no information leakage - An idea whose time has come.

Vasiliki Kalavri

Enabling secure outsourced analytics with practical performance has been a long-standing research challenge in the database community. In this talk, I will present our work towards realizing this vision with Secrecy, a new framework for secure relational analytics in untrusted clouds. Secrecy targets offline collaborative analytics, where data owners (hospitals, companies, research institutions, or individuals) are willing to allow certain computations on their collective private data, provided that data remain siloed from untrusted entities. To ensure no information leakage and provable security guarantees, Secrecy relies on cryptographically secure Multi-Party Computation (MPC). Instead of treating MPC as a black box, like prior works, Secrecy exposes the costs of oblivious queries to the planner and employs novel logical, physical, and protocol-specific optimizations, all of which are applicable even when data owners do not participate in the computation. As a result, Secrecy outperforms state-of-the-art systems and can comfortably process much larger datasets with good performance and modest use of resources.

Vasiliki (Vasia) Kalavri is an Assistant Professor of Computer Science at Boston University, where she leads the Complex Analytics and Scalable Processing (CASP) Systems lab. Vasia and her team enjoy doing research on multiple aspects of (distributed) data-centric systems. Recently, they have been working on self-managed systems for data stream processing, systems for scalable graph ML, and MPC systems for private collaborative analytics. Before joining BU, Vasia was a postdoctoral fellow at ETH Zurich and received a joint PhD from KTH (Sweden) and UCLouvain (Belgium).

Mar 04, 2022

Opening the Black Box of Internal Stream Processor State

Jim Verheijde

Distributed streaming dataflow systems have evolved into scalable and fault-tolerant production-grade systems. Their applicability has departed from the mere analysis of streaming windows and complex-event processing, and now includes cloud applications and machine learning inference. Although the advancements in the state management of streaming systems have contributed significantly to their maturity, the internal state of streaming operators has been so far hidden from external applications. However, that internal state can be seen as a materialized view that can be used for analytics, monitoring, and debugging.

In this work we argue that exposing the internal state of streaming systems to outside applications by making it queryable opens the road for novel use cases. To this end, in this talk I will introduce S-QUERY: an approach and reference architecture where the state of stream processors can be queried - either live or through snapshots, achieving different isolation levels. I will show how this new capability can be implemented in an existing open-source stream processor, and how queryable state can affect the performance of such a system.

Jim Verheijde recently joined IMC as a software engineer. Prior to this he received his master’s degree in computer science and bachelor’s degree in computer science engineering from Delft University of Technology. Jim also completed a minor at the National University of Singapore and joined the summer school program at Tsinghua University in China. During his studies in Delft, Jim took part in Forze for multiple years where he helped design and build the software for the world’s fastest hydrogen race car.

Dec 17, 2021

OneGraph to Rule them All

Michael Schmidt (Amazon Web Services)

At Amazon Neptune, we work backwards from our customers. One insight that we got from listening to customers is that, in many cases where they explore Neptune as a solution to their problems, it’s primarily “just about graph”: they want to use the relationships in their data to solve business problems using knowledge graphs, identity graphs, fraud graphs, and more.

Dec 17, 2021

Leveraging temporal and topological selectivities in temporal-clique subgraph query processing

Kaijie Zhu (TU Eindhoven)

We study the problem of temporal-clique subgraph pattern matching. In such patterns, edges are required to jointly overlap in time within a given temporal window in addition to forming a topological sub-structure. This problem arises in many application domains, e.g., in social networks, life sciences, smart cities, telecommunications, and others.
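
The temporal condition described above can be illustrated with a small Python sketch. It rests on stated assumptions that are not taken from the talk: each edge carries a closed validity interval `(start, end)`, "jointly overlap" means that all intervals share at least one common time point, and the optional window restricts where that common point may lie. The function name is hypothetical.

```python
def jointly_overlap(intervals, window=None):
    """Return True if all closed intervals share a common time point.

    If `window` is given as (start, end), the common point must also
    fall inside the window.
    """
    latest_start = max(start for start, _ in intervals)
    earliest_end = min(end for _, end in intervals)
    if window is not None:
        latest_start = max(latest_start, window[0])
        earliest_end = min(earliest_end, window[1])
    # A non-empty intersection exists exactly when the latest start
    # does not come after the earliest end.
    return latest_start <= earliest_end


# Validity intervals of the three edges of a triangle pattern match.
triangle_edges = [(1, 5), (3, 8), (4, 6)]
overlaps = jointly_overlap(triangle_edges)
overlaps_in_window = jointly_overlap(triangle_edges, window=(6, 9))
```

Here the three edges are all valid during [4, 5], so the first check succeeds, while restricting to the window [6, 9] rules the match out. A topological match would still have to be found separately; this only covers the temporal test.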

Nov 26, 2021

Push-Based Execution in DuckDB

Mark Raasveldt (CWI)

DuckDB has recently switched to a push-based execution model from the initial pull-based execution model.

Nov 26, 2021

Building Advanced SQL Analytics From Low-Level Plan Operators

Thomas Neumann (Technical University of Munich)

Analytical queries virtually always involve aggregation and statistics. SQL offers a wide range of functionalities to summarize data such as associative aggregates, distinct aggregates, ordered-set aggregates, grouping sets, and window functions. In this work, we propose a unified framework for advanced statistics that composes all flavors of complex SQL aggregates from low-level plan operators.

Nov 12, 2021

Fastest table sort in the West - Redesigning DuckDB’s sort

Laurens Kuiper (CWI)

Sorting is one of the most well studied problems in Computer Science. Research in this area forms the basis of sorting in database systems but focuses mostly on sorting large arrays. Sorting is more complex for relational data as many different types need to be supported, as well as NULL values. There can also be multiple order clauses.
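
The extra requirements of relational sorting (NULL values and multiple order clauses) can be seen in a naive Python sketch. This is only an illustration of the problem, not DuckDB's approach: the function name, the row representation as dictionaries, and the `(column, descending, nulls_first)` clause format are assumptions made for the example.

```python
def sort_rows(rows, order_by):
    """Sort dict rows by several order clauses.

    Each clause is (column, descending, nulls_first). Clauses are applied
    from last to first with a stable sort, so earlier clauses take priority.
    """
    result = list(rows)
    for column, descending, nulls_first in reversed(order_by):
        # NULLs (None) cannot be compared with values, so set them aside.
        non_null = [row for row in result if row[column] is not None]
        nulls = [row for row in result if row[column] is None]
        # list.sort is stable, also with reverse=True.
        non_null.sort(key=lambda row: row[column], reverse=descending)
        result = nulls + non_null if nulls_first else non_null + nulls
    return result


rows = [
    {"city": "Delft", "age": 30},
    {"city": None, "age": 25},
    {"city": "Amsterdam", "age": None},
    {"city": "Delft", "age": 22},
]
# Roughly: ORDER BY city ASC NULLS LAST, age DESC NULLS LAST
ordered = sort_rows(rows, [("city", False, False), ("age", True, False)])
```

Each clause costs a full pass over the data here, which hints at why a database system would want a more careful design than repeated array sorts.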

Nov 12, 2021

CrocodileDB: Resource Efficient Database Execution

Aaron J. Elmore (University of Chicago)

The coming end of Moore’s law requires that data systems be more judicious with computation and resources as the growth in data outpaces the availability of computational resources. Current database systems are eager and aggressively consume resources to immediately and quickly complete the task at hand.

Oct 15, 2021

Data Stations: Combining Data, Compute, and Market Forces

Raul Castro Fernandez (University of Chicago)

In this talk, I will present preliminary work on a new architecture (Data Station) to facilitate data sharing within and across organizations. Data Stations depart from modern data lakes in that both data and derived data products, such as machine learning models, are sealed and cannot be directly seen, accessed, or downloaded by anyone.

Oct 01, 2021

Optimizing machine learning prediction queries and beyond on modern data engines

Konstantinos Karanasos (Microsoft's Gray Systems Lab - Azure Data's applied research group)

Prediction queries are widely used across industries to perform advanced analytics and draw insights from data. They include a data processing part (e.g., for joining, filtering, cleaning, featurizing the datasets) and a machine learning (ML) part invoking one or more trained models to perform predictions. These parts have so far been optimized in isolation, leaving significant opportunities for optimization unexplored.

Oct 01, 2021

Optimisation of Inference Queries

Ziyu Li (TU Delft)

The wide adoption of machine learning (ML) in diverse application domains is resulting in an explosion of available models described by, and stored in, model repositories. In application contexts where inference needs are dynamic and subject to strict execution constraints – such as in video processing – the manual selection of an optimal set of models from a large model repository is a nearly impossible task and practitioners typically settle for models with a good average accuracy and performance.

Jul 16, 2021

Data-Intensive Systems in the Microsecond Era

Pinar Tozun (ITU Copenhagen)

The late 2000s and early 2010s saw the rise of data-intensive systems optimized for in-memory execution. Today, it has become increasingly clear that just optimizing for main memory is neither economically viable nor strictly necessary for high performance. Modern SSDs, such as Z-NAND and Optane, can access data at a latency of around 10 microseconds.

Jul 16, 2021

Charting the Design Space of Query Execution using VOILA

Tim Gubner (CWI)

Database architecture, while having been studied for four decades now, has delivered only a few designs with well-understood properties. These few are followed by most actual systems. Acquiring more knowledge about the design space is a very time-consuming process that requires manually crafting prototypes with a low chance of generating material insight.

Jul 02, 2021

DuckDQ: Data Quality Validation for Machine Learning Pipelines

Till Döhmen (RWTH Aachen University and Fraunhofer FIT)

Data quality validation plays an important role in ensuring the correct behaviour of production machine learning (ML) applications and services. Observing a lack of existing solutions for quality control in medium-sized production ML systems, we developed DuckDQ: a lightweight and efficient Python library for protecting machine learning pipelines from data errors.

Jun 18, 2021

Teseo and the Analysis of Structural Dynamic Graphs

Dean De Leo (CWI)

Teseo is a new system for the storage and analysis of dynamic structural graphs in main-memory, with the addition of transactional support. It introduces a novel design based on sparse arrays, large arrays interleaved with gaps, and a fat tree, where the graph is ultimately stored. Our design contrasts with early systems for the analysis of dynamic graphs, which often lack transactional support and are anchored to a vertex table as a primary index.

Jun 18, 2021

MxTasks: How to Make Efficient Synchronization and Prefetching Easy

Jens Teubner (TU Dortmund)

The hardware environment has changed rapidly in recent years: Many cores, multiple sockets, and large amounts of main memory have become a commodity. To benefit from these highly parallel systems, the software has to be adapted. Sophisticated latch-free data structures and algorithms are often meant to address the situation.

Jun 04, 2021

Evaluating Matching Techniques with Valentine

Christos Koutras (Delft University of Technology)

Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use.

Jun 04, 2021

Making Distributed Deep Learning Adaptive

Peter Pietzuch (Imperial College London)

When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must configure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence; system parameters (e.g. the number of workers and their communication topology) impact training performance. Some of these parameters, such as the number of workers, may also change in elastic machine learning scenarios. In current systems, adapting such parameters during training is ill-supported.

May 21, 2021

FASTER: Simplifying Storage for the Modern Edge-Cloud

Badrish Chandramouli (Microsoft Research)

Managing state efficiently in modern applications written for the cloud and edge is hard. In the FASTER project, we have been creating building blocks such as FasterKV and FasterLog to alleviate this problem using techniques such as epoch protection, tiered storage, and asynchronous recoverability.

May 21, 2021

Distributed Transactions on Serverless Stateful Functions

Martijn De Heus (Delft University of Technology)

The cloud promised to make computing accessible by making compute resources easily available and by simplifying computing challenges. While the former has definitely been achieved, there is room for improvement in simplifying (distributed) computing challenges. Deployment, scalability and state management are not straightforward to achieve using most cloud offerings. Modern microservice architectures in particular face many challenges that distract developers from providing business value. Serverless computing models (like Function-as-a-Service) can greatly simplify deployment and scalability challenges; however, the problem of state management remains unsolved by most cloud offerings.

May 07, 2021

Materialize: SQL Incremental View Maintenance

Frank McSherry (Materialize, Inc.)

While OLTP engines excel at maintaining invariants under transactional workloads, and OLAP engines excel at ad-hoc analytics, the relational database is not presently an excellent tool for maintaining the results of computations as data change. This space is currently occupied largely by microservices, bespoke tools that suffer from all the problems you might expect of systems that do not provide many of the ACID properties, and which anecdotally consume engineering departments with their maintenance overhead.

May 07, 2021

Clonos: Consistent Causal Recovery for Highly-Available Streaming Dataflows

Pedro Silvestre (Imperial College London)

Stream processing forms the backbone of modern businesses, being employed for mission-critical applications such as real-time fraud detection, car-trip fare calculations, traffic management, and stock trading. Large-scale applications are executed by scale-out stream processing systems on thousands of long-lived operators, which are subject to failures.

Apr 23, 2021

TED Learn: Towards Technology-Enabled Database Education

Sourav Bhowmick (NTU Singapore)

There is continuous demand for database-literate professionals in today’s market due to the widespread usage of relational database management systems (RDBMS) in the commercial world. Such commercial demand has played a pivotal role in the offering of a database systems course as part of an undergraduate computer science (CS) degree program in major universities around the world. Furthermore, not all working adults dealing with RDBMS have taken an undergraduate database course. Hence, they often need to undergo on-the-job training or attend relevant courses in higher institutes of learning to acquire database literacy. Database courses in major institutes rely on textbooks, lectures, and off-the-shelf RDBMS to impart relevant knowledge such as how SQL queries are processed.

Apr 23, 2021

PG-Keys: Keys for Property Graphs

George Fletcher (TU/e)

I report on a community effort between industry and academia to shape the future of property graph constraints. The standardization for a property graph query language is currently underway through the ISO Graph Query Language (GQL) project. Our position is that this project should pay close attention to schemas and constraints, and should focus next on key constraints.

Apr 09, 2021

Factorization Matters in Large Graphs

Nikolay Yakovets (TU/e)

Evaluation of complex graph pattern queries on large graphs often leads to “explosion” of intermediate results (IR) which, in turn, considerably slows down query processing. In this talk, I will present WireFrame, our recent two-step factorization-based solution which aims to drastically reduce the IR during query processing.

Apr 09, 2021

Aggregation Support for Modern Graph Analytics

Alin Deutsch (UC San Diego)

In this talk I will describe how GSQL, TigerGraph’s graph query language, supports the specification of aggregation in graph analytics. GSQL makes several unique design decisions with respect to the expressive power, the semantics, and the evaluation complexity of the specified aggregation.

Mar 26, 2021

LeanStore: A High-Performance Storage Engine for Modern Hardware

Viktor Leis (Friedrich Schiller University Jena)

LeanStore is a high-performance OLTP storage engine optimized for many-core CPUs and NVMe SSDs. The goal of the project is to achieve performance comparable to in-memory systems when the data set fits into RAM, while being able to fully exploit the bandwidth of fast NVMe SSDs for large data sets.

Mar 26, 2021

Vectorized query processing over encrypted data with DuckDB and Intel SGX

Sam Ansmink (CWI, UvA, and VU)

Data confidentiality is an increasingly important requirement for customers outsourcing databases to the cloud. The common approach to achieve data confidentiality in this context is by using encryption. However, processing queries over encrypted data securely and efficiently remains an open issue. To this day, many different approaches to designing encrypted database management systems (EDBMS) have been suggested, for example by using homomorphic encryption or trusted execution environments such as Intel SGX.

Mar 12, 2021

Three techniques for exploiting string compression in data systems

Peter Boncz (CWI & VU)

Actual data in real-life databases is often in the form of strings. Strings take significantly more volume than fixed-size data, causing I/O, network traffic, memory traffic and cache traffic. Further, operations on strings tend to be significantly more expensive than operations on e.g. integers, which CPUs do support quite efficiently (let alone GPUs and TPUs, which do not even acknowledge the existence of string data).

Mar 12, 2021

OtterTune: An Automatic Database Configuration Tuning Service

Andy Pavlo (Carnegie Mellon University)

Database management systems (DBMS) expose dozens of configurable knobs that control their runtime behavior. Setting these knobs correctly for an application’s workload can improve the performance and efficiency of the DBMS. But such tuning requires considerable effort from experienced administrators, which is not scalable for large DBMS fleets. This problem has led to research on using machine learning (ML) to devise strategies to optimize DBMS knobs for any application automatically.

Feb 26, 2021

Integrating Columnar Techniques and Factorization into GraphflowDB

Semih Salihoglu (University of Waterloo)

Graph database management systems (GDBMSs) in contemporary jargon refer to systems that adopt the property graph model and often power applications such as fraud detection and recommendations that require very fast joins of records that represent many-to-many relationships, often beyond the performance that existing relational systems generally provide. In this talk, I will give an overview of GraphflowDB, which is an in-memory GDBMS we are developing at the University of Waterloo.

Feb 26, 2021

1000 days of DuckDB - The Pareto Principle still holds

Hannes Mühleisen (CWI)

It has been almost 1000 days since we started work on DuckDB. In this talk, we reflect on the process and revisit the initial goals. We also describe major additions to the system such as automatic query parallelization.