The Dutch Seminar
on Data Systems Design

An initiative to bring together research groups working on data systems in Dutch universities and research institutes.

Fridays 3:30–5 pm
monthly

We hold monthly talks on Fridays from 3:30 PM to 5 PM CET for and by researchers and practitioners designing (and implementing) data systems. The objective is to establish a new forum for the Dutch Data Systems community to come together, foster collaborations between its members, and bring in high-quality international speakers. We invite all researchers working on related topics, especially PhD students, to join the events. It is an excellent opportunity to receive early feedback from researchers in your field.

Upcoming talks

April 5th, 2024 from 4:00 PM to 5:00 PM (Europe/Amsterdam / CET)

30th Edition Seminar

The 30th Edition seminar of DSDSD will feature a talk by Pedro Holanda.

Mar 22, 2024

Efficient CSV Parsing - On the Complexity of Simple Things

Pedro Holanda

In this talk, we will revisit different CSV parsing implementations in DuckDB and compare them with the current implementation. The bulk of the talk discusses the design and implementation decisions in DuckDB’s current CSV parser. In particular, we will examine the parallel algorithm, the CSV buffer manager, and the transitions of the CSV state machine. Disclaimer: This talk is not for the faint of heart; some very exotically built CSV files will be depicted.
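To give a flavor of what a CSV state machine looks like, here is a minimal sketch in Python. It is an illustration only and not DuckDB's parser: the state names, the `parse_csv` function, and the lenient handling of malformed quotes are all assumptions made for this example, and it ignores `\r\n` line endings, buffering, and parallelism.

```python
from enum import Enum, auto


class State(Enum):
    FIELD_START = auto()      # at the beginning of a field
    UNQUOTED = auto()         # inside an unquoted field
    QUOTED = auto()           # inside a quoted field
    QUOTE_IN_QUOTED = auto()  # just saw a quote while inside a quoted field


def parse_csv(text, delimiter=",", quote='"'):
    """Parse CSV text into a list of rows (hypothetical helper, not a DuckDB API)."""
    rows, row, field = [], [], []
    state = State.FIELD_START
    for ch in text:
        if state is State.QUOTED:
            # Delimiters and newlines are plain data inside quotes.
            if ch == quote:
                state = State.QUOTE_IN_QUOTED
            else:
                field.append(ch)
            continue
        if state is State.QUOTE_IN_QUOTED and ch == quote:
            # A doubled quote is an escaped literal quote.
            field.append(quote)
            state = State.QUOTED
            continue
        # Here the field is unquoted, just started, or its quotes were closed.
        if ch == delimiter:
            row.append("".join(field))
            field = []
            state = State.FIELD_START
        elif ch == "\n":
            row.append("".join(field))
            rows.append(row)
            row, field = [], []
            state = State.FIELD_START
        elif ch == quote and state is State.FIELD_START:
            state = State.QUOTED
        else:
            # Lenient choice: stray characters after a closing quote are kept.
            field.append(ch)
            state = State.UNQUOTED
    # Flush a final record that is not terminated by a newline.
    if state is not State.FIELD_START or field or row:
        row.append("".join(field))
        rows.append(row)
    return rows


print(parse_csv('a,"b,c"\n"d ""e""",f'))
```

Even this small version needs a dedicated state for "quote seen inside a quoted field", because only the next character decides whether the quote closes the field or escapes a literal quote.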

Pedro is an early contributor to DuckDB and currently works as a software engineer at DuckDB Labs, focusing on core and integration aspects of DBMS technology. He completed his PhD at the Database Architectures group at CWI, researching Indexes for Interactive Data Analysis.

Past talks

Feb 23, 2024

Towards LLM-augmented Database Systems

Carsten Binnig

Recent LLMs such as GPT-4-turbo can answer user queries over multi-modal data including tables and thus seem able to replace the role of databases in decision-making in the future. However, LLMs have severe limitations: query answering with LLMs not only suffers from problems such as hallucinations but also causes high performance overheads, even for small data sets. In this talk, I suggest a different direction where we use database technology as a starting point and extend it with LLMs where needed for answering user queries over multi-modal data. This not only allows us to tackle problems such as the performance overheads of pure LLM-based approaches for multi-modal question-answering but also opens up other opportunities for database systems.

Carsten Binnig is a Full Professor in the Computer Science department at TU Darmstadt and a Visiting Researcher at the Google Systems Research Group. Carsten received his Ph.D. at the University of Heidelberg in 2008. Afterwards, he spent time as a postdoctoral researcher in the Systems Group at ETH Zurich and at SAP working on in-memory databases. Currently, his research focus is on the design of scalable data systems on modern hardware as well as machine learning for scalable data systems. His work has been awarded a Google Faculty Award, as well as multiple best paper and best demo awards.

Nov 24, 2023

C3 - Compressing Correlated Columns

Thomas Glas

Open file formats typically use a set of lightweight compression schemes to compress individual columns, taking advantage of data patterns found within the values of each column. However, by compressing columns separately, we ignore correlations between columns that could allow us to compress more effectively. Real-world datasets exhibit many such column correlations, and we research how they can be exploited for compression. In this talk, we introduce C3 (Compressing Correlated Columns), a new compression framework which can exploit correlations between columns for compression. We designed C3 on top of typical lightweight compression infrastructure, and added six new multi-column compression schemes which exploit correlations. We designed our multi-column compression schemes based on correlations we found in real-world datasets, but new compression schemes exploiting other types of correlations can easily be added. C3 uses a sampling-based algorithm to choose the most effective scheme to compress each column. We evaluated the effectiveness of C3 on the Public BI benchmark, containing real-world datasets, and achieved around 20% higher compression ratios compared to using only typical single-column compression schemes.
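As a rough illustration of how a correlation between two columns can be exploited, consider the sketch below. It is a hypothetical scheme written for this example and is not claimed to be one of C3's six schemes: the target column is stored as a mapping from each source value to its most frequent target value, plus a list of exceptions for rows that deviate. The function names `encode_correlated` and `decode_correlated` are assumptions.

```python
from collections import Counter, defaultdict


def encode_correlated(source, target):
    """Encode `target` relative to `source` (hypothetical scheme for illustration).

    Returns a mapping from each source value to its most frequent target value,
    and a dict of row index -> actual value for rows that break the mapping.
    """
    counts = defaultdict(Counter)
    for s, t in zip(source, target):
        counts[s][t] += 1
    mapping = {s: c.most_common(1)[0][0] for s, c in counts.items()}
    exceptions = {
        i: t for i, (s, t) in enumerate(zip(source, target)) if mapping[s] != t
    }
    return mapping, exceptions


def decode_correlated(source, mapping, exceptions):
    """Rebuild the target column from the source column, mapping, and exceptions."""
    return [exceptions.get(i, mapping[s]) for i, s in enumerate(source)]


# Made-up example data: the region is almost fully determined by the city.
city = ["Amsterdam", "Utrecht", "Amsterdam", "Utrecht", "Amsterdam"]
region = ["NH", "UT", "NH", "UT", "ZH"]

mapping, exceptions = encode_correlated(city, region)
assert decode_correlated(city, mapping, exceptions) == region
print(mapping, exceptions)
```

When the correlation is strong, the mapping and the exception list are much smaller than the target column itself; when it is weak, the exception list grows and a single-column scheme is the better choice, which is the kind of trade-off a scheme-selection step has to weigh.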

Thomas Glas is pursuing his master’s degree in computer science at the Technical University of Munich. He joined the Database Architecture Group at CWI in May this year to write his master’s thesis on columnar data compression.

Nov 24, 2023

Lambda functions in the duck's nest

Tania Bogatsch

Many SQL databases do not focus on efficient LIST-type support. Scalar functions and aggregations on LIST values often require additional unnesting steps or loading normalized data. However, nested input formats such as JSON are widespread in analytics. Efficient operations directly on these input formats can leverage the potential of SQL engines while increasing the system’s ease of use. However, using this potential is not trivial, as the LIST type’s underlying storage format and operations have to synergize with the relational execution model. DuckDB is a high-performance relational database system for analytics. In this talk, I’ll showcase DuckDB’s internal design choices to support LISTs efficiently and highlight our support of Python-style list comprehension directly in the SQL dialect.
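To illustrate why the storage format matters for LIST operations, the sketch below uses a common columnar layout: one flat child array holding all elements, plus an offsets array marking where each row's list begins and ends. This is an assumption made for illustration and may differ from DuckDB's actual internals; the helper names are hypothetical. With this layout, a lambda can be applied in one pass over the flat child array without unnesting rows.

```python
def flatten_lists(rows):
    """Store a column of lists as (offsets, child): row i is child[offsets[i]:offsets[i + 1]]."""
    offsets, child = [0], []
    for values in rows:
        child.extend(values)
        offsets.append(len(child))
    return offsets, child


def list_transform(offsets, child, fn):
    """Apply `fn` to every list element; list boundaries (offsets) are unchanged."""
    return offsets, [fn(v) for v in child]


def to_rows(offsets, child):
    """Rebuild the per-row lists from the flat representation."""
    return [child[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


offsets, child = flatten_lists([[1, 2, 3], [], [4, 5]])
offsets, doubled = list_transform(offsets, child, lambda x: x * 2)
print(to_rows(offsets, doubled))
```

A filtering comprehension would additionally need to rewrite the offsets, since list lengths change; that case is left out here to keep the sketch short.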

I studied Computer Science from 2016 to 2022 in Ilmenau, Germany. After my Bachelor’s, I got the opportunity for a four-month internship at the CWI in Amsterdam, where I worked on adaptive expression reordering in DuckDB. In 2022, after finishing my studies, I returned to Amsterdam to work for DuckDB Labs as a software engineer.

