The Dutch Seminar
on Data Systems Design

An initiative to bring together research groups working on data systems in Dutch universities and research institutes.

Fridays 4–5 pm
monthly

We hold monthly talks on Fridays from 3:30 PM to 5 PM CET for and by researchers and practitioners designing (and implementing) data systems. The objective is to establish a new forum for the Dutch Data Systems community to come together, foster collaborations between its members, and bring in high-quality international speakers. We invite all researchers working on related topics, especially PhD students, to join the events. It is an excellent opportunity to receive early feedback from researchers in your field.

Upcoming talks

March 14th, 2025 from 16:00 to 17:00 (Europe/Amsterdam, CET)

32nd Edition Seminar

The 32nd Edition seminar of DSDSD will feature talks by
Marcel Weisgut (HPI)
Nils Strassenburg (HPI)

Mar 14, 2025

Evaluating CXL Memory Performance

Marcel Weisgut (HPI)

The Compute Express Link (CXL) standard enables new forms of memory management and access across devices and servers. Based on PCIe, it enables cache-coherent access to remote memory. This widens the design space for database systems by expanding the available memory beyond memory local to the CPU. Efficiently utilizing CXL-attached memory requires data systems to make conscious decisions about data placement and management. In this work, we provide an in-depth analysis of database operation performance with data interleaved across multiple CXL memory devices. We evaluate memory access performance for basic access patterns and the performance impact of placing data across multiple CXL memory devices for in-memory column scans and in-memory B+tree operations.

Marcel Weisgut is a PhD student at the Hasso Plattner Institute, specializing in data management utilizing modern hardware in the Data Engineering Systems group led by Tilmann Rabl. He received his master’s degree from HPI in 2021, focusing on in-memory data management in Hasso Plattner’s research group. During his master’s studies, he contributed to the columnar open-source in-memory database system Hyrise and interned with the SAP HANA development team at SAP Labs Korea. His current research focuses on utilizing memory attached to a CPU via the cache-coherent interconnect Compute Express Link (CXL) for database systems.

Mar 14, 2025

Alsatian - Optimizing Model Search for Deep Transfer Learning

Nils Strassenburg (HPI)

Transfer learning is an effective technique for tuning a deep learning model when training data or computational resources are limited. Instead of training a new model from scratch, the parameters of an existing “base model” are adjusted for a new task. The accuracy of such a fine-tuned model depends on choosing an appropriate base model. Model search automates the selection of such a base model by evaluating the suitability of candidate models for a specific task. This entails inference with each candidate model on task-specific data. With thousands of models available through model stores, the computational cost of model search is a major bottleneck for efficient transfer learning. In this work, we present Alsatian, a novel model search system. Based on the observation that many candidate models overlap to a significant extent and based on a careful bottleneck analysis, we propose optimization techniques that are applicable to many model search frameworks. These optimizations include: (i) splitting models into individual blocks that can be shared across models, (ii) caching of intermediate inference results and model blocks, and (iii) selecting a beneficial search order for models to maximize sharing of cached results. In our evaluation on state-of-the-art deep learning models from computer vision and natural language processing, we show that Alsatian outperforms baselines by up to ~14×.

Nils is a PhD student in the Database Group at the Hasso Plattner Institute (HPI) in Potsdam, under the supervision of Tilmann Rabl. His research focuses on ML systems, particularly ML model management and search. In addition to his research, he contributes to the lecture on big data systems, leads seminars on ML systems, and supervises master’s theses. Before starting his PhD, he earned a master’s degree in IT-Systems Engineering from HPI and a bachelor’s degree in Computer Science from the University of Hamburg. As part of his studies, he completed a six-month internship at SAP Labs France in Sophia Antipolis and spent a semester at ETH Zurich.

Past talks

Oct 04, 2024

Peaceful Sharing while Training Models

Pınar Tözün

Deep learning training is an expensive process that extensively uses GPUs. However, not all model training saturates the resources of a single GPU. This problem gets exacerbated with each new GPU generation offering more hardware resources. In this talk, we will first investigate methods to share GPU resources across model training jobs by collocating these jobs on the same GPU to improve hardware utilization. Then, we will explore work sharing opportunities in the data pipelines of model training, furthering the benefits of collocated training.

Pınar Tözün is an Associate Professor at IT University of Copenhagen. Before ITU, she was a research staff member at IBM Almaden Research Center. Prior to joining IBM, she received her PhD from EPFL. Her thesis received an ACM SIGMOD Jim Gray Doctoral Dissertation Award Honorable Mention in 2016. Her research focuses on resource-aware machine learning, performance characterization of data-intensive systems, and scalability and efficiency of data-intensive systems on modern hardware.

Oct 04, 2024

Data Processing on heterogeneous hardware

Gustavo Alonso (ETH Zurich)

Computing platforms are evolving rapidly along many dimensions: processors, specialization, disaggregation, acceleration, smart memory and storage, etc. Many of these developments are being driven by data science but also arise from the need to make cloud computing more efficient. From a practical perspective, the result we see today is a deluge of possible configurations and deployment options, most of them too new to have a precise idea of their performance implications and lacking proper support in the form of tools and platforms that can manage the underlying diversity. The growing heterogeneity is opening up many opportunities but also raising significant challenges. In the talk I will describe the trend towards specialization at all layers of the architecture, the possibilities it opens up, and demonstrate with real examples how to take advantage of heterogeneous computing platforms. I will also discuss a system we are building for data processing that considers heterogeneity on both the software and the hardware side.

Gustavo Alonso is a professor in the Department of Computer Science of ETH Zurich, where he is a member of the Systems Group (www.systems.ethz.ch) and the head of the Institute of Computing Platforms. He leads the AMD HACC (Heterogeneous Accelerated Compute Cluster) deployment at ETH (https://github.com/fpgasystems/hacc), a research facility with several hundred users worldwide that supports exploring data center hardware-software co-design. His research interests include data management, cloud computing architecture, and building systems on modern hardware. Gustavo holds degrees in telecommunication from the Madrid Technical University and an MS and PhD in Computer Science from UC Santa Barbara. Before joining ETH, he was a research scientist at IBM Almaden in San Jose, California. Gustavo has received 4 Test-of-Time Awards for his research in databases, software runtimes, middleware, and mobile computing. He is an ACM Fellow, an IEEE Fellow, a Distinguished Alumnus of the Department of Computer Science of UC Santa Barbara, and has received the Lifetime Achievement Award from the European Chapter of ACM SIGOPS (EuroSys).

Mar 22, 2024

Efficient CSV Parsing - On the Complexity of Simple Things

Pedro Holanda

In this talk, we will revisit different CSV parsing implementations in DuckDB and compare them with the current implementation. The bulk of the talk discusses the design and implementation decisions in DuckDB’s current CSV parser. In particular, we will examine the parallel algorithm, the CSV buffer manager, and the transitions of the CSV state machine. Disclaimer: This talk is not for the faint of heart; some very exotically built CSV files will be depicted.

Pedro is an early contributor to DuckDB and currently works as a software engineer at DuckDB Labs, focusing on core and integration aspects of DBMS technology. He completed his PhD at the Database Architectures group at CWI, researching Indexes for Interactive Data Analysis.
