The Dutch Seminar
on Data Systems Design

An initiative to bring together research groups working on data systems in Dutch universities and research institutes.

Fridays 3:30–5 pm
monthly

We hold monthly talks on Fridays from 3:30 PM to 5 PM CET for and by researchers and practitioners designing (and implementing) data systems. The objective is to establish a new forum for the Dutch Data Systems community to come together, foster collaborations between its members, and bring in high-quality international speakers. We would like to invite all researchers, especially PhD students, who are working on related topics to join the events. It is an excellent opportunity to receive early feedback from researchers in your field.

Upcoming talks

October 4, 2024, from 10:30 AM to 11:30 AM (Europe/Amsterdam)

31st Edition Seminar

The 31st Edition seminar of DSDSD will feature talks by
Gustavo Alonso (ETH Zurich) and
Pınar Tözün (IT University of Copenhagen).

Oct 04, 2024

Data Processing on Heterogeneous Hardware

Gustavo Alonso (ETH Zurich)

Computing platforms are evolving rapidly along many dimensions: processors, specialization, disaggregation, acceleration, smart memory and storage, etc. Many of these developments are driven by data science but also arise from the need to make cloud computing more efficient. From a practical perspective, the result we see today is a deluge of possible configurations and deployment options, most of them too new for us to have a precise idea of their performance implications, and lacking proper support in the form of tools and platforms that can manage the underlying diversity. The growing heterogeneity is opening up many opportunities but also raising significant challenges. In this talk, I will describe the trend towards specialization at all layers of the architecture and the possibilities it opens up, and demonstrate with real examples how to take advantage of heterogeneous computing platforms. I will also discuss a system we are building for data processing that considers heterogeneity on both the software and the hardware side.

Gustavo Alonso is a professor in the Department of Computer Science of ETH Zurich, where he is a member of the Systems Group (www.systems.ethz.ch) and the head of the Institute of Computing Platforms. He leads the AMD HACC (Heterogeneous Accelerated Compute Cluster) deployment at ETH (https://github.com/fpgasystems/hacc), a research facility with several hundred users worldwide that supports exploring data center hardware-software co-design. His research interests include data management, cloud computing architecture, and building systems on modern hardware. Gustavo holds degrees in telecommunication from the Madrid Technical University and an MS and a PhD in Computer Science from UC Santa Barbara. Prior to joining ETH, he was a research scientist at IBM Almaden in San Jose, California. Gustavo has received 4 Test-of-Time Awards for his research in databases, software runtimes, middleware, and mobile computing. He is an ACM Fellow, an IEEE Fellow, a Distinguished Alumnus of the Department of Computer Science of UC Santa Barbara, and has received the Lifetime Achievement Award from the European Chapter of ACM SIGOPS (EuroSys).

Oct 04, 2024

Peaceful Sharing while Training Models

Pınar Tözün (IT University of Copenhagen)

Deep learning training is an expensive process that extensively uses GPUs. However, not all model training saturates the resources of a single GPU. This problem gets exacerbated with each new GPU generation offering more hardware resources. In this talk, we will first investigate methods to share GPU resources across model training jobs by collocating these jobs on the same GPU to improve hardware utilization. Then, we will explore work sharing opportunities in the data pipelines of model training, furthering the benefits of collocated training.

Pınar Tözün is an Associate Professor at IT University of Copenhagen. Before ITU, she was a research staff member at IBM Almaden Research Center. Prior to joining IBM, she received her PhD from EPFL. Her thesis received ACM SIGMOD Jim Gray Doctoral Dissertation Award Honorable Mention in 2016. Her research focuses on resource-aware machine learning, performance characterization of data-intensive systems, and scalability and efficiency of data-intensive systems on modern hardware.

Past talks

Mar 22, 2024

Efficient CSV Parsing - On the Complexity of Simple Things

Pedro Holanda

In this talk, we will revisit different CSV parsing implementations in DuckDB and compare them with the current implementation. The bulk of the talk discusses the design and implementation decisions behind DuckDB's current CSV parser. In particular, we will examine the parallel algorithm, the CSV buffer manager, and the transitions of the CSV state machine. Disclaimer: This talk is not for the faint of heart; some very exotically built CSV files will be depicted.
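To give a rough idea of the state-machine approach the abstract mentions, here is a minimal single-threaded sketch in Python. This is an illustration only, not DuckDB's actual parser: the state names and function are hypothetical, there is no parallelism or buffer manager, and it assumes `\n` line endings.

```python
from enum import Enum, auto

class State(Enum):
    FIELD = auto()            # reading an unquoted field
    QUOTED = auto()           # inside a quoted field
    QUOTE_IN_QUOTED = auto()  # just saw a quote inside a quoted field

def parse_csv(text, delimiter=",", quote='"'):
    """Parse CSV text one character at a time with a small state machine."""
    rows, row, field = [], [], []
    state = State.FIELD
    for ch in text:
        if state == State.FIELD:
            if ch == quote and not field:
                state = State.QUOTED      # field starts quoted
            elif ch == delimiter:
                row.append("".join(field)); field = []
            elif ch == "\n":
                row.append("".join(field)); field = []
                rows.append(row); row = []
            else:
                field.append(ch)
        elif state == State.QUOTED:
            if ch == quote:
                state = State.QUOTE_IN_QUOTED
            else:
                field.append(ch)          # delimiters/newlines are literal here
        else:  # State.QUOTE_IN_QUOTED
            if ch == quote:               # doubled quote "" -> literal quote
                field.append(quote); state = State.QUOTED
            elif ch == delimiter:
                row.append("".join(field)); field = []
                state = State.FIELD
            elif ch == "\n":
                row.append("".join(field)); field = []
                rows.append(row); row = []
                state = State.FIELD
    if field or row:                      # flush a final line without newline
        row.append("".join(field))
        rows.append(row)
    return rows
```

The interesting complexity that the talk alludes to comes precisely from what this sketch ignores: splitting the input so multiple threads can parse chunks in parallel requires guessing a valid starting state mid-file, which is hard when quotes can span record boundaries.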

Pedro is an early contributor to DuckDB and currently works as a software engineer at DuckDB Labs, focusing on core and integration aspects of DBMS technology. He completed his PhD at the Database Architectures group at CWI, researching Indexes for Interactive Data Analysis.

Feb 23, 2024

Towards LLM-augmented Database Systems

Carsten Binnig

Recent LLMs such as GPT-4-turbo can answer user queries over multi-modal data including tables, and thus seem able to even replace the role of databases in decision-making in the future. However, LLMs have severe limitations: query answering with LLMs not only suffers from problems such as hallucinations but also incurs high performance overheads even for small data sets. In this talk, I suggest a different direction where we use database technology as a starting point and extend it with LLMs where needed for answering user queries over multi-modal data. This not only allows us to tackle problems such as the performance overheads of pure LLM-based approaches to multi-modal question answering but also opens up other opportunities for database systems.

Carsten Binnig is a Full Professor in the Computer Science department at TU Darmstadt and a Visiting Researcher at the Google Systems Research Group. Carsten received his Ph.D. at the University of Heidelberg in 2008. Afterwards, he spent time as a postdoctoral researcher in the Systems Group at ETH Zurich and at SAP working on in-memory databases. Currently, his research focus is on the design of scalable data systems on modern hardware as well as machine learning for scalable data systems. His work has been awarded a Google Faculty Award, as well as multiple best paper and best demo awards.

Nov 24, 2023

C3 - Compressing Correlated Columns

Thomas Glas

Open file formats typically use a set of lightweight compression schemes to compress individual columns, taking advantage of data patterns found within the values of each column. However, by compressing columns separately, we do not consider correlations that may exist between columns, which could allow us to compress more effectively. Real-world datasets exhibit many such column correlations, and we investigate how they can be exploited for compression. In this talk, we introduce C3 (Compressing Correlated Columns), a new compression framework that can exploit correlations between columns. We designed C3 on top of typical lightweight compression infrastructure and added six new multi-column compression schemes that exploit correlations. We designed our multi-column compression schemes based on correlations we found in real-world datasets, but new schemes exploiting other types of correlations can easily be added. C3 uses a sampling-based algorithm to choose the most effective scheme for each column. We evaluated the effectiveness of C3 on the Public BI benchmark, which contains real-world datasets, and achieved around 20% higher compression ratios compared to using only typical single-column compression schemes.
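To illustrate the general idea of a multi-column scheme (a toy example of my own, not one of C3's actual six schemes): when two columns are highly correlated and often equal, the second column can be stored as a vector of equality flags against the first plus a small exception list, instead of being compressed independently.

```python
def compress_correlated(primary, secondary):
    """Toy two-column scheme: encode `secondary` relative to `primary` as
    per-row equality flags plus an exception list for the rows that differ.
    Compresses well only when the columns are mostly equal."""
    flags = []       # one boolean per row: secondary[i] == primary[i]?
    exceptions = []  # (row_index, value) for the mismatching rows
    for i, (p, s) in enumerate(zip(primary, secondary)):
        if p == s:
            flags.append(True)
        else:
            flags.append(False)
            exceptions.append((i, s))
    return flags, exceptions

def decompress_correlated(primary, flags, exceptions):
    """Reconstruct the secondary column from the primary plus encoding."""
    out = list(primary)
    exc = dict(exceptions)
    for i, same in enumerate(flags):
        if not same:
            out[i] = exc[i]
    return out
```

In a real framework the flag vector would itself be bit-packed or run-length encoded, and a sampling step (as the abstract describes for C3) would decide per column whether such a correlated scheme beats the best single-column scheme.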

Thomas Glas is pursuing his master's degree in computer science at the Technical University of Munich. He joined the Database Architectures group at CWI in May 2023 to write his master's thesis on columnar data compression.

