Past talks

May 13, 2022

Algorithms for Relational Knowledge Graphs

Martin Bravenboer

RelationalAI is the next-generation database system for new intelligent data applications based on relational knowledge graphs. RelationalAI complements the modern data stack by allowing data applications to be implemented relationally and declaratively, leveraging knowledge/semantics for reasoning, graph analytics, relational machine learning, and mathematical optimization workloads. RelationalAI as a relational and cloud native system fits naturally in the modern data stack, providing virtually infinite compute and storage capacity, versioning, and a fully managed system. RelationalAI supports the workload of data applications with an expressive relational language (called Rel), novel join algorithms and JIT compilation suitable for complex computational workloads, semantic optimization that leverages knowledge to optimize application logic, and incrementality of the entire system for both data (IVM) and code (live programming). The system utilizes immutable data structures, versioning, parallelism, distribution, out-of-core memory management to support state-of-the-art workload isolation and scalability for simple as well as complex business logic. In our experience, RelationalAI’s expressive, relational, and declarative language leads to a 10-100x reduction in code for complex business domains. Applications are developed faster, with superior quality by bringing non-technical domain experts into the process and by automating away complex programming tasks. We discuss the core innovations that underpin the RelationalAI system: an expressive relational language, worst-case optimal join algorithms, semantic optimization, just-in-time compilation, schema discovery and evolution, incrementality and immutability.

Martin Bravenboer is VP Engineering at RelationalAI where he leads the development of the RelationalAI system. Before RelationalAI, he was CTO at LogicBlox. As a postdoctoral researcher with Prof. Yannis Smaragdakis, he developed the Doop framework for declarative and precise points-to analysis that uses the LogicBlox system. Martin obtained his PhD at Utrecht University in the area of language design and compiler construction.

read more
May 13, 2022

The LDBC Social Network Benchmark - Business Intelligence workload

Gábor Szárnyas

Graph data management techniques are employed in several domains such as finance and enterprise knowledge representation for evaluating graph pattern matching and path finding queries on large data sets. Supporting such queries efficiently yields a number of unique requirements, including the need for a concise query language and graph-aware query optimization techniques. The goal of the Linked Data Benchmark Council (LDBC) is to design standard benchmarks which capture representative categories of graph data management problems, making the performance of systems comparable and facilitating competition among vendors. This talk describes the Business Intelligence workload, a graph OLAP benchmark with global graph queries that use pattern matching, path finding, and aggregation operations. The workload is executed on a dynamic social network graph updated in daily batches of inserts and deletes. We discuss the design process of the benchmark and present its first stable version.

Gábor Szárnyas is a post-doctoral researcher. He obtained his PhD in software engineering in 2019, focusing on the intersection of object-oriented graph models and property graphs. He currently works on efficient graph processing techniques, including formulating graph algorithms in the language of linear algebra (GraphBLAS), implementing graph query engines (SQL/PGQ), and designing graph benchmarks. He serves on the steering committee of the Linked Data Benchmark Council.

Slides

read more
Apr 29, 2022

Glidesort - Efficient In-Memory Adaptive Stable Sorting on Modern Hardware

Orson Peters

Sorting is one of the most common algorithms used in programming, and virtually every standard library contains a routine for it. Despite also being one of the oldest problems out there, surprisingly large improvements are still being found. Some of these are fundamental novelties, and others are optimizations matching the changing performance landscape in modern hardware.

In this talk we present Glidesort, a general purpose in-memory stable comparison sort. It is fully adaptive to both pre-sorted runs in the data similar to Timsort, and low-cardinality inputs similar to Pattern-defeating Quicksort, making it to our knowledge the first practical stable sorting algorithm fully adaptive in both measures. Glidesort achieves a 3x speedup over a Rust’s standard library Timsort routine on sorting random 32-bit integers, with the speedup breaking the order of magnitude barrier for realistic low-cardinality distributions. It achieves this without the use of SIMD, processor-specific intrinsics or assumptions about the type being sorted: it is a fully generic sort taking an arbitrary comparison operator.

Using Glidesort as the motivating example we discuss the principles of efficient stable in-memory partitioning and merging on modern hardware. In particular attention is paid to eliminating branches and interleaving independent parallel loops to efficiently use our modern deeply-pipelined superscalar processors. The lessons learned here are widely applicable to efficient data processing outside of sorting.

Orson Peters is a first-year PhD student at the Database Architecture group at CWI Amsterdam. His research interests are very broad, and span low-level optimization, compression, information theory, cryptography, (parallel) data structures, string processing and more. In particular sorting is an interest, having published pdqsort in 2015 which is now the default unstable sorting algorithm in Rust and Go. His alma mater is Leiden University, where he did his BSc and MSc in Computer Science, specializing in Artificial Intelligence.

read more
Apr 29, 2022

Taking a Peek under the Hood of Snowflake's Metadata Management

Max Heimel

This talk provides an overview of Snowflake’s architecture that was designed to efficiently support complex analytical workloads in the cloud. Looking at the lifecycle of micro partitions, this talk explains pruning, zero-copy cloning, and instant time travel. Pruning is a technique to speed up query processing by filtering out unnecessary micro partitions during query compilation. Zero-copy cloning allows the creation of logical copies of the data without duplicating physical storage. Instant time travel enables the user to query data “as of” a time in the past, even if the current state of the data has changed. We also describe how we utilize cloud resources to automatically reorganize (“cluster”) micro partitions in the background in order to achieve consistent query performance without affecting running customer queries.

Max Heimel holds a PhD in Computer Science from the Database and Information Management Group at TU Berlin. He joined Snowflake in 2015 and is working as a Software Engineer in the areas of query execution and query optimization. Before joining Snowflake, Max worked at IBM and spent several internships at Google.

read more
Mar 18, 2022

Parallel Grouped Aggregation in DuckDB

Hannes Mühleisen

Grouped aggregations are a core data analysis command. It is particularly important for large-scale data analysis (“OLAP”) because it is useful for computing statistical summaries of huge tables. The main issue when computing grouping results is that the groups can occur in the input table in any order, making it difficult to efficiently match grouping keys. Of course, the input can be sorted, but this is computationally expensive. Building a hash table has a lower computational complexity than sorting and is therefore generally preferred, but this requires collision handling. How does parallelism work together with hash tables? In general, the answer is unfortunately: “Badly”. Hash tables are delicate structures that do not handle parallel modifications well. In this talk we present DuckDB’s highly optimized parallel aggregation capability for fast and scalable summarization. We discuss many of the optimizations in DuckDB’s hash aggregate implementation that allow it to efficiently scale to many groups, rows and threads. Finally, we discuss some ideas for future work.

Hannes Mühleisen is a senior researcher at the Database Architectures group within the Centrum Wiskunde & Informatica (CWI). He is also the co-founder and CEO of DuckDB Labs. He received his PhD at the Freie Universität Berlin in 2013.

read more
Mar 18, 2022

Learned DBMS Components 2.0 - From Workload-Driven to Zero-Shot Learning

Carsten Binnig

Database management systems (DBMSs) are the backbone for managing large volumes of data efficiently and thus play a central role in business and science today. For providing high performance, many of the most complex DBMS components such as query optimizers or schedulers involve solving non-trivial problems. To tackle such problems, very recent work has outlined a new direction of so-called learned DBMS components where core parts of DBMSs are being replaced by machine learning (ML) models which has shown to provide significant performance benefits. However, a major drawback of the current workload-driven learning approaches to enable learned DBMS components is that they not only cause very high overhead for training an ML model to replace a DBMS component but that the overhead occurs repeatedly which renders these approaches far from practical. Hence, in this talk we present our vision to tackle the high costs and inflexibility of workload-driven learning called Learned DBMS Components 2.0. First, we introduce data-driven learning where the idea is to learn the data distribution over a complex relational schema. In contrast to workload-driven learning, no large workload has to be executed on the database to gather training data. While data-driven learning has many applications such as cardinality estimation or approximate query processing, many DBMS tasks such as physical cost estimation cannot be supported. We thus propose a second technique called zero-shot learning which is a general paradigm for learned DBMS components. Here, the idea is to train models that generalize to unseen data sets out-of-the-box. The idea is to train a model that has observed a variety of workloads on different data sets and can thus generalize. Initial results on the task of physical cost estimation suggest the feasibility of this approach. Finally, we discuss further opportunities which are enabled by zero-shot learning.

Carsten Binnig is a Full Professor in the Computer Science department at at TU Darmstadt and an Adjunct Associate Professor in the Computer Science department at Brown University. Carsten received his PhD at the University of Heidelberg in 2008. Afterwards, he spent time as a postdoctoral researcher in the Systems Group at ETH Zurich and at SAP working on in-memory databases. Currently, his research focus is on the design of scalable data management systems, databases and modern hardware as well as machine learning for scalable systems. His work has been awarded with a Google Faculty Award, as well as multiple best paper and best demo awards for his research.

read more
Mar 04, 2022

Efficient collaborative analytics with no information leakage - An idea whose time has come.

Vasiliki Kalavri

Enabling secure outsourced analytics with practical performance has been a long-standing research challenge in the database community. In this talk, I will present our work towards realizing this vision with Secrecy, a new framework for secure relational analytics in untrusted clouds. Secrecy targets offline collaborative analytics, where data owners (hospitals, companies, research institutions, or individuals) are willing to allow certain computations on their collective private data, provided that data remain siloed from untrusted entities. To ensure no information leakage and provable security guarantees, Secrecy relies on cryptographically secure Multi-Party Computation (MPC). Instead of treating MPC as a black box, like prior works, Secrecy exposes the costs of oblivious queries to the planner and employs novel logical, physical, and protocol-specific optimizations, all of which are applicable even when data owners do not participate in the computation. As a result, Secrecy outperforms state-of-the-art systems and can comfortably process much larger datasets with good performance and modest use of resources.

Vasiliki (Vasia) Kalavri is an Assistant Professor of Computer Science at Boston University, where she leads the Complex Analytics and Scalable Processing (CASP) Systems lab. Vasia and her team enjoy doing research on multiple aspects of (distributed) data-centric systems. Recently, they have been working on self-managed systems for data stream processing, systems for scalable graph ML, and MPC systems for private collaborative analytics. Before joining BU, Vasia was a postdoctoral fellow at ETH Zurich and received a joint PhD from KTH (Sweden) and UCLouvain (Belgium).

read more
Mar 04, 2022

Opening the Black Box of Internal Stream Processor State

Jim Verheijde

Distributed streaming dataflow systems have evolved into scalable and fault-tolerant production-grade systems. Their applicability has departed from the mere analysis of stream- ing windows and complex-event processing, and now includes cloud applications and machine learning inference. Although the advancements in the state management of streaming systems have contributed significantly to their maturity, the internal state of streaming operators has been so far hidden from external applications. However, that internal state can be seen as a materialized view that can be used for analytics, monitoring, and debugging.

In this work we argue that exposing the internal state of streaming systems to outside applications by making it queryable, opens the road for novel use cases. To this end, In this talk I will introduce S-QUERY: an approach and reference architecture where the state of stream processors can be queried - either live or through snapshots, achieving different isolation levels. I will show how this new capability can be implemented in an existing open-source stream processor, and how queryable state can affect the performance of such a system.

Jim Verheijde recently joined IMC as a software engineer. Prior to this he received his master degree in computer science and bachelor degree in computer science engineering from Delft University of Technology. Jim also completed a minor at the National University of Singapore and joined the summer school program at Tsinghua university in China. During his studies In Delft, Jim took part in Forze for multiple years where he helped design and build the software for the world’s fastest hydrogen race car.

read more
Dec 17, 2021

OneGraph to Rule them All

Michael Schmidt (Amazon Web Services)

At Amazon Neptune, we work backwards from our customers. One insight that we got from listening to customers is that, in many cases where they explore Neptune as a solution to their problems, it’s primarily “just about graph”: they want to use the relationships in their data to solve business problems using knowledge graphs, identity graphs, fraud graphs, and more.

read more
Nov 26, 2021

Building Advanced SQL Analytics From Low-Level Plan Operators

Thomas Neumann (Technical University of Munich)

Analytical queries virtually always involve aggregation and statistics. SQL offers a wide range of functionalities to summarize data such as associative aggregates, distinct aggregates, ordered-set aggregates, grouping sets, and window functions. In this work, we propose a unified framework for advanced statistics that composes all flavors of complex SQL aggregates from low-level plan operators.

read more
Nov 12, 2021

Fastest table sort in the West - Redesigning DuckDB’s sort

Laurens Kuiper (CWI)

Sorting is one of the most well studied problems in Computer Science. Research in this area forms the basis of sorting in database systems but focuses mostly on sorting large arrays. Sorting is more complex for relational data as many different types need to be supported, as well as NULL values. There can also be multiple order clauses.

read more
Nov 12, 2021

CrocodileDB : Resource Efficient Database Execution

Aaron J. Elmore (University of Chicago)

The coming end of Moore’s law requires that data systems be more judicious with computation and resources as the growth in data outpaces the availability of computational resources. Current database systems are eager and aggressively consume resources to immediately and quickly complete the task at hand.

read more
Oct 15, 2021

Data Stations : Combining Data, Compute, and Market Forces

Raul Castro Fernandez (University of Chicago)

In this talk, I will present preliminary work on a new architecture (Data Station) to facilitate data sharing within and across organizations. Data Stations depart from modern data lakes in that both data and derived data products, such as machine learning models, are sealed and cannot be directly seen, accessed, or downloaded by anyone.

read more
Oct 01, 2021

Optimizing machine learning prediction queries and beyond on modern data engines

Konstantinos Karanasos (Microsoft's Gray Systems Lab - Azure Data's applied research group)

Prediction queries are widely used across industries to perform advanced analytics and draw insights from data. They include a data processing part (e.g., for joining, filtering, cleaning, featurizing the datasets) and a machine learning (ML) part invoking one or more trained models to perform predictions. These parts have so far been optimized in isolation, leaving significant opportunities for optimization unexplored.

read more
Oct 01, 2021

Optimisation of Inference Queries

Ziyu Li (TU Delft)

The wide adoption of machine learning (ML) in diverse application domains is resulting in an explosion of available models described by, and stored in model repositories. In application contexts where inference needs are dynamic and subject to strict execution constraints – such as in video processing – the manual selection of an optimal set of models from a large model repository is a nearly impossible task and practitioners typically settle for models with a good average accuracy and performance.

read more
Jul 16, 2021

Data-Intensive Systems in the Microsecond Era

Pinar Tozun (ITU Copenhagen)

Late 2000s and early 2010s have seen the rise of data-intensive systems optimized for in-memory execution. Today, it has been increasingly clear that just optimizing for main memory is neither economically viable nor strictly necessary for high performance. Modern SSDs, such as Z-NAND and Optane, can access data at a latency of around 10 microseconds.

read more
Jul 16, 2021

Charting the Design Space of Query Execution using VOILA

Tim Gubner (CWI)

atabase architecture, while having been studied for four decades now, has delivered only a few designs with well-understood properties. These few are followed by most actual systems. Acquiring more knowledge about the design space is a very time-consuming process that requires manually crafting prototypes with a low chance of generating material insight.

read more
Jul 02, 2021

DuckDQ: Data Quality Validation for Machine Learning Pipelines

Till Döhmen (RWTH Aachen University and Fraunhofer FIT)

Data quality validation plays an important role in ensuring the correct behaviour of productive machine learning (ML) applications and services. Observing a lack of existing solutions for quality control in medium-sized production ML systems, we developed DuckDQ: A lightweight and efficient Python library for protecting machine learning pipelines from data errors.

read more
Jun 18, 2021

Teseo and the Analysis of Structural Dynamic Graphs

Dean De Leo (CWI)

Teseo is a new system for the storage and analysis of dynamic structural graphs in main-memory, with the addition of transactional support. It introduces a novel design based on sparse arrays, large arrays interleaved with gaps, and a fat tree, where the graph is ultimately stored. Our design contrasts with early systems for the analysis of dynamic graphs, which often lack transactional support and are anchored to a vertex table as a primary index.

read more
Jun 18, 2021

MxTasks: How to Make Efficient Synchronization and Prefetching Easy

Jens Teubner (TU Dortmund)

The hardware environment has changed rapidly in recent years: Many cores, multiple sockets, and large amounts of main memory have become a commodity. To benefit from these highly parallel systems, the software has to be adapted. Sophisticated latch-free data structures and algorithms are often meant to address the situation.

read more
Jun 04, 2021

Evaluating Matching Techniques with Valentine

Christos Koutras (Delft University of Technology)

Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use.

read more
Jun 04, 2021

Making Distributed Deep Learning Adaptive

Peter Pietzuch (Imperial College London)

When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must configure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence; system parameters (e.g. the number of workers and their communication topology) impact training performance. Some of these parameters, such as the number of workers, may also change in elastic machine learning scenarios. In current systems, adapting such parameters during training is ill-supported.

read more
May 21, 2021

FASTER: Simplifying Storage for the Modern Edge-Cloud

Badrish Chandramouli (Microsoft Research)

Managing state efficiently in modern applications written for the cloud and edge is hard. In the FASTER project, we have been creating building blocks such as FasterKV and FasterLog to alleviate this problem using techniques such as epoch protection, tiered storage, and asynchronous recoverability.

read more
May 21, 2021

Distributed Transactions on Serverless Stateful Functions

Martijn De Heus (Delft University of Technology)

The cloud promised to make computing accessible by making compute resources easily available and by simplifying computing challenges. While the former has definitely been achieved, there is room for improvement to simplify (distributed) computing challenges. Deployment, scalability and state management are not straightforward to achieve using most cloud offerings. Especially modern microservice architectures face many challenges that distract developers from providing business value. Serverless computing models (like Function-as-a-Service) can greatly simplify deployment and scalability challenges, however the problem of state management remains unsolved by most cloud offerings.

read more
May 07, 2021

Materialize: SQL Incremental View Maintenance

Frank McSherry (Materialize, Inc.)

While OLTP engines excel at maintaining invariants under transactional workloads, and OLAP engines excel at ad-hoc analytics, the relational database is not presently an excellent tool for maintaining the results of computations as data change. This space is currently occupied largely by microservices, bespoke tools that suffer from all the problems you might expect of systems that do not provide many of the ACID properties, and which anecdotally consume engineering departments with their maintenance overhead.

read more
May 07, 2021

Clonos: Consistent Causal Recovery for Highly-Available Streaming Dataflows

Pedro Silvestre (Imperial College London)

Stream processing lies in the backbone of modern businesses, being employed for mission critical applications such as real-time fraud detection, car-trip fare calculations, traffic management, and stock trading. Large-scale applications are executed by scale-out stream processing systems on thousands of long-lived operators, which are subject to failures.

read more
Apr 23, 2021

TED Learn: Towards Technology-Enabled Database Education

Sourav Bhowmick (NTU Singapore)

There is continuous demand for database-literate professionals in today’s market due to widespread usage of relational database management system (RDBMS) in the commercial world. Such commercial demand has played a pivotal role in the offering of database systems course as part of an undergraduate computer science (CS) degree program in major universities around the world. Furthermore, not all working adults dealing with RDBMS have taken an undergraduate database course. Hence, they often need to undergo on-the-job training or attend relevant courses in higher institutes of learning to acquire database literacy. Database courses in major institutes rely on textbooks, lectures, and off-the-shelf RDBMS to impart relevant knowledge such as how SQL queries are processed.

read more
Apr 23, 2021

PG-Keys: Keys for Property Graphs

George Fletcher (TU/e)

I report on a community effort between industry and academia to shape the future of property graph constraints. The standardization for a property graph query language is currently underway through the ISO Graph Query Language (GQL) project. Our position is that this project should pay close attention to schemas and constraints, and should focus next on key constraints.

read more
Apr 09, 2021

Factorization Matters in Large Graphs

Nikolay Yakovets (TU/e)

Evaluation of complex graph pattern queries on large graphs often leads to “explosion” of intermediate results (IR) which, in turn, considerably slows down query processing. In this talk, I will present WireFrame, our recent two-step factorization-based solution which aims to drastically reduce the IR during query processing.

read more
Apr 09, 2021

Aggregation Support for Modern Graph Analytics

Alin Deutsch (UC San Diego)

In this talk I will describe how GSQL, TigerGraph’s graph query language, supports the specification of aggregation in graph analytics. GSQL makes several unique design decisions with respect to both the expressive power, the semantics, and the evaluation complexity of the specified aggregation.

read more
Mar 26, 2021

LeanStore: A High-Performance Storage Engine for Modern Hardware

Viktor Leis (Friedrich Schiller University Jena)

LeanStore is a high-performance OLTP storage engine optimized for many-core CPUs and NVMe SSDs. The goal of the project is to achieve performance comparable to in-memory systems when the data set fits into RAM, while being able to fully exploit the bandwidth of fast NVMe SSDs for large data sets.

read more
Mar 26, 2021

Vectorized query processing over encrypted data with DuckDB and Intel SGX

Sam Ansmink (CWI, UvA, and VU)

Data confidentiality is an increasingly important requirement for customers outsourcing databases to the cloud. The common approach to achieve data confidentiality in this context is by using encryption. However, processing queries over encrypted data securely and efficiently remains an open issue. To this day, many different approaches to designing encrypted database management systems (EDBMS) have been suggested, for example by using homomorphic encryption or trusted execution environments such as Intel SGX.

read more
Mar 12, 2021

Three techniques for exploiting string compression in data systems

Peter Boncz (CWI & VU)

Actual data in real-life database often is in the form of strings. Strings take significantly more volume than fixed-size data, causing I/O, network traffic, memory traffic and cache traffic. Further, operations on strings tend to be significantly more expensive than operations on e.g. integers, which CPUs do support quite efficiently (let alone GPUs, TPUs - which even do not acknowledge the existense of string data).

read more
Mar 12, 2021

OtterTune : An Automatic Database Configuration Tuning Service

Andy Pavlo (Carnegie Mellon University)

Database management systems (DBMS) expose dozens of configurable knobs that control their runtime behavior. Setting these knobs correctly for an application’s workload can improve the performance and efficiency of the DBMS. But such tuning requires considerable efforts from experienced administrators, which is not scalable for large DBMS fleets. This problem has led to research on using machine learning (ML) to devise strategies to optimize DBMS knobs for any application automatically.

read more
Feb 26, 2021

Integrating Columnar Techniques and Factorization into GraphflowDB

Semih Salihoglu (University of Waterloo)

Graph database management systems (GDBMSs) in contemporary jargon refer to systems that adopt the property graph model and often power applications such as fraud detection and recommendations that require very fast joins of records that represent many-to-many relationships, often beyond the performance that existing relational systems generally provide. In this talk, I will give an overview of GraphflowDB, which is an in-memory GDBMS we are developing at University of Waterloo.

read more
Feb 26, 2021

1000 days of DuckDB - The Pareto Principle still holds

Hannes Mühleisen (CWI)

It has been almost 1000 days since we started work on DuckDB. In this talk, we reflect on the process and revisit the initial goals. We also describe major additions to the system such as automatic query parallelization.

read more