At Amazon Neptune, we work backwards from our customers. One insight that we got from listening to customers is that, in many cases where they explore Neptune as a solution to their problems, it’s primarily “just about graph”: they want to use the relationships in their data to solve business problems using knowledge graphs, identity graphs, fraud graphs, and more.
We study the problem of temporal-clique subgraph pattern matching. In such patterns, edges are required to jointly overlap in time within a given temporal window in addition to forming a topological sub-structure. This problem arises in many application domains, e.g., in social networks, life sciences, smart cities, telecommunications, and others.
DuckDB has recently switched to a push-based execution model from the initial pull-based execution model.
Analytical queries virtually always involve aggregation and statistics. SQL offers a wide range of functionalities to summarize data such as associative aggregates, distinct aggregates, ordered-set aggregates, grouping sets, and window functions. In this work, we propose a unified framework for advanced statistics that composes all flavors of complex SQL aggregates from low-level plan operators.
Sorting is one of the most well studied problems in Computer Science. Research in this area forms the basis of sorting in database systems but focuses mostly on sorting large arrays. Sorting is more complex for relational data as many different types need to be supported, as well as NULL values. There can also be multiple order clauses.
The coming end of Moore’s law requires that data systems be more judicious with computation and resources as the growth in data outpaces the availability of computational resources. Current database systems are eager and aggressively consume resources to immediately and quickly complete the task at hand.
In this talk, I will present preliminary work on a new architecture (Data Station) to facilitate data sharing within and across organizations. Data Stations depart from modern data lakes in that both data and derived data products, such as machine learning models, are sealed and cannot be directly seen, accessed, or downloaded by anyone.
Prediction queries are widely used across industries to perform advanced analytics and draw insights from data. They include a data processing part (e.g., for joining, filtering, cleaning, featurizing the datasets) and a machine learning (ML) part invoking one or more trained models to perform predictions. These parts have so far been optimized in isolation, leaving significant opportunities for optimization unexplored.
The wide adoption of machine learning (ML) in diverse application domains is resulting in an explosion of available models described by, and stored in model repositories. In application contexts where inference needs are dynamic and subject to strict execution constraints – such as in video processing – the manual selection of an optimal set of models from a large model repository is a nearly impossible task and practitioners typically settle for models with a good average accuracy and performance.
Late 2000s and early 2010s have seen the rise of data-intensive systems optimized for in-memory execution. Today, it has been increasingly clear that just optimizing for main memory is neither economically viable nor strictly necessary for high performance. Modern SSDs, such as Z-NAND and Optane, can access data at a latency of around 10 microseconds.
atabase architecture, while having been studied for four decades now, has delivered only a few designs with well-understood properties. These few are followed by most actual systems. Acquiring more knowledge about the design space is a very time-consuming process that requires manually crafting prototypes with a low chance of generating material insight.
Data quality validation plays an important role in ensuring the correct behaviour of productive machine learning (ML) applications and services. Observing a lack of existing solutions for quality control in medium-sized production ML systems, we developed DuckDQ: A lightweight and efficient Python library for protecting machine learning pipelines from data errors.
Teseo is a new system for the storage and analysis of dynamic structural graphs in main-memory, with the addition of transactional support. It introduces a novel design based on sparse arrays, large arrays interleaved with gaps, and a fat tree, where the graph is ultimately stored. Our design contrasts with early systems for the analysis of dynamic graphs, which often lack transactional support and are anchored to a vertex table as a primary index.
The hardware environment has changed rapidly in recent years: Many cores, multiple sockets, and large amounts of main memory have become a commodity. To benefit from these highly parallel systems, the software has to be adapted. Sophisticated latch-free data structures and algorithms are often meant to address the situation.
Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use.
When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must configure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence; system parameters (e.g. the number of workers and their communication topology) impact training performance. Some of these parameters, such as the number of workers, may also change in elastic machine learning scenarios. In current systems, adapting such parameters during training is ill-supported.
Managing state efficiently in modern applications written for the cloud and edge is hard. In the FASTER project, we have been creating building blocks such as FasterKV and FasterLog to alleviate this problem using techniques such as epoch protection, tiered storage, and asynchronous recoverability.
The cloud promised to make computing accessible by making compute resources easily available and by simplifying computing challenges. While the former has definitely been achieved, there is room for improvement to simplify (distributed) computing challenges. Deployment, scalability and state management are not straightforward to achieve using most cloud offerings. Especially modern microservice architectures face many challenges that distract developers from providing business value. Serverless computing models (like Function-as-a-Service) can greatly simplify deployment and scalability challenges, however the problem of state management remains unsolved by most cloud offerings.
While OLTP engines excel at maintaining invariants under transactional workloads, and OLAP engines excel at ad-hoc analytics, the relational database is not presently an excellent tool for maintaining the results of computations as data change. This space is currently occupied largely by microservices, bespoke tools that suffer from all the problems you might expect of systems that do not provide many of the ACID properties, and which anecdotally consume engineering departments with their maintenance overhead.
Stream processing lies in the backbone of modern businesses, being employed for mission critical applications such as real-time fraud detection, car-trip fare calculations, traffic management, and stock trading. Large-scale applications are executed by scale-out stream processing systems on thousands of long-lived operators, which are subject to failures.
There is continuous demand for database-literate professionals in today’s market due to widespread usage of relational database management system (RDBMS) in the commercial world. Such commercial demand has played a pivotal role in the offering of database systems course as part of an undergraduate computer science (CS) degree program in major universities around the world. Furthermore, not all working adults dealing with RDBMS have taken an undergraduate database course. Hence, they often need to undergo on-the-job training or attend relevant courses in higher institutes of learning to acquire database literacy. Database courses in major institutes rely on textbooks, lectures, and off-the-shelf RDBMS to impart relevant knowledge such as how SQL queries are processed.
I report on a community effort between industry and academia to shape the future of property graph constraints. The standardization for a property graph query language is currently underway through the ISO Graph Query Language (GQL) project. Our position is that this project should pay close attention to schemas and constraints, and should focus next on key constraints.
Evaluation of complex graph pattern queries on large graphs often leads to “explosion” of intermediate results (IR) which, in turn, considerably slows down query processing. In this talk, I will present WireFrame, our recent two-step factorization-based solution which aims to drastically reduce the IR during query processing.
In this talk I will describe how GSQL, TigerGraph’s graph query language, supports the specification of aggregation in graph analytics. GSQL makes several unique design decisions with respect to both the expressive power, the semantics, and the evaluation complexity of the specified aggregation.
LeanStore is a high-performance OLTP storage engine optimized for many-core CPUs and NVMe SSDs. The goal of the project is to achieve performance comparable to in-memory systems when the data set fits into RAM, while being able to fully exploit the bandwidth of fast NVMe SSDs for large data sets.
Data confidentiality is an increasingly important requirement for customers outsourcing databases to the cloud. The common approach to achieve data confidentiality in this context is by using encryption. However, processing queries over encrypted data securely and efficiently remains an open issue. To this day, many different approaches to designing encrypted database management systems (EDBMS) have been suggested, for example by using homomorphic encryption or trusted execution environments such as Intel SGX.
Actual data in real-life database often is in the form of strings. Strings take significantly more volume than fixed-size data, causing I/O, network traffic, memory traffic and cache traffic. Further, operations on strings tend to be significantly more expensive than operations on e.g. integers, which CPUs do support quite efficiently (let alone GPUs, TPUs - which even do not acknowledge the existense of string data).
Database management systems (DBMS) expose dozens of configurable knobs that control their runtime behavior. Setting these knobs correctly for an application’s workload can improve the performance and efficiency of the DBMS. But such tuning requires considerable efforts from experienced administrators, which is not scalable for large DBMS fleets. This problem has led to research on using machine learning (ML) to devise strategies to optimize DBMS knobs for any application automatically.
Graph database management systems (GDBMSs) in contemporary jargon refer to systems that adopt the property graph model and often power applications such as fraud detection and recommendations that require very fast joins of records that represent many-to-many relationships, often beyond the performance that existing relational systems generally provide. In this talk, I will give an overview of GraphflowDB, which is an in-memory GDBMS we are developing at University of Waterloo.
It has been almost 1000 days since we started work on DuckDB. In this talk, we reflect on the process and revisit the initial goals. We also describe major additions to the system such as automatic query parallelization.