Grouped aggregation is a core data analysis operation. It is particularly important for large-scale analytical processing (“OLAP”) because it computes statistical summaries of huge tables. The main difficulty in computing grouped results is that the groups can occur in the input table in any order, making it hard to match grouping keys efficiently. The input can of course be sorted first, but this is computationally expensive. Building a hash table has a lower computational complexity than sorting and is therefore generally preferred, but it requires collision handling. And how does parallelism work together with hash tables? In general, the answer is unfortunately “badly”: hash tables are delicate structures that do not cope well with parallel modification. In this talk, we present DuckDB’s highly optimized parallel aggregation capability for fast and scalable summarization. We discuss many of the optimizations in DuckDB’s hash aggregate implementation that allow it to scale efficiently to many groups, rows, and threads. Finally, we discuss some ideas for future work.
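The two points above can be sketched in a few lines of Python (an illustrative analogy, not DuckDB’s actual C++ implementation): a hash table matches grouping keys in a single O(n) pass regardless of input order, and a common answer to the parallel-modification problem is to give each thread its own local table and merge the partial results afterwards.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

rows = [("nl", 10), ("de", 20), ("nl", 5), ("fr", 7), ("de", 1)]

# Hash-based aggregation: one pass over the input, no sorting needed;
# the dict's hash table matches equal grouping keys in any input order.
sums = Counter()
for key, value in rows:
    sums[key] += value
# sums == {"nl": 15, "de": 21, "fr": 7}

# Parallelism without a shared hash table: each worker aggregates its own
# chunk into a thread-local table, and the partials are merged at the end.
def local_aggregate(chunk):
    local = Counter()
    for key, value in chunk:
        local[key] += value
    return local

chunks = [rows[0::2], rows[1::2]]  # stand-in for a per-thread row split
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = pool.map(local_aggregate, chunks)

total = sum(partials, Counter())  # merge step combines the local tables
assert total == sums
```

The merge step is where such schemes pay their cost: the shared structure is only touched once per group per thread, rather than once per input row.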
Hannes Mühleisen is a senior researcher in the Database Architectures group at the Centrum Wiskunde & Informatica (CWI). He is also the co-founder and CEO of DuckDB Labs. He received his PhD from the Freie Universität Berlin in 2013.