Open file formats typically use a set of lightweight compression schemes to compress individual columns, taking advantage of data patterns found within the values of each column. However, by compressing columns separately, we do not consider correlations that may exist between columns that may allow us to compress more effectively. Real-world datasets exhibit many such column correlations and research how they can be exploited for compression. In this talk, we introduce C3 (Compressing Correlated Columns), a new compression framework which can exploit correlations between columns for compression. We designed C3 on top of typical lightweight compression infrastructure, and added six new multi-column compression schemes which exploit correlations. We designed our multi-column compression schemes based on correlations we found in real-world datasets, but new compression schemes exploiting other types of correlations can easily be added. C3 uses a sampling-based algorithm to choose the most effective scheme to compress each column. We evaluated the effectiveness of C3 on the Public BI benchmark, containing real-world datasets, and achieved around 20% higher compression ratios compared to using only typical single-column compression schemes.
Thomas Glas is pursuing his master’s degree in computer science at the Technical University of Munich. He joined the Database Architecture Group at CWI in May this year to write his master’s thesis on columnar data compression.