In data science, floating-point data is more prominent than in traditional database scenarios. IEEE 754 doubles cannot exactly represent most real values, introducing rounding errors in computations and in [de]serialization to text. These rounding errors inhibit the use of existing lightweight compression schemes such as Delta and Frame Of Reference (FOR), but new schemes were recently proposed: Gorilla, Chimp, Chimp128, PseudoDecimals (PDE), Elf, and Patas. However, their compression ratios are no better than those of general-purpose compressors such as zstd, while their [de]compression is much slower than Delta and FOR. We propose and evaluate ALP, which significantly improves on these previous schemes in both speed and compression ratio. We created ALP after carefully studying the datasets used to evaluate the previous schemes. To obtain speed, ALP is designed to fit vectorized execution. This turned out to be key to also improving the compression ratio, as we found in-vector commonalities that create compression opportunities. ALP is an adaptive scheme: it uses a strongly enhanced version of PseudoDecimals for doubles that originated as decimals, and otherwise uses vectorized compression of the front bits. Its high speed stems from our implementation in scalar code that auto-vectorizes, and from an efficient two-stage compression algorithm that first samples row-groups and then vectors.
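To make the decimal-encoding idea concrete, below is a minimal C++ sketch (not the authors' implementation; the helper name try_decimal_encode and the parameter range are hypothetical) of the exactness test at the heart of a PseudoDecimals-style encoder: a double that originated as a decimal can be scaled by a power of ten, rounded to an integer, and that integer reproduces the original double exactly when divided back. Values that fail the test would be handled as exceptions or sent down another code path.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <optional>

// Hypothetical helper: test whether double d encodes losslessly as an
// integer with e decimal fraction digits (assumed 0 <= e <= 12 here).
std::optional<int64_t> try_decimal_encode(double d, int e) {
    static constexpr double kPow10[] = {1e0, 1e1, 1e2,  1e3,  1e4,  1e5, 1e6,
                                        1e7, 1e8, 1e9, 1e10, 1e11, 1e12};
    const double scaled = std::round(d * kPow10[e]);
    // Reject magnitudes that do not fit in int64_t (the cast would be UB).
    if (std::fabs(scaled) > 9.2e18) return std::nullopt;
    const int64_t digits = static_cast<int64_t>(scaled);
    // Lossless only if dividing back reproduces exactly the same double.
    // IEEE 754 division is correctly rounded, so 123 / 100.0 yields
    // precisely the double that parsing "1.23" produces.
    if (static_cast<double>(digits) / kPow10[e] == d) return digits;
    return std::nullopt;  // caller records an exception or falls back
}

int main() {
    if (auto v = try_decimal_encode(1.23, 2))       // succeeds: 123
        std::printf("encoded: %lld\n", static_cast<long long>(*v));
    if (!try_decimal_encode(3.141592653589793, 2))  // fails: 3.14 != pi
        std::printf("exception: not a 2-digit decimal\n");
    return 0;
}
```

Integers obtained this way tend to span a narrow range within a vector, which is what makes FOR plus bit-packing pay off afterwards. A production encoder would also differ in the details: per the abstract, ALP chooses its parameters adaptively by sampling row-groups and then vectors, and it would likely replace the division shown here with multiplication by a precomputed reciprocal for speed.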
MSc student at VU Amsterdam & UvA, currently doing research on data compression in the Database Architectures research group at CWI. Former researcher on opinion mining and social network analysis at ESPOL University (Ecuador); former data engineering intern at CERN and Amazon EU.