Three techniques for exploiting string compression in data systems

Peter Boncz (CWI & VU)

Actual data in real-life database often is in the form of strings. Strings take significantly more volume than fixed-size data, causing I/O, network traffic, memory traffic and cache traffic. Further, operations on strings tend to be significantly more expensive than operations on e.g. integers, which CPUs do support quite efficiently (let alone GPUs, TPUs - which even do not acknowledge the existense of string data). Despite this, most academic algorithms and benchmarking focuses on operations on fixed-size data. As such, string processing in data systems deserves more attention and can have significant impact on practice.

In this talk, I will discuss three techniques which can significantly improve the performance of handling large volumes of string data, and discuss how integrating these techniques affects data systems design.

Bio: Peter Boncz holds appointments as tenured researcher at CWI and professor at VU University Amsterdam. His academic background is in core database architecture, with the MonetDB the systems outcome of his PhD – MonetDB much later won the 2016 ACM SIGMOD systems award. He has a track record in bridging the gap between academia and commercial application, receiving the Dutch ICT Regie Award 2006 for his role in the CWI spin-off company Data Distilleries. In 2008 he co-founded Vectorwise around the analytical database system by the same name which pioneered vectorized query execution. He is co-recipient of the 2009 VLDB 10 Years Best Paper Award, and in 2013 received the Humboldt Research Award for his research on database architecture. He also works on graph data management, founding in 2013 the Linked Database Benchmark Council (LDBC), a benchmarking organization for graph database systems.

Slides