Shredding deeply nested JSON, one vector at a time

Laurens Kuiper

JSON is a popular semi-structured data format. Although the data is semi-structured, users often want to analyze it in a structured way, e.g., by querying JSON log files to find out what their users are doing. Analytical database systems would be the tool of choice for this, but many of them cannot process semi-structured data or the nested data, such as OBJECTs and ARRAYs, found in JSON. DuckDB, however, supports efficient columnar STRUCT and LIST types and can therefore represent the same nesting as JSON. Since version 0.7.0, DuckDB can read JSON files directly as if they were tables, with automatic schema detection. In this talk, I will explain how DuckDB reads JSON and transforms it into vectors for efficient analytics.
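To give a feel for what "shredding" nested JSON into columns means, here is a minimal conceptual sketch in plain Python. It is not DuckDB's implementation: the function name `shred`, the dictionary-based column layout, and the offsets-based LIST encoding are all assumptions chosen for illustration. It also assumes every field has one consistent type across records, which real schema detection cannot take for granted.

```python
import json

def shred(values):
    """Turn a list of parsed JSON values into a nested columnar layout.

    Hypothetical layout (illustration only):
      STRUCT -> one child column per key, plus a validity mask
      LIST   -> offsets into one flattened child column, plus a validity mask
      SCALAR -> a plain list of values (None marks a missing value)
    """
    # Pick the column kind from the first non-null value.
    sample = next((v for v in values if v is not None), None)
    validity = [v is not None for v in values]

    if isinstance(sample, dict):
        # Collect keys in first-seen order across all records.
        keys = []
        for v in values:
            for k in (v or {}):
                if k not in keys:
                    keys.append(k)
        children = {
            k: shred([v.get(k) if v is not None else None for v in values])
            for k in keys
        }
        return {"type": "STRUCT", "validity": validity, "children": children}

    if isinstance(sample, list):
        # Flatten all elements; row i owns flat[offsets[i]:offsets[i + 1]].
        offsets, flat = [0], []
        for v in values:
            if v is not None:
                flat.extend(v)
            offsets.append(len(flat))
        return {"type": "LIST", "validity": validity,
                "offsets": offsets, "child": shred(flat)}

    return {"type": "SCALAR", "values": list(values)}

# Three newline-delimited JSON records with a nested object and an array.
lines = [
    '{"user": {"id": 1, "name": "ann"}, "tags": ["a", "b"]}',
    '{"user": {"id": 2}, "tags": []}',
    '{"user": null, "tags": ["c"]}',
]
col = shred([json.loads(line) for line in lines])
```

After shredding, each leaf such as `user.id` is a flat list that can be scanned on its own, which is the property that makes columnar analytics over nested data efficient.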

Laurens is a PhD student in the Database Architectures group at CWI in Amsterdam. He is also a software developer at DuckDB Labs. His research interests include OLAP systems, specifically graceful performance degradation when data sizes exceed available memory.