Efficient CSV Parsing - On the Complexity of Simple Things

Pedro Holanda

In this talk, we will revisit different CSV parsing implementations in DuckDB and compare them with the current implementation. The bulk of the talk discusses the design and implementation decisions behind DuckDB’s current CSV parser. In particular, we will examine the parallel algorithm, the CSV buffer manager, and the transitions of the CSV state machine. Disclaimer: This talk is not for the faint of heart; some very exotically built CSV files will be depicted.

Pedro is an early contributor to DuckDB and currently works as a software engineer at DuckDB Labs, focusing on core and integration aspects of DBMS technology. He completed his PhD at the Database Architectures group at CWI, researching Indexes for Interactive Data Analysis.