Database Schemas in the Wild - What Can We Learn from a Large Corpus of Relational Database Schemas?

Till Döhmen

Tabular data collections, such as GitTables, are important sources of real-world tabular data. They provide training data for table representation learning approaches that advance the state of the art on problems such as semantic annotation, data imputation, and automated error detection. However, such datasets are limited to individual tables and contain no schema information about database constraints (uniqueness, NOT NULL, etc.) or relationships to other tables. Real-world database schemas are hard to come by: the largest public repository of databases contains about 150 relational databases, so the community needs a new dataset. We therefore created GitSchemas, a large corpus of database schema information extracted from SQL scripts in public code repositories. It contains highly accurate schema information for more than 150k schemas, 1M tables (including column names, data types, and database constraints), and almost 600k foreign key relationships. We believe that schema information alone (without data) at this scale will be suitable for benchmarking and improving existing approaches to a variety of relevant data management problems, such as foreign key detection and constraint prediction, while also presenting an opportunity to learn more about how database systems are used in practice.

Till Döhmen is a PhD student at RWTH Aachen University, guest researcher at the UvA Intelligent Data Engineering Lab (INDE Lab), and research engineer at Hopsworks. His research interests lie at the intersection of data management and machine learning systems.