Evaluating Matching Techniques with Valentine

Christos Koutras (Delft University of Technology)

Data scientists today search large data lakes to discover and integrate datasets. To bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use. Nowadays, schema matching serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method’s success depends heavily on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad-hoc fashion, due to the lack of openly available datasets with ground truth, reference method implementations, and evaluation metrics.

In this talk, I will present how we address the problem of evaluating the effectiveness and efficiency of schema matching methods for the specific needs of dataset discovery with Valentine: an extensible open-source experiment suite for executing and organizing large-scale automated matching experiments on tabular data. In addition, I will talk about lessons learned from our experimentation with Valentine, and how we designed a scalable system, together with a user-friendly GUI, that facilitates deploying Valentine both for evaluation and for finding links among numerous datasets in a data lake. Finally, I will briefly discuss future prospects for matching tabular data in data lakes.

Christos is a PhD candidate in the Web Information Systems group of the Faculty of Electrical Engineering, Mathematics and Computer Science at Delft University of Technology. He is supervised by Asterios Katsifodimos and Christoph Lofi. His research focuses on data integration, specifically schema matching; he has also worked on spatial data management. He holds a Master of Philosophy (MPhil) in Computer Science from HKUST, where he was supervised by Prof. Dimitris Papadias. Prior to that, he obtained his 5-year Diploma in Electrical and Computer Engineering from the National Technical University of Athens.
