The Dutch Seminar
on Data Systems Design

An initiative to bring together research groups working on data systems in Dutch universities and research institutes.

Fridays 4–5:30 pm
bi-weekly

We hold bi-weekly talks on Fridays from 4:00 PM to 5:30 PM CET, for and by researchers and practitioners who design and implement data systems. The objective is to establish a new forum where the Dutch data systems community can come together, foster collaborations between its members, and bring in high-quality international speakers. We invite all researchers working on related topics, especially PhD students, to join the events. It is an excellent opportunity to receive early feedback from researchers in your field.

Past talks

Jul 08, 2022

Data Structures on Computational Storage Drives

Ilaria Battiston

This talk will present an introduction to data management systems on Computational Storage Drives, focusing on hardware devices that implement on-the-fly transparent compression. After giving an overview of how they work and their properties, we explore some commonly used and custom-designed data structures to understand their benefits and issues when subject to compression. We then present some ways to optimize and tailor them in order to fully exploit the capabilities of the underlying storage device.

Ilaria is a newly appointed PhD student in the Database Architecture group at CWI, focusing on OLAP data management techniques.

Jul 08, 2022

Data Science through the Looking Glass and what we found there

Bojan Karlas

The recent success of machine learning (ML) has led to an explosive growth of new systems and methods for supporting the ML development process. This process entails both data science projects for performing predictive analytics and the development of learned components that are part of broader software systems. However, this explosive growth poses a real challenge for system builders in their effort to develop the most up-to-date tools that are well integrated with the rest of the ML software stack. A lot of new development is driven by direct customer feedback and anecdotal evidence. Even though this is indeed a valuable source of information, we are usually able to survey only a handful of sources, so we are likely to end up with a biased or incomplete picture. To support the next generation of ML development systems, we require insights that can only be drawn from a larger-scale empirical analysis of the methods and habits of ML practitioners. In this work, we set out to capture this panorama through a wide-angle lens, by performing the largest analysis of data science projects to date. Specifically, we analyze: (a) over 6M Python notebooks publicly available on GitHub, (b) over 2M enterprise data science (DS) pipelines developed within Microsoft, and (c) the source code and metadata of over 900 releases from 12 important DS libraries. The analysis we perform ranges from coarse-grained statistical characterizations to analysis of library imports and pipelines, including comparative studies across datasets and time. In this talk, we will go over the findings that we were able to gather from this extensive study. Furthermore, we will cover some key insights and takeaways that could be useful in supporting the design decisions for building next-generation ML development systems.

Bojan recently completed his PhD in the Systems Group of ETH Zurich, advised by Prof. Ce Zhang. His research focuses on discovering systematic methods for managing the machine learning development process. Specifically, he focuses on data debugging, which is the process of performing targeted data quality improvements in order to improve the quality of end-to-end machine learning pipelines. He has done internships at Microsoft, Oracle and Logitech. Before ETH, he obtained his master’s degree at EPFL in Lausanne and his bachelor’s at the University of Belgrade. This coming fall, he will be joining the group of Prof. Kun-Hsing Yu at Harvard Medical School to work on novel methods for debugging ML workflows used in biomedical applications.

Jun 24, 2022

Data Management for Emerging Problems in Large Networks

Arijit Khan

Graphs are widely used in many application domains, including social networks, knowledge graphs, biological networks, software collaboration, geo-spatial road networks, and interactive gaming, among many others. One major challenge for graph querying and mining is that non-professional users are not familiar with the complex schema and information descriptions. It becomes hard for users to formulate a query (e.g., SPARQL or an exact subgraph pattern) that can be properly processed by existing systems. As an example, Freebase, which powers Google’s knowledge graph, alone has over 22 million entities and 350 million relationships in about 5428 domains. Before users can query anything meaningful over this data, they are often overwhelmed by the daunting task of attempting to even digest and understand it. Without knowing the exact structure of the data and the semantics of the entity labels and their relationships, can we still query them and obtain the relevant results? In this talk, I shall give an overview of our user-friendly, embedding-based, scalable techniques for querying big graphs, including heterogeneous networks. I shall conclude by discussing our latest progress on solving emerging problems in uncertain graphs, graph mining, and machine learning on graphs.

Arijit Khan is an associate professor in the Department of Computer Science, Aalborg University, Denmark. He earned his PhD from the Department of Computer Science, University of California, Santa Barbara, USA, and did a post-doc in the Systems group at ETH Zurich, Switzerland. He has been an assistant professor in the School of Computer Science and Engineering, Nanyang Technological University, Singapore. Arijit received the prestigious IBM PhD Fellowship in 2012-13. He has published more than 60 papers in premier database and data mining conferences and journals, including ACM SIGMOD, VLDB, IEEE TKDE, IEEE ICDE, SIAM SDM, USENIX ATC, EDBT, The Web Conference (WWW), ACM WSDM, and ACM CIKM. Arijit co-presented tutorials on emerging graph queries and big graph systems at IEEE ICDE 2012 and at VLDB (2017, 2015, and 2014). He has served on the program committees of ACM KDD, ACM SIGMOD, VLDB, IEEE ICDE, IEEE ICDM, EDBT, and ACM CIKM, and on the senior program committee of WWW. Arijit served as co-chair of the Big-O(Q) workshop co-located with VLDB 2015 and wrote a book on uncertain graphs in Morgan & Claypool’s Synthesis Lectures on Data Management. He contributed invited chapters and articles on big graph querying and mining to the ACM SIGMOD blog, the Springer Handbook of Big Data Technologies, and the Springer Encyclopedia of Big Data Technologies. He has been invited to give tutorials and talks in 10 countries, including at the National Institute of Informatics (NII) Shonan Meeting on “Graph Database Systems: Bridging Theory, Practice, and Engineering” (Japan, 2018), the Asia Pacific Web and Web-Age Information Management Joint Conference on Web and Big Data (APWeb-WAIM 2017), the International Conference on Management of Data (COMAD 2016), and the Dagstuhl Seminar on graph algorithms and systems (2014 and 2019) at Schloss Dagstuhl - Leibniz Center for Informatics, Germany.
Dr Khan has served as an associate editor of IEEE TKDE since 2019, as proceedings chair of EDBT 2020, and as IEEE ICDE TKDE poster track co-chair for 2023.
