When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must configure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence; system parameters (e.g. the number of workers and their communication topology) impact training performance. Some of these parameters, such as the number of workers, may also change in elastic machine learning scenarios. Current systems offer poor support for adapting such parameters during training. In this talk, I will describe our recent work on KungFu, a distributed deep learning library for TensorFlow and PyTorch that is designed to enable adaptive and elastic training. KungFu allows users to express high-level Adaptation Policies (APs) that describe how to change hyper- and system parameters during training. APs take real-time monitored metrics (e.g. signal-to-noise ratios) as input and trigger control actions (e.g. cluster rescaling or synchronisation strategy updates). For execution, APs are translated into monitoring and control operators that are embedded in the dataflow graph. APs exploit an efficient asynchronous collective communication layer, which ensures concurrency and consistency of monitoring and adaptation operations.
[This work has appeared at USENIX OSDI 2020.]
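To make the idea of an Adaptation Policy more concrete, the following is a minimal, self-contained Python sketch of the monitor-then-act pattern described in the abstract. It does not use the KungFu API: `SimulatedCluster`, `SnrScalingPolicy`, `run_training`, and all parameter names are hypothetical, and the rule itself (double the number of workers when a monitored gradient signal-to-noise ratio drops below a threshold) is only an illustrative assumption, not the policy from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimulatedCluster:
    """Hypothetical stand-in for a real cluster; records every rescaling request."""
    num_workers: int
    resize_log: List[int] = field(default_factory=list)

    def resize(self, new_size: int) -> None:
        self.num_workers = new_size
        self.resize_log.append(new_size)


class SnrScalingPolicy:
    """Hypothetical policy: double the worker count when the gradient SNR is low."""

    def __init__(self, snr_threshold: float, max_workers: int, check_every: int):
        self.snr_threshold = snr_threshold
        self.max_workers = max_workers
        self.check_every = check_every

    def after_step(self, step: int, grad_snr: float, cluster: SimulatedCluster) -> None:
        # Monitoring: only evaluate the metric every `check_every` steps.
        if step % self.check_every != 0:
            return
        # Control action: a low SNR is taken (by assumption) as a signal to
        # scale out, capped at `max_workers`.
        if grad_snr < self.snr_threshold and cluster.num_workers < self.max_workers:
            cluster.resize(min(cluster.num_workers * 2, self.max_workers))


def run_training(snr_trace: List[float], policy: SnrScalingPolicy,
                 cluster: SimulatedCluster) -> None:
    """Drive the policy with a pre-recorded SNR trace instead of real training."""
    for step, snr in enumerate(snr_trace, start=1):
        # A real training step (forward pass, backward pass, synchronisation)
        # would run here and produce the monitored metric.
        policy.after_step(step, snr, cluster)
```

In this sketch the policy is called explicitly after each step. According to the abstract, KungFu instead translates policies into monitoring and control operators embedded in the dataflow graph, which this simplified loop does not attempt to model.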
Peter Pietzuch is a Professor of Distributed Systems at Imperial College London, where he leads the Large-scale Data & Systems (LSDS) group (http://lsds.doc.ic.ac.uk). His research focuses on the design and engineering of scalable, reliable and secure large-scale software systems, with a particular interest in performance, data management and security issues. He has published papers in premier scientific venues, including OSDI/SOSP, SIGMOD, VLDB, ASPLOS, USENIX ATC, EuroSys, SoCC, ICDCS, DEBS, and Middleware. Currently, he is a Visiting Researcher with Microsoft Research and serves as the Director of Research in the Department, the Chair of the ACM SIGOPS European Chapter, and an Associate Editor for IEEE TKDE and TCC. Before joining Imperial College London, he was a postdoctoral fellow at Harvard University. He holds PhD and MA degrees from the University of Cambridge.