Cost-Efficient Distributed Learning

Abstract

We consider distributed gradient descent as the main building block of several machine learning algorithms. Gradient descent is an iterative algorithm that consists of evaluating the gradient of a loss function at the current model and the considered data. A main node sends its data to $n$ worker devices. At each iteration, the workers evaluate the gradient on the data assigned to them and return the result to the main node. The main node aggregates the obtained evaluations and updates the model. This setting is known to suffer from the presence of stragglers, i.e., slow or unresponsive workers, whose delays may outweigh the benefit of distributing the computations. It is shown that, under some constraints on the response times of the workers, ignoring the stragglers is equivalent to running stochastic gradient descent (SGD). To mitigate the effect of stragglers, the main node only aggregates the results returned by the fastest $k \leq n$ workers, where $k$ is a fixed parameter. The choice of $k$, which can be optimized a priori, presents a trade-off between the total run-time of the algorithm and the quality of the obtained model. In this talk, we design a strategy that adaptively increases $k$ throughout the run of the algorithm to obtain the best possible run-time while maintaining the best quality of the model. The challenge is to derive a relationship between the value of $k$, the time spent per iteration, and the quality of the obtained model. Based on this relationship, we derive the times at which the main node needs to increase $k$. Efficiency here is measured by the number of communicated results. We show that this strategy outperforms strategies with a fixed $k$. We then introduce a more cost-efficient strategy that only assigns tasks to $k$ workers per iteration. To learn which $k$ workers are the fastest, we rely on the theory of multi-armed bandits. We briefly explain the theory of multi-armed bandits as a tool of independent interest and design a cost-efficient scheme that assigns computational tasks to the workers while learning their speeds. We show that, for a fixed cost (measured by the number of times workers are employed), the introduced strategy outperforms the previous schemes. However, this strategy has a longer run-time due to the overhead of learning the speeds of the workers. We validate the results and concretize the theoretical findings through numerical simulations. This talk summarizes results from joint works with Antonia Wachter-Zeh, Deniz Gündüz, Maximilian Egger, Parimal Parag, Salim El Rouayheb, Serge Kas Hanna, and Venkat Dasari.
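To make the fastest-$k$ idea concrete, the following is a minimal simulation sketch of distributed gradient descent in which the main node waits only for the $k$ fastest of $n$ workers and gradually increases $k$ over the run. It is illustrative only: the linear-regression loss, the exponential response-time model, the schedule for increasing $k$, and all names and parameters are assumptions, not the strategies analyzed in the talk.

```python
# Illustrative sketch of fastest-k distributed gradient descent with an
# adaptively increasing k. The loss, delay model, and schedule are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares data split across n workers.
n_workers, d, samples_per_worker = 10, 5, 20
true_w = rng.normal(size=d)
X = [rng.normal(size=(samples_per_worker, d)) for _ in range(n_workers)]
y = [Xi @ true_w + 0.1 * rng.normal(size=samples_per_worker) for Xi in X]

# Each worker has its own mean response time, unknown to the main node.
mean_delays = rng.uniform(0.1, 1.0, size=n_workers)

def worker_gradient(i, w):
    """Gradient of the squared loss on worker i's data, plus a random delay."""
    grad = X[i].T @ (X[i] @ w - y[i]) / samples_per_worker
    delay = rng.exponential(mean_delays[i])
    return grad, delay

def fastest_k_step(w, k, lr=0.1):
    """One iteration: aggregate only the k fastest of the n workers."""
    results = [worker_gradient(i, w) for i in range(n_workers)]
    results.sort(key=lambda r: r[1])        # order by response time
    grads, delays = zip(*results[:k])       # keep the k fastest results
    w = w - lr * np.mean(grads, axis=0)     # aggregate and update the model
    return w, delays[-1]                    # iteration time = k-th arrival

# Adaptive schedule (illustrative): start with a small k (fast but noisy
# updates) and increase it over time to reduce the gradient noise.
w, elapsed = np.zeros(d), 0.0
for t in range(200):
    k = min(n_workers, 2 + t // 50)
    w, dt = fastest_k_step(w, k)
    elapsed += dt

print("elapsed time:", round(elapsed, 2), "error:", np.linalg.norm(w - true_w))
```

In this toy setup, a small $k$ early on keeps the per-iteration time short while the model is still far from optimal, and a larger $k$ later reduces the variance of the aggregated gradient; the talk derives when these increases should happen rather than fixing them ad hoc as done above.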


Rawad Bitar, TUM

Rawad Bitar is a senior researcher and lecturer at the Technical University of Munich, where he is pursuing a habilitation (accreditation for thesis supervision). He received the Diploma in computer and communication engineering from the Lebanese University in 2013, the M.S. degree from the Doctoral School of the Lebanese University in 2014, and the Ph.D. degree in electrical engineering from Rutgers, The State University of New Jersey, in 2020. Before starting the habilitation in 2022, he spent two years as a Post-Doctoral Researcher at the Technical University of Munich. His research interests lie in the broad area of information theory and coding theory, with a focus on coding for information-theoretically private and secure distributed systems with applications to machine learning, and on coding for DNA-based data storage.