Self-Adaptive Gradient Quantization for Geo-Distributed Machine Learning over Heterogeneous and Dynamic Networks

Publisher:
IEEE (Institute of Electrical and Electronics Engineers, Inc.)
Publication Type:
Journal Article
Citation:
IEEE Transactions on Cloud Computing, vol. 11, no. 4, pp. 3483–3496, Oct. 2023
Issue Date:
2023-10-01
Geo-Distributed Machine Learning (Geo-DML) enables geographically dispersed data centers (DCs) to collaboratively train large-scale machine learning (ML) models for various applications. While Geo-DML can achieve excellent performance, it also injects massive data traffic into Wide Area Networks (WANs) to exchange gradients during model training. Such heavy traffic not only incurs network congestion and prolongs training, but also causes the straggler problem when DCs operate in heterogeneous network environments. To alleviate these problems, we propose Self-Adaptive Gradient Quantization (SAGQ) for Geo-DML. In SAGQ, each worker DC adopts a quantization method tailored to its heterogeneous and dynamic link bandwidth, reducing communication overhead and balancing communication time across worker DCs. By doing so, SAGQ speeds up the Geo-DML training process without sacrificing ML model performance. Extensive experiments show that, compared with state-of-the-art techniques, SAGQ reduces the wall-clock time needed to train an ML model by 1.13×–21.31×, and improves model accuracy by 0.11%–2.27% over baselines.
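The core idea, allocating quantization precision per worker according to its link bandwidth so that transfer times are balanced, can be sketched as below. This is a minimal illustration only: the proportional bit-allocation rule, the uniform stochastic quantizer, and the names `choose_bits` and `stochastic_quantize` are assumptions made for exposition, not the paper's actual SAGQ algorithm.

```python
# Minimal sketch of bandwidth-adaptive gradient quantization.
# ASSUMPTION: the proportional bit-allocation rule and the uniform
# stochastic quantizer below are illustrative, not the SAGQ method itself.
import numpy as np

def choose_bits(bw_mbps, max_bw_mbps, full_bits=32, min_bits=2):
    """Allocate quantization bits proportional to link bandwidth, so that
    per-worker transfer time (bits / bandwidth) is roughly equalized."""
    bits = int(round(full_bits * bw_mbps / max_bw_mbps))
    return max(min_bits, min(full_bits, bits))

def stochastic_quantize(grad, bits):
    """Unbiased uniform quantization with stochastic rounding; returns the
    dequantized gradient (in practice only the integer codes are sent)."""
    levels = 2 ** bits - 1
    g_min, g_max = grad.min(), grad.max()
    scale = (g_max - g_min) / levels if g_max > g_min else 1.0
    x = (grad - g_min) / scale
    lower = np.floor(x)
    q = lower + (np.random.rand(*grad.shape) < (x - lower))  # round up w.p. frac(x)
    return q * scale + g_min

# Hypothetical example: three worker DCs with heterogeneous WAN bandwidths.
bandwidths = {"dc_a": 1000.0, "dc_b": 250.0, "dc_c": 100.0}  # Mbps, assumed
grad = np.random.randn(1_000_000).astype(np.float32)
max_bw = max(bandwidths.values())
for dc, bw in bandwidths.items():
    b = choose_bits(bw, max_bw)
    g_hat = stochastic_quantize(grad, b)
    print(f"{dc}: {bw:6.0f} Mbps -> {b:2d} bits, "
          f"quantization MSE = {np.mean((g_hat - grad) ** 2):.2e}")
```

Under this rule, a slow link (e.g., the assumed 100 Mbps one) sends far fewer bits per gradient element than a fast link, so the slowest worker no longer dictates each round's communication time, at the cost of coarser (but unbiased) gradients on slow links.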