We propose an efficient approximate top-k gradient sparsification algorithm on GPUs that compresses the communication data with minimal computation overhead.
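To illustrate the idea, the following is a minimal sketch of one common way to approximate top-k selection: estimating a magnitude threshold from a small random sample instead of sorting the full gradient. This is our own illustration under that assumption, not the paper's exact algorithm; the function name `approx_topk` and the parameter `sample_ratio` are hypothetical.

```python
import numpy as np

def approx_topk(grad, k, sample_ratio=0.01, rng=None):
    """Approximate top-k sparsification (illustrative sketch only):
    estimate the magnitude threshold from a random sample, then keep
    every element whose magnitude exceeds that threshold."""
    rng = rng or np.random.default_rng(0)
    flat = np.abs(grad.ravel())
    n = flat.size
    # Sample a small subset of magnitudes (at least k of them).
    m = min(max(int(n * sample_ratio), k), n)
    sample = rng.choice(flat, size=m, replace=False)
    # Pick the sample quantile so that roughly k elements of the
    # full gradient are expected to exceed the threshold.
    thresh = np.quantile(sample, 1.0 - k / n)
    idx = np.nonzero(flat >= thresh)[0]
    return idx, grad.ravel()[idx]

g = np.random.default_rng(1).standard_normal(100_000)
idx, vals = approx_topk(g, k=1000)
```

The selected count is only approximately k (it fluctuates with the sample), which is the trade-off that avoids a full sort on the GPU.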

We present a novel hierarchical communication algorithm to aggregate sparsified gradients, better utilizing the bandwidth of GPU clusters with fast intra-node interconnects and slow inter-node interconnects.
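A two-level aggregation scheme of this kind can be sketched as below. This is a plain-numpy simulation of the data flow only (no real communication library), under the assumption that gradients are first reduced within each node over the fast links so that only one message per node crosses the slow inter-node network; the function name `hierarchical_allreduce` is hypothetical.

```python
import numpy as np

def hierarchical_allreduce(grads, gpus_per_node):
    """Simulated two-level gradient aggregation.
    Step 1: reduce within each node (fast intra-node links, e.g. NVLink).
    Step 2: all-reduce the per-node partial sums across nodes
            (slow inter-node network carries one message per node).
    Step 3: broadcast the global sum back to every GPU in the node."""
    nodes = [grads[i:i + gpus_per_node]
             for i in range(0, len(grads), gpus_per_node)]
    node_sums = [np.sum(node, axis=0) for node in nodes]   # step 1
    global_sum = np.sum(node_sums, axis=0)                 # step 2
    return [global_sum.copy() for _ in grads]              # step 3

# 2 nodes x 2 GPUs, each GPU holding a small gradient vector.
grads = [np.full(4, float(i)) for i in range(4)]
out = hierarchical_allreduce(grads, gpus_per_node=2)
```

The design point is that the slow inter-node links see one (sparsified) message per node rather than one per GPU, which is what improves bandwidth utilization.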

We perform distributed training experiments with two types of models, CNNs and Transformers, to demonstrate the effectiveness of the proposed techniques on a Tencent Cloud cluster of 16 nodes connected by 25 Gbps Ethernet (each node has 8 Nvidia Tesla V100-32GB GPUs with NVLink). Experimental results show that our system trains 25%-40% faster than existing state-of-the-art systems.
