High-Speed Data Communication with Advanced Networks in Large Language Model Training

IEEE Micro, 2024

Liuyao Dai, Hao Qi, Weicong Chen, Xiaoyi Lu

Abstract

Large language models (LLMs) like Generative Pre-trained Transformer, Bidirectional Encoder Representations from Transformers, and T5 are pivotal in natural language processing. Their distributed training performance depends heavily on high-speed interconnects. This article characterizes their training performance across various interconnects and communication protocols: TCP/IP, Internet Protocol over InfiniBand (IPoIB), and Remote Direct Memory Access (RDMA), using data and model parallelism. RDMA-100 Gbps outperforms IPoIB-100 Gbps and TCP/IP-10 Gbps, with average gains of 2.5x and 4.8x in data parallelism, while in model parallelism the gains are 1.1x and 1.2x. RDMA achieves the highest interconnect utilization (up to 60 Gbps), compared with up to 20 Gbps for IPoIB and up to 9 Gbps for TCP/IP. Larger models demand increased communication bandwidth, with AllReduce in data parallelism consuming up to 91% of it.
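
The data-parallel communication the abstract measures is the gradient AllReduce issued on every training step. The sketch below is a minimal illustration of that pattern, not the authors' benchmark harness: it assumes PyTorch launched with torchrun, uses the NCCL backend (which transparently uses RDMA over InfiniBand/RoCE when available, or TCP sockets otherwise) or Gloo on CPU-only machines, and synchronizes gradients through DistributedDataParallel.

```python
# Minimal data-parallel training-step sketch (assumed setup, not the
# article's actual benchmark code). Launch with, e.g.:
#   torchrun --nproc_per_node=4 ddp_sketch.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # NCCL uses RDMA when InfiniBand/RoCE is present; exporting
    # NCCL_IB_DISABLE=1 before launch forces a TCP/IP (socket) fallback,
    # which is one way to compare transports for the same workload.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)

    rank = dist.get_rank()
    if torch.cuda.is_available():
        device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
    else:
        device = torch.device("cpu")

    model = torch.nn.Linear(4096, 4096).to(device)
    # DDP registers gradient hooks that AllReduce bucketed gradients
    # across all ranks during backward().
    ddp_model = DDP(model, device_ids=[device.index] if device.type == "cuda" else None)
    opt = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

    for _ in range(10):
        x = torch.randn(32, 4096, device=device)
        loss = ddp_model(x).sum()
        opt.zero_grad()
        loss.backward()  # triggers the AllReduce traffic on the interconnect
        opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```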

Journal Article

Journal
IEEE Micro
Volume
44
Number
02
ISSN
1937-4143
Pages
31-40
DOI
10.1109/MM.2024.3360081
Publisher
IEEE Computer Society
Address
Los Alamitos, CA, USA
Series
IEEE Micro'24
Month
March
