hZCCL: Accelerating Collective Communication with Co-Designed Homomorphic Compression

Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 2024

Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Zizhe Jian, Xin Liang, Kai Zhao, Xiaoyi Lu, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur

Abstract

As network bandwidth struggles to keep up with rapidly growing computing capabilities, the efficiency of collective communication has become a critical challenge for exa-scale distributed and parallel applications. Traditional approaches directly utilize error-bounded lossy compression to accelerate collective computation operations, exposing unsatisfying performance due to the expensive decompression-operation-compression (DOC) workflow. To address this issue, we present a first-ever homomorphic compression-communication co-design, hZCCL, which enables operations to be performed directly on compressed data, saving the cost of time-consuming decompression and recompression. In addition to the co-design framework, we build a light-weight compressor, optimized specifically for multi-core CPU platforms. We also present a homomorphic compressor with a run-time heuristic to dynamically select efficient compression pipelines for reducing the cost of DOC handling. We evaluate hZCCL with up to 512 nodes and across five application datasets. The experimental results demonstrate that our homomorphic compressor achieves a CPU throughput of up to 379.08 GB/s, surpassing the conventional DOC workflow by up to 36.53×. Moreover, our hZCCL-accelerated collectives outperform two state-of-the-art baselines, delivering speedups of up to 2.12× and 6.77× compared to original MPI collectives in single-thread and multi-thread modes, respectively, while maintaining data accuracy.
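The contrast between the DOC workflow and operating directly on compressed data can be illustrated with a toy example. The sketch below is not hZCCL's compressor: it assumes a plain error-bounded linear quantizer (each value is mapped to an integer code with bin width `2 * eb`), and the helper names `compress`, `decompress`, `doc_sum`, and `homomorphic_sum` are hypothetical. Under that assumption, adding the integer codes gives the same result as decompressing, adding, and recompressing, without touching floating-point data.

```python
# Toy illustration only: a plain linear quantizer, NOT the hZCCL compressor.
# All function names here are hypothetical.

def compress(values, eb):
    """Quantize each value to an integer code so that |v - code * 2*eb| <= eb."""
    return [round(v / (2.0 * eb)) for v in values]

def decompress(codes, eb):
    """Map integer codes back to approximate floating-point values."""
    return [c * (2.0 * eb) for c in codes]

def doc_sum(codes_a, codes_b, eb):
    """Decompression-operation-compression: decompress, add, recompress."""
    a = decompress(codes_a, eb)
    b = decompress(codes_b, eb)
    return compress([x + y for x, y in zip(a, b)], eb)

def homomorphic_sum(codes_a, codes_b):
    """Add directly in the compressed (integer code) domain."""
    return [x + y for x, y in zip(codes_a, codes_b)]

eb = 1e-3
a = [0.1234, -2.5, 3.14159]
b = [1.0, 0.75, -3.0]
codes_a, codes_b = compress(a, eb), compress(b, eb)

direct = homomorphic_sum(codes_a, codes_b)
via_doc = doc_sum(codes_a, codes_b, eb)
approx = decompress(direct, eb)
max_err = max(abs(s - (x + y)) for s, x, y in zip(approx, a, b))
```

In this toy setting the per-element error of the summed result is bounded by `2 * eb`, since each operand contributes at most `eb`. A real compressor adds prediction and encoding stages on top of quantization, which this sketch leaves out.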

Conference Proceedings

ISBN: 9798350352917
Publisher: IEEE Press
DOI: 10.1109/SC41406.2024.00110
Book title: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
Article number: 104
Number of pages: 15
Location: Atlanta, GA, USA
Series: SC '24
