Adam Weingram, Duo Zhang, Zhonghao Chen, Hao Qi, Xiaoyi Lu
Large Reasoning Models (LRMs) are becoming increasingly popular because they offer advanced capabilities in logical inference, mathematical reasoning, and knowledge synthesis that extend beyond those of standard language models. However, their complex training workflows present significant challenges for reproducibility, efficiency, and system-level optimization. This paper introduces HPC-R1, a comprehensive characterization of LRM training on the NERSC Perlmutter supercomputer, capturing workload behavior on a Top500-ranked system. We analyze all major training stages, including supervised fine-tuning (SFT), Group Relative Policy Optimization (GRPO)-based reinforcement learning (RL), autoregressive generation, and distillation, using customized state-of-the-art frameworks. Our detailed performance analysis reveals key system inefficiencies and scaling behaviors. Through this in-depth analysis, we present 19 key observations across all stages: 4 for SFT, 7 for GRPO-based RL, 6 for generation, and 2 for distillation. Based on these findings, we offer several key recommendations to guide future HPC-AI system design.