Versatile Datapath Soft Error Detection on the Cheap for HPC Applications

Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 2024

Yafan Huang, Sheng Di, Zhaorui Zhang, Xiaoyi Lu, Guanpeng Li

Abstract

With the ongoing reduction in technology sizes and voltage levels, modern microprocessors are increasingly susceptible to soft errors that corrupt datapath units during program execution. Although these errors have received considerable attention recently, existing solutions either cover only a limited scope or incur massive overheads in performance and power consumption, which hinders practical use. In this work, we propose CONDA, a novel error detection technique based on code transformation and static program analysis that achieves versatile datapath protection at low cost. At compile time, CONDA analyzes program characteristics and transforms the original program code without complicating its control flow or memory access patterns. At runtime, CONDA detects datapath errors with low overhead and latency. An evaluation on 38 benchmarks and a parallel HPC simulation reveals that CONDA incurs only 57.79

Conference Proceedings

ISBN: 9798350352917
Publisher: IEEE Press
DOI: 10.1109/SC41406.2024.00061
Booktitle: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
Article No.: 55
Number of pages: 15
Location: Atlanta, GA, USA
Series: SC '24
