PADSYS Lab | Publications

NVMe-CR: A Scalable Ephemeral Storage Runtime for Checkpoint/Restart with NVMe-over-Fabrics

Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2021

Shashank Gugnani, Tianxi Li, Xiaoyi Lu

Abstract

Emerging SSDs with NVMe-over-Fabrics (NVMf) support provide new opportunities to significantly improve the performance of IO-intensive HPC applications. However, state-of-the-art parallel filesystems can not extract the best possible performance from fast NVMe SSDs and are not designed for latency-critical ephemeral IO tasks, such as checkpoint/restart. In this paper, we propose a powerful abstraction called microfs to peel away unnecessary software layers and eliminate namespace coordination. Building upon this abstraction, we present the design of NVMe-CR, a scalable ephemeral storage runtime for clusters with disaggregated compute and storage. NVMe-CR proposes techniques like metadata provenance, log record coalescing, and logically isolated shared device access, built around the microfs abstraction, to reduce the overhead of writing millions of concurrent checkpoint files. NVMe-CR utilizes high-density all-flash arrays accessible via NVMf to absorb bursty checkpoint IO and increase the progress rates of HPC applications obliviously. Using the ECP CoMD application as a use case, results show that on a local cluster our runtime can achieve near perfect (> 0.96) efficiency at 448 processes. Moreover, our designs can reduce checkpoint overhead by as much as 2x compared to state-of-the- art storage systems.

Conference Proceedings

Date-added: 2021-02-22 19:42:43 +0000
Date-modified: 2021-02-03 04:02:36 +0000
Booktitle: Proceedings of IEEE International Parallel and Distributed Processing Symposium
Crossref: conf/ipdps/2021
Publisher: IEEE Computer Society
Series: IPDPS '21

Cite

Plain text

BibTeX