# NVMe-CR paper accepted at IPDPS 2021!

A paper has been accepted at IPDPS 2021: NVMe-CR: A Scalable Ephemeral Storage Runtime for Checkpoint/Restart with NVMe-over-Fabrics.

Congratulations, Shashank and Tianxi!

## Paper Info

Shashank Gugnani, Tianxi Li, and Xiaoyi Lu.

In Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2021.

Abstract

Emerging SSDs with NVMe-over-Fabrics (NVMf) support provide new opportunities to significantly improve the performance of IO-intensive HPC applications. However, state-of-the-art parallel filesystems cannot extract the best possible performance from fast NVMe SSDs and are not designed for latency-critical ephemeral IO tasks, such as checkpoint/restart. In this paper, we propose a powerful abstraction called microfs to peel away unnecessary software layers and eliminate namespace coordination. Building upon this abstraction, we present the design of NVMe-CR, a scalable ephemeral storage runtime for clusters with disaggregated compute and storage. NVMe-CR proposes techniques like metadata provenance, log record coalescing, and logically isolated shared device access, built around the microfs abstraction, to reduce the overhead of writing millions of concurrent checkpoint files. NVMe-CR utilizes high-density all-flash arrays accessible via NVMf to absorb bursty checkpoint IO and increase the progress rates of HPC applications obliviously. Using the ECP CoMD application as a use case, results show that on a local cluster our runtime can achieve near-perfect (> 0.96) efficiency at 448 processes. Moreover, our designs can reduce checkpoint overhead by as much as 2x compared to state-of-the-art storage systems.

# A paper accepted at VLDB 2021!

A paper has been accepted at VLDB 2021: Understanding the Idiosyncrasies of Real Persistent Memory.

Congratulations, Shashank and Arjun!

## Paper Info

[VLDB'21] Understanding the Idiosyncrasies of Real Persistent Memory

Shashank Gugnani, Arjun Kashyap, and Xiaoyi Lu.

In Proceedings of the VLDB Endowment, the 47th International Conference on Very Large Data Bases (VLDB) 2021.

Abstract

High-capacity persistent memory (PMEM) is finally commercially available in the form of Intel’s Optane DC Persistent Memory Module (DCPMM). Researchers have raced to evaluate and understand the performance of DCPMM itself as well as systems and applications designed to leverage PMEM resulting from over a decade of research. Early evaluations of DCPMM show that its behavior is more nuanced and idiosyncratic than previously thought. Several assumptions made about its performance that guided the design of PMEM-enabled systems have been shown to be incorrect. Unfortunately, several peculiar performance characteristics of DCPMM are related to the memory technology (3D-XPoint) used and its internal architecture. It is expected that other technologies (such as STT-RAM, memristor, ReRAM, NVDIMM), with highly variable characteristics, will be commercially shipped as PMEM in the near future. Current evaluation studies fail to understand and categorize the idiosyncratic behavior of PMEM; i.e., how the peculiarities of DCPMM relate to other classes of PMEM. Clearly, there is a need for a study which can guide the design of systems and is agnostic to PMEM technology and internal architecture.

In this paper, we first list and categorize the idiosyncratic behavior of PMEM by performing targeted experiments with our proposed PMIdioBench benchmark suite on a real DCPMM platform. Next, we conduct detailed studies to guide the design of storage systems, considering generic PMEM characteristics. The first study guides data placement on NUMA systems with PMEM while the second study guides the design of lock-free data structures, for both eADR- and ADR-enabled PMEM systems. Our results are often counter-intuitive and highlight the challenges of system design with PMEM.

# Haiyang earned his Ph.D. degree! Congrats!

My advised Ph.D. student Haiyang Shi has successfully defended his thesis and graduated. Congratulations, Dr. Shi!

## Thesis Info

Title: Designing High-Performance Erasure Coding Schemes for Next-Generation Storage Systems

Year and Degree: 2020, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.

Committee

• Xiaodong Zhang (Committee Member)
• Christopher Stewart (Committee Member)
• Yang Wang (Committee Member)

Abstract

Replication has been a cornerstone of reliable distributed storage systems for years. Replicating data at multiple locations in the system maintains sufficient redundancy to tolerate individual failures. However, the exploding volume and speed of data growth have led researchers and engineers to rethink fault tolerance and consider storage-efficient mechanisms in place of replication when designing or re-designing reliable distributed storage systems. One promising alternative to replication is Erasure Coding (EC), which trades extra computation for high reliability and availability at a markedly low storage overhead. Therefore, many existing distributed storage systems (e.g., HDFS 3.x, Ceph, QFS, Google Colossus, Facebook f4, and Baidu Atlas) have started to adopt EC to achieve storage-efficient fault tolerance. However, because EC introduces extra calculations into systems, exploiting it raises several crucial challenges, such as how to alleviate its computation overhead and how to bring emerging devices and technologies into the picture when designing high-performance erasure-coded distributed storage systems. On modern HPC clusters and data centers, there are many powerful hardware devices (e.g., CPUs, General-Purpose Graphics Processing Units (GPGPUs), Field-Programmable Gate Arrays (FPGAs), and Smart Network Interface Cards (SmartNICs)) that are capable of carrying out EC computation. To alleviate the EC overhead, various hardware-optimized EC schemes have been proposed in the community to leverage these advanced devices. In this dissertation, we first introduce a unified multi-rail EC library that enables upper-layer applications to leverage heterogeneous EC-capable hardware devices to perform EC operations simultaneously and introduces asynchronous APIs to facilitate overlapping opportunities between computation and communication.

Our comprehensive explorations illustrate that EC computation is tightly coupled with data transmission (for dispatching EC results). Hence, we think that EC-capable network devices (e.g., SmartNICs) are more promising and will be widely used for EC calculations in designing next-generation erasure-coded distributed storage systems. To take full advantage of the EC offload capability on modern SmartNICs, we propose a tripartite-graph-based EC paradigm, which is able to tackle the limitations of current-generation EC offload schemes, bring more parallelism and overlapping, fully utilize networked resources, and therefore reduce the EC execution time. Moreover, we also present a set of coherent in-network EC primitives that can be easily integrated into existing state-of-the-art EC schemes and utilized in designing advanced EC schemes to fully leverage the advantages of the coherent in-network EC capabilities on commodity SmartNICs. Using coherent in-network EC primitives can further reduce CPU involvement and DMA traffic and thus speed up erasure-coded distributed storage systems.

To demonstrate the potential performance gains of the proposed designs, we co-design commonly used distributed storage systems (i.e., HDFS and Memcached) with our proposed designs, and thoroughly evaluate the co-designed systems with Hadoop benchmarks and the Yahoo! Cloud Serving Benchmark (YCSB) on small-scale and production-scale HPC clusters. The evaluations illustrate that erasure-coded distributed storage systems enhanced with the proposed designs obtain significant performance improvements.

Keywords

Erasure Coding; Distributed Storage System; RDMA; SmartNIC
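
For readers new to erasure coding, the trade-off described in the abstract (extra computation in exchange for lower storage overhead than replication) can be illustrated with a minimal sketch. This is not the thesis's library or any of its schemes: it uses the simplest possible code, a single XOR parity chunk, and the function names (`xor_chunks`, `encode`, `recover`, `storage_overhead`) are hypothetical, chosen only for illustration. Production systems typically use Reed-Solomon codes, which tolerate more than one lost chunk.

```python
from functools import reduce


def xor_chunks(chunks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def encode(data_chunks):
    """Compute one parity chunk over k data chunks (tolerates one loss)."""
    return xor_chunks(data_chunks)


def recover(surviving_chunks, parity):
    """Rebuild the single missing data chunk from the survivors and the parity."""
    return xor_chunks(surviving_chunks + [parity])


def storage_overhead(k, m):
    """Raw bytes stored per byte of user data with k data and m redundant chunks."""
    return (k + m) / k


# Illustrative data only: three 4-byte chunks, one of which is then "lost".
data = [b"abcd", b"efgh", b"ijkl"]
parity = encode(data)
rebuilt = recover([data[0], data[2]], parity)

print(rebuilt == data[1])       # the lost chunk is reconstructed
print(storage_overhead(1, 2))   # 3-way replication: 3.0x
print(storage_overhead(6, 3))   # a (6, 3) code such as RS(6,3): 1.5x
```

The overhead figures show why erasure coding is attractive: three-way replication stores 3 bytes per user byte, while a code with 6 data and 3 parity chunks stores 1.5. The cost is the encode and recover computation, which is the overhead the thesis aims to offload to hardware such as SmartNICs.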

# A tutorial accepted at IISWC 2020!

A tutorial has been accepted at IISWC 2020: Benchmarking and Accelerating Big Data Systems with RDMA, PMEM, and NVMe-SSD.

Congratulations to Haiyang and Shashank!

## Tutorial Info

[IISWC'20] Benchmarking and Accelerating Big Data Systems with RDMA, PMEM, and NVMe-SSD

Xiaoyi Lu, Haiyang Shi, and Shashank Gugnani.

2020 IEEE International Symposium on Workload Characterization (IISWC), 2020.

Abstract

The convergence of HPC, Big Data, and Deep Learning is becoming the next game-changing opportunity. Modern HPC systems and Cloud Computing platforms have been fueled by advances in multi-/many-core architectures, Remote Direct Memory Access (RDMA) enabled high-speed networks, persistent memory (PMEM), and NVMe-SSDs. However, many Big Data systems and libraries (such as Hadoop, Spark, Flink, Memcached) have not embraced such technologies fully. Recent studies have shown that default designs of these components cannot efficiently leverage the advanced features on modern clusters with RDMA, PMEM, and NVMe-SSD. In this tutorial, we will provide an in-depth overview of the architectures, programming models, features, and performance characteristics of RDMA networks, PMEM, and NVMe-SSD. We will examine the challenges in re-/co-designing communication and I/O components of Big Data systems and libraries with these emerging technologies. We will provide benchmark-level studies and system-level (like Hadoop/Spark/TensorFlow/Memcached) case studies to discuss how to efficiently use these new technologies for real applications.

# Two papers accepted at SC 2020!

Two papers have been accepted at SC 2020: "INEC: Fast and Coherent In-Network Erasure Coding" and "RDMP-KV: Designing Remote Direct Memory Persistence-based Key-Value Stores with PMEM".

Congratulations to Haiyang, Tianxi, Shashank, and Dr. Shankar!

## Paper Info

[SC'20] INEC: Fast and Coherent In-Network Erasure Coding

Haiyang Shi and Xiaoyi Lu.

In Proceedings of the 33rd International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2020. (Acceptance Rate: 22.3%)

Abstract

Erasure coding (EC) is a promising fault tolerance scheme that has been applied to many well-known distributed storage systems. The capability of Coherent EC Calculation and Networking on modern SmartNICs has demonstrated that EC will be an essential feature of in-network computing. In this paper, we propose a set of coherent in-network EC primitives, named INEC. Our analyses based on the proposed $\alpha$-$\beta$ performance model demonstrate that INEC primitives can enable different kinds of EC schemes to fully leverage the EC offload capability on modern SmartNICs. We implement INEC on commodity RDMA NICs and integrate it into five state-of-the-art EC schemes. Our experiments show that INEC primitives significantly reduce 50th, 95th, and 99th percentile latencies, and accelerate the end-to-end throughput, write, and degraded read performance of the key-value store co-designed with INEC by up to 99.57%, 47.30%, and 49.55%, respectively.

[SC'20] RDMP-KV: Designing Remote Direct Memory Persistence-based Key-Value Stores with PMEM

Tianxi Li*, Dipti Shankar*, Shashank Gugnani, and Xiaoyi Lu.

In Proceedings of the 33rd International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2020. (Acceptance Rate: 22.3%, *Co-First Authors)

Abstract

Byte-addressable persistent memory (PMEM) can be directly manipulated by Remote Direct Memory Access (RDMA) capable networks. However, existing studies to combine RDMA and PMEM cannot deliver the desired performance due to their PMEM-oblivious communication protocols. In this paper, we propose novel PMEM-aware RDMA-based communication protocols for persistent key-value stores, referred to as Remote Direct Memory Persistence based Key-Value stores (RDMP-KV). RDMP-KV employs a hybrid ‘server-reply/server-bypass’ approach to ‘durably’ store individual key-value objects on PMEM-equipped servers. RDMP-KV’s runtime can easily adapt to existing (server-assisted durability) and emerging (appliance durability) RDMA-capable interconnects, while ensuring server scalability through a lightweight consistency scheme. Performance evaluations show that RDMP-KV can improve the server-side performance with different persistent key-value storage architectures by up to 22x, as compared with PMEM-oblivious RDMA-‘Server-Reply’ protocols. Our evaluations also show that RDMP-KV outperforms a distributed PMEM-based filesystem by up to 65% and a recent RDMA-to-PMEM framework by up to 71%.