A paper is accepted in IPDPS 2021: NVMe-CR: A Scalable Ephemeral Storage Runtime for Checkpoint/Restart with NVMe-over-Fabrics.
Congratulations, Shashank and Tianxi!
[IPDPS'21] NVMe-CR: A Scalable Ephemeral Storage Runtime for Checkpoint/Restart with NVMe-over-Fabrics
Shashank Gugnani, Tianxi Li, and Xiaoyi Lu.
In Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2021.
Emerging SSDs with NVMe-over-Fabrics (NVMf) support provide new opportunities
to significantly improve the performance of IO-intensive HPC applications.
However, state-of-the-art parallel filesystems cannot extract the best
possible performance from fast NVMe SSDs and are not designed for
latency-critical ephemeral IO tasks, such as checkpoint/restart. In this paper,
we propose a powerful abstraction called microfs to peel away unnecessary
software layers and eliminate namespace coordination. Building upon this
abstraction, we present the design of NVMe-CR, a scalable ephemeral storage
runtime for clusters with disaggregated compute and storage. NVMe-CR proposes
techniques like metadata provenance, log record coalescing, and logically
isolated shared device access, built around the microfs abstraction, to reduce
the overhead of writing millions of concurrent checkpoint files. NVMe-CR
utilizes high-density all-flash arrays accessible via NVMf to absorb bursty
checkpoint IO and increase the progress rates of HPC applications obliviously.
Using the ECP CoMD application as a use case, results show that on a local
cluster our runtime can achieve near perfect (> 0.96) efficiency at 448
processes. Moreover, our designs can reduce checkpoint overhead by as much as
2x compared to state-of-the-art storage systems.
A paper is accepted in VLDB 2021: Understanding the Idiosyncrasies of Real Persistent Memory.
Congratulations, Shashank and Arjun!
[VLDB'21] Understanding the Idiosyncrasies of Real Persistent Memory
Shashank Gugnani, Arjun Kashyap, and Xiaoyi Lu.
In Proceedings of the VLDB Endowment, the 47th International Conference on Very Large Data Bases (VLDB), 2021.
High capacity persistent memory (PMEM) is finally commercially available in the
form of Intel’s Optane DC Persistent Memory Module (DCPMM). Researchers have
raced to evaluate and understand the performance of DCPMM itself as well as
systems and applications designed to leverage PMEM resulting from over a decade
of research. Early evaluations of DCPMM show that its behavior is more nuanced
and idiosyncratic than previously thought. Several assumptions made about its
performance that guided the design of PMEM-enabled systems have been shown to
be incorrect. Unfortunately, several peculiar performance characteristics of
DCPMM are related to the memory technology (3D-XPoint) used and its internal
architecture. It is expected that other technologies (such as STT-RAM,
memristor, ReRAM, NVDIMM), with highly variable characteristics, will be
commercially shipped as PMEM in the near future. Current evaluation studies
fail to understand and categorize the idiosyncratic behavior of PMEM; i.e., how
do the peculiarities of DCPMM relate to other classes of PMEM. Clearly, there
is a need for a study which can guide the design of systems and is agnostic to
PMEM technology and internal architecture.
In this paper, we first list and categorize the idiosyncratic behavior of PMEM
by performing targeted experiments with our proposed PMIdioBench benchmark
suite on a real DCPMM platform. Next, we conduct detailed studies to guide the
design of storage systems, considering generic PMEM characteristics. The first
study guides data placement on NUMA systems with PMEM while the second study
guides the design of lock-free data structures, for both eADR- and ADR-enabled
PMEM systems. Our results are often counter-intuitive and highlight the
challenges of system design with PMEM.
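The difference between ADR and eADR mentioned above can be illustrated with a toy model. This is only a sketch under stated assumptions, not code from the paper or from PMIdioBench: the `ToyPmem` class and the `run` helper are hypothetical names, and the model captures just one generally known property, namely that with ADR the CPU caches lie outside the persistence domain (so a store must be flushed before it is durable), while with eADR the caches are drained on power loss.

```python
class ToyPmem:
    """Toy model of a persistence domain (hypothetical, not a real PMEM API)."""

    def __init__(self, eadr: bool):
        self.eadr = eadr   # True: CPU caches are inside the persistence domain
        self.cache = {}    # stores still sitting in the modelled CPU cache
        self.media = {}    # stores that have reached persistent media

    def store(self, addr, value):
        # A regular store only reaches the cache.
        self.cache[addr] = value

    def flush(self, addr):
        # Models a cache-line write-back followed by a fence.
        if addr in self.cache:
            self.media[addr] = self.cache.pop(addr)

    def crash(self):
        # On power loss, eADR drains the caches; ADR loses their contents.
        if self.eadr:
            self.media.update(self.cache)
        self.cache.clear()
        return dict(self.media)


def run(eadr: bool, flush_after_store: bool):
    """Store one value, optionally flush it, then simulate a power failure."""
    pm = ToyPmem(eadr)
    pm.store("head", 1)
    if flush_after_store:
        pm.flush("head")
    return pm.crash()
```

In this model an unflushed store survives a crash only when `eadr=True`, which is one reason a lock-free data structure may need different flush placement on the two kinds of platform.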
My advised Ph.D. student Haiyang Shi has successfully defended his thesis and graduated. Congratulations, Dr. Shi!
Title: Designing High-Performance Erasure Coding Schemes for Next-Generation Storage Systems
Year and Degree: 2020, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
- Xiaoyi Lu (Advisor)
- Xiaodong Zhang (Committee Member)
- Christopher Stewart (Committee Member)
- Yang Wang (Committee Member)
Replication has been a cornerstone of reliable distributed storage systems for
years. Replicating data at multiple locations in the system maintains
sufficient redundancy to tolerate individual failures. However, the exploding
volume and speed of data growth have led researchers and engineers to consider
storage-efficient fault tolerance mechanisms as replacements for replication when
designing or re-designing reliable distributed storage systems. One promising
alternative to replication is Erasure Coding (EC), which trades extra
computation for high reliability and availability at a notably low
storage overhead. Therefore, many existing distributed storage systems (e.g.,
HDFS 3.x, Ceph, QFS, Google Colossus, Facebook f4, and Baidu Atlas) have
started to adopt EC to achieve storage-efficient fault tolerance. However, as
EC introduces extra calculations into systems, there are several crucial
challenges to think through when exploiting EC, such as how to alleviate its
computation overhead and how to bring emerging devices and technologies into the
picture when designing high-performance erasure-coded distributed storage
systems. On modern HPC clusters and data centers, there are many powerful
hardware devices (e.g., CPUs, General-Purpose Graphics Processing Units
(GPGPUs), Field-Programmable Gate Arrays (FPGAs), and Smart Network Interface
Cards (SmartNICs)) that are capable of carrying out EC computation. To alleviate
the EC overhead, various hardware-optimized EC schemes have been proposed in
the community to leverage the advanced devices. In this dissertation, we first
introduce a unified multi-rail EC library that enables upper-layer applications
to leverage heterogeneous EC-capable hardware devices to perform EC operations
simultaneously and introduces asynchronous APIs to facilitate overlapping
opportunities between computation and communication.
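The trade-off between computation and storage overhead described above can be made concrete with a small sketch. It is illustrative only and is not the multi-rail EC library from the dissertation: it uses a single XOR parity block, the simplest (k, 1) erasure code, rather than the Reed-Solomon style codes typically deployed in practice, and the function names and the (6, 3) parameters are assumptions chosen for the example.

```python
from functools import reduce


def replication_overhead(copies: int) -> float:
    """Extra storage per unit of data when keeping `copies` full replicas."""
    return float(copies - 1)


def ec_overhead(k: int, m: int) -> float:
    """Extra storage per unit of data with k data blocks and m parity blocks."""
    return m / k


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Compute one parity block; this is the 'extra computation' of EC."""
    return xor_blocks(data_blocks)


def recover(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])
```

For example, three-way replication gives `replication_overhead(3) == 2.0` (200% extra storage), while a (6, 3) code gives `ec_overhead(6, 3) == 0.5` (50% extra storage) at the cost of encoding on every write and decoding on degraded reads.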
Our comprehensive explorations illustrate that EC computation is in tight
conjunction with data transmission (for dispatching EC results). Hence, we
think that the EC-capable network devices (e.g., SmartNICs) are more promising
and will be widely used for EC calculations in designing next-generation
erasure-coded distributed storage systems. Towards fully taking advantage of
the EC offload capability on modern SmartNICs, we propose a tripartite graph
based EC paradigm, which is able to tackle the limitations of
current-generation EC offload schemes, bring more parallelism and overlapping,
fully utilize networked resources, and therefore reduce the EC execution time.
Moreover, we also present a set of coherent in-network EC primitives that can
be easily integrated into existing state-of-the-art EC schemes and utilized in
designing advanced EC schemes to fully leverage the advantages of the coherent
in-network EC capabilities on commodity SmartNICs. Using coherent in-network EC
primitives can further reduce CPU involvement and DMA traffic and thus speed
up erasure-coded distributed storage systems.
To demonstrate the potential performance gains of the proposed designs, we
co-design commonly-used distributed storage systems (i.e., HDFS and Memcached)
with our proposed designs, and thoroughly evaluate the co-designed systems with
Hadoop benchmarks and Yahoo! Cloud Serving Benchmark (YCSB) on small-scale and
production-scale HPC clusters. The evaluations illustrate that erasure-coded
distributed storage systems enhanced with the proposed designs obtain
significant performance improvements.
Erasure Coding; Distributed Storage System; RDMA; SmartNIC;
A tutorial is accepted in IISWC 2020: Benchmarking and Accelerating Big Data
Systems with RDMA, PMEM, and NVMe-SSD.
Congratulations to Haiyang and Shashank!
[IISWC'20] Benchmarking and Accelerating Big Data Systems with RDMA, PMEM, and NVMe-SSD
Xiaoyi Lu, Haiyang Shi, and Shashank Gugnani.
2020 IEEE International Symposium on Workload Characterization (IISWC), 2020.
The convergence of HPC, Big Data, and Deep Learning is becoming the next
game-changing opportunity. Modern HPC systems and Cloud Computing platforms
have been fueled with the advances in multi-/many-core architectures, Remote
Direct Memory Access (RDMA) enabled high-speed networks, persistent memory
(PMEM), and NVMe-SSDs. However, many Big Data systems and libraries (such as
Hadoop, Spark, Flink, Memcached) have not embraced such technologies fully.
Recent studies have shown that default designs of these components cannot
efficiently leverage the advanced features on modern clusters with RDMA, PMEM,
and NVMe-SSD. In this tutorial, we will provide an in-depth overview of the
architectures, programming models, features, and performance characteristics of
RDMA networks, PMEM, and NVMe-SSD. We will examine the challenges in
re-/co-designing communication and I/O components of Big Data systems and
libraries with these emerging technologies. We will provide benchmark-level
studies and system-level (like Hadoop/Spark/TensorFlow/Memcached) case studies
to discuss how to efficiently use these new technologies for real applications.
Two papers are accepted in SC 2020: "INEC: Fast and
Coherent In-Network Erasure Coding" and "RDMP-KV: Designing Remote Direct
Memory Persistence-based Key-Value Stores with PMEM".
Congratulations, Haiyang, Tianxi, Shashank, and Dr. Shankar!
[SC'20] INEC: Fast and Coherent In-Network Erasure Coding
Haiyang Shi and Xiaoyi Lu.
In Proceedings of the 33rd International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2020. (Acceptance Rate: 22.3%)
Erasure coding (EC) is a promising fault tolerance scheme that has been applied
to many well-known distributed storage systems. The capability of Coherent EC
Calculation and Networking on modern SmartNICs has demonstrated that EC will be
an essential feature of in-network computing. In this paper, we propose a set
of coherent in-network EC primitives, named INEC. Our analyses based on the
proposed $\alpha$-$\beta$ performance model demonstrate that INEC primitives can enable
different kinds of EC schemes to fully leverage the EC offload capability on
modern SmartNICs. We implement INEC on commodity RDMA NICs and integrate it
into five state-of-the-art EC schemes. Our experiments show that INEC
primitives significantly reduce 50th, 95th, and 99th percentile latencies,
and accelerate the end-to-end throughput, write, and degraded read performance
of the key-value store co-designed with INEC by up to 99.57%, 47.30%, and
[SC'20] RDMP-KV: Designing Remote Direct Memory Persistence-based Key-Value Stores with PMEM
Tianxi Li*, Dipti Shankar*, Shashank Gugnani, and Xiaoyi Lu.
In Proceedings of the 33rd International Conference for High Performance
Computing, Networking, Storage and Analysis (SC), 2020. (Acceptance Rate:
22.3%, *Co-First Authors)
Byte-addressable persistent memory (PMEM) can be directly manipulated by Remote
Direct Memory Access (RDMA) capable networks. However, existing studies to
combine RDMA and PMEM cannot deliver the desired performance due to their
PMEM-oblivious communication protocols. In this paper, we propose novel
PMEM-aware RDMA-based communication protocols for persistent key-value stores,
referred to as Remote Direct Memory Persistence based Key-Value stores
(RDMP-KV). RDMP-KV employs a hybrid ‘server-reply/server-bypass’ approach to
‘durably’ store individual key-value objects on PMEM-equipped servers.
RDMP-KV’s runtime can easily adapt to existing (server-assisted durability) and
emerging (appliance durability) RDMA-capable interconnects, while ensuring
server scalability through a lightweight consistency scheme. Performance
evaluations show that RDMP-KV can improve the server-side performance with
different persistent key-value storage architectures by up to 22x, as compared
with PMEM-oblivious RDMA-‘Server-Reply’ protocols. Our evaluations also show
that RDMP-KV outperforms a distributed PMEM-based filesystem by up to 65% and a
recent RDMA-to-PMEM framework by up to 71%.