My Ph.D. student Haiyang Shi has successfully defended his thesis and graduated. Congratulations, Dr. Shi!
Title: Designing High-Performance Erasure Coding Schemes for Next-Generation Storage Systems
Year and Degree: 2020, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
- Xiaoyi Lu (Advisor)
- Xiaodong Zhang (Committee Member)
- Christopher Stewart (Committee Member)
- Yang Wang (Committee Member)
Replication has been a cornerstone of reliable distributed storage systems for years. Replicating data at multiple locations in the system maintains sufficient redundancy to tolerate individual failures. However, the exploding volume and speed of data growth have led researchers and engineers to reconsider replication and to look to storage-efficient fault-tolerance mechanisms when designing or re-designing reliable distributed storage systems. One promising alternative to replication is Erasure Coding (EC), which trades extra computation for high reliability and availability at a markedly lower storage overhead. Therefore, many existing distributed storage systems (e.g., HDFS 3.x, Ceph, QFS, Google Colossus, Facebook f4, and Baidu Atlas) have started to adopt EC to achieve storage-efficient fault tolerance. However, because EC introduces extra calculations into systems, there are several crucial challenges to think through when exploiting EC, such as how to alleviate its computation overhead and how to bring emerging devices and technologies into the picture for designing high-performance erasure-coded distributed storage systems. On modern HPC clusters and data centers, there are many powerful hardware devices (e.g., CPUs, General-Purpose Graphics Processing Units (GPGPUs), Field-Programmable Gate Arrays (FPGAs), and Smart Network Interface Cards (SmartNICs)) that are capable of carrying out EC computation. To alleviate the EC overhead, various hardware-optimized EC schemes have been proposed in the community to leverage these advanced devices. In this dissertation, we first introduce a unified multi-rail EC library that enables upper-layer applications to leverage heterogeneous EC-capable hardware devices to perform EC operations simultaneously and that introduces asynchronous APIs to facilitate overlapping opportunities between computation and communication.
Our comprehensive explorations illustrate that EC computation is tightly coupled with data transmission (for dispatching EC results). Hence, we think that EC-capable network devices (e.g., SmartNICs) are more promising and will be widely used for EC calculations in designing next-generation erasure-coded distributed storage systems. To take full advantage of the EC offload capability on modern SmartNICs, we propose a tripartite-graph-based EC paradigm, which is able to tackle the limitations of current-generation EC offload schemes, bring more parallelism and overlapping, fully utilize networked resources, and therefore reduce the EC execution time. Moreover, we also present a set of coherent in-network EC primitives that can be easily integrated into existing state-of-the-art EC schemes and utilized in designing advanced EC schemes to fully leverage the coherent in-network EC capabilities of commodity SmartNICs. Using coherent in-network EC primitives can further reduce CPU involvement and DMA traffic and thus speed up erasure-coded distributed storage systems.
To demonstrate the potential performance gains of the proposed designs, we co-design commonly used distributed storage systems (i.e., HDFS and Memcached) with our proposed designs, and thoroughly evaluate the co-designed systems with Hadoop benchmarks and the Yahoo! Cloud Serving Benchmark (YCSB) on small-scale and production-scale HPC clusters. The evaluations illustrate that erasure-coded distributed storage systems enhanced with the proposed designs obtain significant performance improvements.
Keywords: Erasure Coding; Distributed Storage System; RDMA; SmartNIC
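
For readers new to the topic, the trade-off the abstract describes (extra computation in exchange for lower storage overhead) can be illustrated with a toy example. The sketch below is not taken from the dissertation: it uses a single XOR parity chunk, the simplest possible erasure code, which tolerates the loss of only one chunk. Production systems typically use stronger codes such as Reed-Solomon. All function names here (`xor_bytes`, `encode`, `recover`, `storage_overhead`) are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data_chunks):
    """Compute one parity chunk as the XOR of all data chunks."""
    return reduce(xor_bytes, data_chunks)


def recover(chunks, parity):
    """Rebuild the single missing chunk (marked None) from survivors and parity."""
    survivors = [c for c in chunks if c is not None]
    return reduce(xor_bytes, survivors, parity)


def storage_overhead(k: int, m: int) -> float:
    """Extra storage as a fraction of the original data: m redundant units per k data units."""
    return m / k


# Toy data: three equal-size data chunks (assumed sizes, for illustration only).
data = [b"abcd", b"efgh", b"ijkl"]
parity = encode(data)

# Simulate losing the second chunk and rebuilding it.
damaged = [data[0], None, data[2]]
restored = recover(damaged, parity)

# 3-way replication keeps 2 extra copies per block; a (6 data + 3 parity) code keeps 3 extra per 6.
replication_overhead = storage_overhead(1, 2)
ec_overhead = storage_overhead(6, 3)
```

Here `restored` equals the lost chunk, and the overhead figures work out to 200% for 3-way replication versus 50% for a 6+3 code. The price is the encode and recover computation, which is the cost the hardware-offloaded designs in the thesis aim to reduce.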