Xiaoyi will serve as a TPC Vice-Chair for IEEE Cloud Summit 2020.
Please submit your papers to IEEE Cloud Summit 2020. Call for Papers
Xiaoyi will serve as a TPC Co-Chair for The 6th IEEE International Workshop on High-Performance Big Data and Cloud Computing (HPBDC 2020).
Please submit your papers to HPBDC 2020. Call for Papers
Xiaoyi will serve as a TPC member for the following conferences in 2020!
A paper has been accepted at IISWC 2019: SimdHT-Bench: Characterizing SIMD-Aware Hash
Table Designs on Emerging CPU Architectures. The paper was also nominated as a
Best Paper Award Candidate.
Congratulations, Dr. Shankar!
Paper Info
[IISWC'19] SimdHT-Bench: Characterizing SIMD-Aware Hash Table Designs on Emerging CPU Architectures (Best Paper Award Nomination)
Dipti Shankar, Xiaoyi Lu, and Dhabaleswar K. Panda.
In Proceedings of 2019 IEEE International Symposium on Workload Characterization (IISWC), 2019.
Abstract
With the emergence of modern multi-core CPU architectures that support data
parallelism via vectorization, several storage systems have been employing
SIMD-based techniques to optimize data-parallel operations on in-memory
structures like hash-tables. In this paper, we perform an in-depth
characterization of the opportunities for incorporating AVX vectorization-based
SIMD-aware designs for hash table lookups on emerging CPU architectures. We
analyze the challenges and design dimensions involved in exploiting
vectorization-based parallel key searching over cache-optimized non-SIMD hash
tables. Based on this, we design a comprehensive micro-benchmark suite,
SimdHT-Bench, that enables evaluating the performance and applicability of CPU
SIMD-aware hash table designs for accelerating different read-intensive
workloads. With SimdHT-Bench, we study five different use-case scenarios with
varied workload patterns, on the latest Intel Skylake and Intel Cascade Lake
multi-core CPU nodes. Further, to validate the applicability of SimdHT-Bench,
we employ these performance studies to design a high-performance SIMD-aware
RDMA-based in-memory key-value store to accelerate the Memcached ‘Multi-Get’
workload. We demonstrate that the SIMD-integrated designs can achieve up to
1.45x–2.04x improvement in server-side Get throughput and up to 34% improvement
in end-to-end Multi-Get latencies over the state-of-the-art CPU-optimized
non-SIMD MemC3 hash table design, on a high-performance compute cluster with
Intel Skylake processors and InfiniBand EDR interconnects.
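To make the idea of vectorized key searching concrete, below is a minimal sketch in C with AVX2 intrinsics (compile with -mavx2) that compares one probe key against all eight key slots of a bucket in a single vector compare. It is only an illustration of the general technique and is not code from SimdHT-Bench or the paper's designs.

```c
/* Illustrative only: a scalar lookup would scan the eight slots one by one;
 * the vectorized version replaces that loop with one compare plus a movemask. */
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

#define BUCKET_SLOTS 8

typedef struct {
    uint32_t keys[BUCKET_SLOTS];   /* eight 32-bit key tags per bucket */
    uint32_t values[BUCKET_SLOTS]; /* payloads (or pointers/offsets)   */
} bucket_t;

/* Return the value stored for `key`, or -1 if the key is not in the bucket. */
static int bucket_lookup_simd(const bucket_t *b, uint32_t key)
{
    __m256i probe = _mm256_set1_epi32((int)key);                /* broadcast probe key */
    __m256i cand  = _mm256_loadu_si256((const __m256i *)b->keys);
    __m256i eq    = _mm256_cmpeq_epi32(probe, cand);            /* lane-wise equality  */
    int mask = _mm256_movemask_ps(_mm256_castsi256_ps(eq));     /* one bit per lane    */
    if (mask == 0)
        return -1;                                              /* no slot matched     */
    return (int)b->values[__builtin_ctz(mask)];                 /* first matching slot */
}

int main(void)
{
    bucket_t b = { .keys   = { 11, 22, 33, 44, 55, 66, 77, 88 },
                   .values = {  1,  2,  3,  4,  5,  6,  7,  8 } };
    printf("lookup(44) -> %d\n", bucket_lookup_simd(&b, 44));   /* 4  */
    printf("lookup(99) -> %d\n", bucket_lookup_simd(&b, 99));   /* -1 */
    return 0;
}
```

Trading a per-slot scan for a single compare-and-movemask is exactly the kind of design point whose costs and benefits SimdHT-Bench is built to characterize across workloads and CPU generations.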
My co-advised Ph.D. student Dipti Shankar has successfully defended her thesis and graduated. Congratulations, Dr. Shankar!
Thesis Info
Title: Designing Fast, Resilient and Heterogeneity-Aware Key-Value Storage on Modern HPC Clusters
Year and Degree: 2019, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Committee
- Dhabaleswar K. Panda (Advisor)
- Xiaoyi Lu (Co-Advisor)
- Feng Qin (Committee Member)
- Gagan Agrawal (Committee Member)
Abstract
With the recent emergence of in-memory computing for Big Data analytics,
memory-centric and distributed key-value storage has become vital to
accelerating data processing workloads in high-performance computing (HPC) and
data center environments. This has led to several research works focusing on
advanced key-value store designs with Remote Direct Memory Access (RDMA) and
hybrid ‘DRAM+NVM’ storage designs. However, these existing designs are
constrained by blocking store/retrieve semantics, incurring additional
complexity with the introduction of high data availability and durability
requirements. To cater to the performance, scalability, durability and
resilience needs of the diverse key-value store-based workloads (e.g., online
transaction processing, offline data analytics, etc.), it is therefore vital to
fully exploit resources on modern HPC systems. Moreover, to maximize server
scalability and end-to-end performance, it is necessary to focus on designing
an RDMA-aware communication engine that goes beyond optimizing the key-value
store middleware for better client-side latencies.
Towards addressing this, in this dissertation, we present a ‘holistic approach’
to designing high-performance, resilient and heterogeneity-aware key-value
storage for HPC clusters, that encompasses: (1) RDMA-enabled networking, (2)
high-speed NVMs, (3) emerging byte-addressable persistent memory devices, and,
(4) SIMD-enabled multi-core CPU compute capabilities. We first introduce
non-blocking API extensions to the RDMA-Memcached client, which allow an
application to separate the request issue and completion phases. This
facilitates overlapping opportunities by truly leveraging the one-sided
characteristics of the underlying RDMA communication engine, while conforming
to the basic Set/Get semantics. Secondly, we analyze the overhead of employing
memory-efficient resilience via Erasure Coding (EC), in an online fashion.
Based on this, we extend our proposed RDMA-aware key-value store, which supports
non-blocking API semantics, to enable overlapping the EC encoding/decoding
compute phases with the scatter/gather communication protocol involved in
resiliently storing the distributed key-value data objects.
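As a rough illustration of the ‘separate the request issue and completion phases’ idea described above, the sketch below uses hypothetical kv_iset/kv_wait calls and a toy in-memory backend; the real RDMA-Memcached extensions and their semantics differ, but the overlap pattern (issue a request, do other work such as EC encoding, then complete it) is the point.

```c
/* Hypothetical non-blocking Set sketch (names and backend are illustrative,
 * not the actual RDMA-Memcached interface). */
#include <stdio.h>

typedef struct {
    const char *key;
    const char *val;
    int         done;          /* completion flag the caller waits on */
} kv_request_t;

static const char *store_key, *store_val;   /* toy single-slot "server" */

/* Issue phase: record the request and return immediately. */
static kv_request_t kv_iset(const char *key, const char *val)
{
    kv_request_t r = { key, val, 0 };
    return r;
}

/* Completion phase: a real client would poll an RDMA completion queue here. */
static void kv_wait(kv_request_t *r)
{
    store_key = r->key;
    store_val = r->val;
    r->done   = 1;
}

int main(void)
{
    kv_request_t r = kv_iset("user:42", "hello");  /* issue                        */
    /* ... overlap other work here: encode the next object, post more requests ... */
    kv_wait(&r);                                   /* complete                     */
    printf("%s -> %s (done=%d)\n", store_key, store_val, r.done);
    return 0;
}
```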
This work also examines durable key-value store designs for emerging persistent
memory technologies. While RDMA-based protocols employed in existing volatile
DRAM-based key-value stores can be directly leveraged, we find that there is a
need for a more integrated approach to fully exploit the fine-grained
durability of these new byte-addressable storage devices. We propose ‘RDMP-KV’,
which employs a hybrid ‘server-reply/server-bypass’ approach to ‘durably’ store
individual key-value pair objects on the remote persistent memory-equipped
servers via RDMA. RDMP-KV’s runtime can easily adapt to existing
(server-assisted durability) and emerging (appliance durability) RDMA-capable
interconnects, while ensuring server scalability and remote data consistency.
Finally, the thesis explores SIMD-accelerated CPU-centric hash table designs
that can enable higher server throughput. We propose an end-to-end SIMD-aware
key-value store design, ‘SCOR-KV’, which introduces optimistic
‘RDMA+SIMD’-aware client-centric request/response offloading protocols. SCOR-KV
can minimize the server-side data processing overheads to achieve better
scalability, without compromising on the client-side latencies.
With this as the basis, we demonstrate the potential performance gains of the
proposed designs with online (e.g., YCSB) and offline (e.g., in-memory and
distributed burst-buffer over Lustre for Hadoop I/O) workloads on small-scale
and production-scale HPC clusters.
Keywords
High-Performance Computing; Key-Value Store; RDMA; Persistent Memory;
source
A paper has been accepted at SC 2019: TriEC: Tripartite Graph Based Erasure Coding
NIC Offload. This year, SC accepted only 72 papers directly and asked for major
revisions on 15 more, out of 344 initial submissions. This paper was accepted
directly and was nominated as a Best Student Paper (BSP) Finalist.
Congratulations to my student, Haiyang!
Paper Info
[SC'19] TriEC: Tripartite Graph Based Erasure Coding NIC Offload (Best Student Paper Finalist)
Haiyang Shi and Xiaoyi Lu.
In Proceedings of the 32nd International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2019. (Acceptance Rate: 22.7%, 78/344)
Abstract
Erasure Coding (EC) NIC offload is a promising technology for designing
next-generation distributed storage systems. However, this paper has identified
three major limitations of current-generation EC NIC offload schemes on modern
SmartNICs. Thus, this paper proposes a new EC NIC offload paradigm based on the
tripartite graph model, namely TriEC. TriEC supports both encode-and-send and
receive-and-decode operations efficiently. Through theorem-based proofs,
co-designs with memcached (i.e., TriEC-Cache), and extensive experiments, we
show that TriEC is correct and can deliver better performance than the
state-of-the-art EC NIC offload schemes (i.e., BiEC). Benchmark evaluations
demonstrate that TriEC outperforms BiEC by up to 1.82x and 2.33x for encoding
and recovering, respectively. With extended YCSB workloads, TriEC reduces the
average write latency by up to 23.2% and the recovery time by up to 37.8%.
TriEC outperforms BiEC by 1.32x for a full-node recovery with 8 million
records.
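For readers less familiar with erasure coding, the sketch below shows the simplest possible code, a single XOR parity block over k data blocks, just to ground the encode/recover terminology used in the abstract. TriEC's actual contribution, modeling encode-and-send and receive-and-decode operations so they can be offloaded onto SmartNICs via a tripartite graph, is not captured by this host-side toy.

```c
/* Background sketch only: RAID-5-style single XOR parity over K data blocks. */
#include <stdio.h>
#include <string.h>

#define K        3      /* data blocks per stripe */
#define BLK_LEN  8      /* bytes per block        */

/* Encode: parity = d0 XOR d1 XOR ... XOR d(K-1). */
static void ec_encode(unsigned char data[K][BLK_LEN], unsigned char parity[BLK_LEN])
{
    memset(parity, 0, BLK_LEN);
    for (int i = 0; i < K; i++)
        for (int j = 0; j < BLK_LEN; j++)
            parity[j] ^= data[i][j];
}

/* Recover one lost data block by XOR-ing the parity with the surviving blocks. */
static void ec_recover(unsigned char data[K][BLK_LEN],
                       const unsigned char parity[BLK_LEN], int lost)
{
    memcpy(data[lost], parity, BLK_LEN);
    for (int i = 0; i < K; i++)
        if (i != lost)
            for (int j = 0; j < BLK_LEN; j++)
                data[lost][j] ^= data[i][j];
}

int main(void)
{
    unsigned char data[K][BLK_LEN] = { "block-0", "block-1", "block-2" };
    unsigned char parity[BLK_LEN];

    ec_encode(data, parity);
    memset(data[1], 0, BLK_LEN);                 /* lose block 1            */
    ec_recover(data, parity, 1);                 /* rebuild it from parity  */
    printf("recovered: %s\n", (char *)data[1]);  /* prints "block-1"        */
    return 0;
}
```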
A paper has been accepted at HPDC 2019: UMR-EC: A Unified and Multi-Rail Erasure Coding Library for High-Performance Distributed Storage Systems. This year, HPDC accepted only 22 papers out of 106 submissions; 11 of the accepted papers went through shepherding, while this paper was accepted directly. The first author is one of my Ph.D. students, Haiyang Shi. Congratulations to Haiyang and the other co-authors!
Paper Info
[HPDC'19] UMR-EC: A Unified and Multi-Rail Erasure Coding Library for High-Performance Distributed Storage Systems
Haiyang Shi, Xiaoyi Lu, Dipti Shankar, and Dhabaleswar K. Panda.
In Proceedings of the 28th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2019. (Acceptance Rate: 20.7%, 22/106)
Abstract
Distributed storage systems typically need data to be stored redundantly to
guarantee data durability and reliability. While the conventional approach
towards this objective is to store multiple replicas, today’s unprecedented
data growth rates encourage modern distributed storage systems to employ
Erasure Coding (EC) techniques, which can achieve better storage efficiency.
Various hardware-based EC schemes have been proposed in the community to
leverage the advanced compute capabilities of modern data center and cloud
environments. Currently, there is no unified and easy way for distributed
storage systems to fully exploit multiple devices such as CPUs, GPUs, and
network devices (i.e., multi-rail support) to perform EC operations in
parallel, which leads to the under-utilization of the available compute
power. In this paper, we first introduce an analytical model to analyze the
design scope of efficient EC schemes in distributed storage systems. Guided by
the performance model, we propose UMR-EC, a Unified and Multi-Rail Erasure
Coding library that can fully exploit heterogeneous EC coders. Our proposed
interface is complemented by asynchronous semantics, an optimized
metadata-free scheme, and EC rate-aware task scheduling that can enable a
highly-efficient I/O pipeline. To show the benefits and effectiveness of
UMR-EC, we re-design HDFS 3.x write/read pipelines based on the guidelines
observed in the proposed performance model. Our performance evaluations show
that our proposed designs can outperform the write performance of replication
schemes and the default HDFS EC coder by 3.7x–6.1x and 2.4x–3.3x,
respectively, and can improve the performance of read with failure recoveries
up to 5.1x compared with the default HDFS EC coder. Compared with the fastest
available CPU coder (i.e., ISA-L), our proposed designs have an improvement of
up to 66.0% and 19.4% for write and read with failure recoveries, respectively.
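As a loose illustration of the multi-rail idea, the sketch below registers several coder ‘rails’ behind one encode entry point and dispatches stripes across them round-robin. The names and structure are hypothetical and far simpler than UMR-EC's actual interface; real rails would wrap backends such as a CPU coder (e.g., ISA-L), a GPU coder, or an EC-offload-capable NIC, and scheduling would be asynchronous and EC rate-aware rather than round-robin.

```c
/* Hypothetical multi-rail dispatch sketch; both toy rails share one XOR coder. */
#include <stdio.h>
#include <string.h>

#define K    3            /* data blocks per stripe */
#define BLK  8            /* bytes per block        */

typedef void (*ec_encode_fn)(unsigned char data[K][BLK], unsigned char parity[BLK]);

typedef struct { const char *name; ec_encode_fn encode; } ec_rail_t;

/* Toy coder: single XOR parity (a stand-in for a real backend). */
static void xor_encode(unsigned char data[K][BLK], unsigned char parity[BLK])
{
    memset(parity, 0, BLK);
    for (int i = 0; i < K; i++)
        for (int j = 0; j < BLK; j++)
            parity[j] ^= data[i][j];
}

static ec_rail_t rails[] = { { "rail-0 (e.g., CPU)", xor_encode },
                             { "rail-1 (e.g., GPU)", xor_encode } };
static size_t next_rail;

/* Unified entry point: rotate across rails so that, in a real multi-rail
 * setup, successive stripes would be encoded on different devices in parallel. */
static const char *ec_submit(unsigned char data[K][BLK], unsigned char parity[BLK])
{
    ec_rail_t *r = &rails[next_rail++ % (sizeof rails / sizeof rails[0])];
    r->encode(data, parity);
    return r->name;
}

int main(void)
{
    unsigned char stripe[K][BLK] = { "aaaaaaa", "bbbbbbb", "ccccccc" };
    unsigned char parity[BLK];
    for (int s = 0; s < 4; s++)
        printf("stripe %d encoded on %s\n", s, ec_submit(stripe, parity));
    return 0;
}
```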