Xiaoyi will serve as a TPC Vice-Chair for IEEE Cloud Summit 2020.
Please submit your papers to IEEE Cloud Summit 2020. Call for Papers
Xiaoyi will serve as a TPC Co-Chair for The 6th IEEE International Workshop on High-Performance Big Data and Cloud Computing (HPBDC 2020).
Please submit your papers to HPBDC 2020. Call for Papers
Xiaoyi will serve as a TPC member for the following conferences in 2020!
A paper has been accepted at IISWC 2019: SimdHT-Bench: Characterizing SIMD-Aware Hash
Table Designs on Emerging CPU Architectures. The paper was also nominated as a
Best Paper Award Candidate.
Congratulations, Dr. Shankar!
Paper Info
[IISWC'19] SimdHT-Bench: Characterizing SIMD-Aware Hash Table Designs on Emerging CPU Architectures (Best Paper Award Nomination)
Dipti Shankar, Xiaoyi Lu, and Dhabaleswar K. Panda.
In Proceedings of 2019 IEEE International Symposium on Workload Characterization (IISWC), 2019.
Abstract
With the emergence of modern multi-core CPU architectures that support data
parallelism via vectorization, several storage systems have been employing
SIMD-based techniques to optimize data-parallel operations on in-memory
structures like hash-tables. In this paper, we perform an in-depth
characterization of the opportunities for incorporating AVX vectorization-based
SIMD-aware designs for hash table lookups on emerging CPU architectures. We
analyze the challenges and design dimensions involved in exploiting
vectorization-based parallel key searching over cache-optimized non-SIMD hash
tables. Based on this, we design a comprehensive micro-benchmark suite,
SimdHT-Bench, that enables evaluating the performance and applicability of CPU
SIMD-aware hash table designs for accelerating different read-intensive
workloads. With SimdHT-Bench, we study five different use-case scenarios with
varied workload patterns, on the latest Intel Skylake and Intel Cascade Lake
multi-core CPU nodes. Further, to validate the applicability of SimdHT-Bench,
we employ these performance studies to design a high-performance SIMD-aware
RDMA-based in-memory key-value store to accelerate the Memcached ‘Multi-Get’
workload. We demonstrate that the SIMD-integrated designs can achieve up to
1.45x–2.04x improvement in server-side Get throughput and up to 34% improvement
in end-to-end Multi-Get latencies over the state-of-the-art CPU-optimized
non-SIMD MemC3 hash table design, on a high-performance compute cluster with
Intel Skylake processors and InfiniBand EDR interconnects.
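To make the idea of vectorized key searching concrete, below is a minimal sketch in C with AVX2 intrinsics (compile with -mavx2) that compares one probe key against all eight key slots of a bucket in a single vector compare. It is only an illustration of the general technique and is not code from SimdHT-Bench or the paper's designs.

```c
/* Illustrative only: a scalar lookup would scan the eight slots one by one;
 * the vectorized version replaces that loop with one compare plus a movemask. */
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

#define BUCKET_SLOTS 8

typedef struct {
    uint32_t keys[BUCKET_SLOTS];   /* eight 32-bit key tags per bucket */
    uint32_t values[BUCKET_SLOTS]; /* payloads (or pointers/offsets)   */
} bucket_t;

/* Return the value stored for `key`, or -1 if the key is not in the bucket. */
static int bucket_lookup_simd(const bucket_t *b, uint32_t key)
{
    __m256i probe = _mm256_set1_epi32((int)key);                /* broadcast probe key */
    __m256i cand  = _mm256_loadu_si256((const __m256i *)b->keys);
    __m256i eq    = _mm256_cmpeq_epi32(probe, cand);            /* lane-wise equality  */
    int mask = _mm256_movemask_ps(_mm256_castsi256_ps(eq));     /* one bit per lane    */
    if (mask == 0)
        return -1;                                              /* no slot matched     */
    return (int)b->values[__builtin_ctz(mask)];                 /* first matching slot */
}

int main(void)
{
    bucket_t b = { .keys   = { 11, 22, 33, 44, 55, 66, 77, 88 },
                   .values = {  1,  2,  3,  4,  5,  6,  7,  8 } };
    printf("lookup(44) -> %d\n", bucket_lookup_simd(&b, 44));   /* 4  */
    printf("lookup(99) -> %d\n", bucket_lookup_simd(&b, 99));   /* -1 */
    return 0;
}
```

Trading a per-slot scan for a single compare-and-movemask is exactly the kind of design point whose costs and benefits SimdHT-Bench is built to characterize across workloads and CPU generations.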
My co-advised Ph.D. student Dipti Shankar has successfully defended her thesis and graduated. Congratulations, Dr. Shankar!
Thesis Info
Title: Designing Fast, Resilient and Heterogeneity-Aware Key-Value Storage on Modern HPC Clusters
Year and Degree: 2019, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Committee
- Dhabaleswar K. Panda (Advisor)
- Xiaoyi Lu (Co-Advisor)
- Feng Qin (Committee Member)
- Gagan Agrawal (Committee Member)
Abstract
With the recent emergence of in-memory computing for Big Data analytics,
memory-centric and distributed key-value storage has become vital to
accelerating data processing workloads in high-performance computing (HPC) and
data center environments. This has led to several research works focusing on
advanced key-value store designs with Remote Direct Memory Access (RDMA) and
hybrid ‘DRAM+NVM’ storage designs. However, these existing designs are
constrained by blocking store/retrieve semantics, incurring additional
complexity with the introduction of high data availability and durability
requirements. To cater to the performance, scalability, durability and
resilience needs of the diverse key-value store-based workloads (e.g., online
transaction processing, offline data analytics, etc.), it is therefore vital to
fully exploit resources on modern HPC systems. Moreover, to maximize server
scalability and end-to-end performance, it is necessary to focus on designing
an RDMA-aware communication engine that goes beyond optimizing the key-value
store middleware for better client-side latencies.
Towards addressing this, in this dissertation, we present a ‘holistic approach’
to designing high-performance, resilient and heterogeneity-aware key-value
storage for HPC clusters, that encompasses: (1) RDMA-enabled networking, (2)
high-speed NVMs, (3) emerging byte-addressable persistent memory devices, and,
(4) SIMD-enabled multi-core CPU compute capabilities. We first introduce
non-blocking API extensions to the RDMA-Memcached client, which allow an
application to separate the request issue and completion phases. This
facilitates overlapping opportunities by truly leveraging the one-sided
characteristics of the underlying RDMA communication engine, while conforming
to the basic Set/Get semantics. Secondly, we analyze the overhead of employing
memory-efficient resilience via Erasure Coding (EC), in an online fashion.
Based on this, we extend our proposed RDMA-aware key-value store, which supports
non-blocking API semantics, to enable overlapping the EC encoding/decoding
compute phases with the scatter/gather communication protocol involved in
resiliently storing the distributed key-value data objects.
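As a rough illustration of the ‘separate the request issue and completion phases’ idea described above, the sketch below uses hypothetical kv_iset/kv_wait calls and a toy in-memory backend; the real RDMA-Memcached extensions and their semantics differ, but the overlap pattern (issue a request, do other work such as EC encoding, then complete it) is the point.

```c
/* Hypothetical non-blocking Set sketch (names and backend are illustrative,
 * not the actual RDMA-Memcached interface). */
#include <stdio.h>

typedef struct {
    const char *key;
    const char *val;
    int         done;          /* completion flag the caller waits on */
} kv_request_t;

static const char *store_key, *store_val;   /* toy single-slot "server" */

/* Issue phase: record the request and return immediately. */
static kv_request_t kv_iset(const char *key, const char *val)
{
    kv_request_t r = { key, val, 0 };
    return r;
}

/* Completion phase: a real client would poll an RDMA completion queue here. */
static void kv_wait(kv_request_t *r)
{
    store_key = r->key;
    store_val = r->val;
    r->done   = 1;
}

int main(void)
{
    kv_request_t r = kv_iset("user:42", "hello");  /* issue                        */
    /* ... overlap other work here: encode the next object, post more requests ... */
    kv_wait(&r);                                   /* complete                     */
    printf("%s -> %s (done=%d)\n", store_key, store_val, r.done);
    return 0;
}
```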
This work also examines durable key-value store designs for emerging persistent
memory technologies. While RDMA-based protocols employed in existing volatile
DRAM-based key-value stores can be directly leveraged, we find that there is a
need for a more integrated approach to fully exploit the fine-grained
durability of these new byte-addressable storage devices. We propose ‘RDMP-KV’,
which employs a hybrid ‘server-reply/server-bypass’ approach to ‘durably’ store
individual key-value pair objects on the remote persistent memory-equipped
servers via RDMA. RDMP-KV’s runtime can easily adapt to existing
(server-assisted durability) and emerging (appliance durability) RDMA-capable
interconnects, while ensuring server scalability and remote data consistency.
Finally, the thesis explores SIMD-accelerated CPU-centric hash table designs
that can enable higher server throughput. We propose an end-to-end SIMD-aware
key-value store design, ‘SCOR-KV’, which introduces optimistic
‘RDMA+SIMD’-aware client-centric request/response offloading protocols. SCOR-KV
can minimize the server-side data processing overheads to achieve better
scalability, without compromising on the client-side latencies.
With this as the basis, we demonstrate the potential performance gains of the
proposed designs with online (e.g., YCSB) and offline (e.g., in-memory and
distributed burst-buffer over Lustre for Hadoop I/O) workloads on small-scale
and production-scale HPC clusters.
Keywords
High-Performance Computing; Key-Value Store; RDMA; Persistent Memory;
source
A paper has been accepted at SC 2019: TriEC: Tripartite Graph Based Erasure Coding
NIC Offload. This year, SC accepted only 72 papers directly and asked for major
revisions on 15 more, out of 344 initial submissions. This paper was accepted
directly and was nominated as a Best Student Paper (BSP) Finalist.
Congratulations to my student, Haiyang!
Paper Info
[SC'19] TriEC: Tripartite Graph Based Erasure Coding NIC Offload (Best Student Paper Finalist)
Haiyang Shi and Xiaoyi Lu.
In Proceedings of the 32nd International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2019. (Acceptance Rate: 22.7%, 78/344)
Abstract
Erasure Coding (EC) NIC offload is a promising technology for designing
next-generation distributed storage systems. However, this paper has identified
three major limitations of current-generation EC NIC offload schemes on modern
SmartNICs. Thus, this paper proposes a new EC NIC offload paradigm based on the
tripartite graph model, namely TriEC. TriEC supports both encode-and-send and
receive-and-decode operations efficiently. Through theorem-based proofs,
co-designs with memcached (i.e., TriEC-Cache), and extensive experiments, we
show that TriEC is correct and can deliver better performance than the
state-of-the-art EC NIC offload schemes (i.e., BiEC). Benchmark evaluations
demonstrate that TriEC outperforms BiEC by up to 1.82x and 2.33x for encoding
and recovering, respectively. With extended YCSB workloads, TriEC reduces the
average write latency by up to 23.2% and the recovery time by up to 37.8%.
TriEC outperforms BiEC by 1.32x for a full-node recovery with 8 million
records.
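For readers less familiar with erasure coding, the sketch below shows the simplest possible code, a single XOR parity block over k data blocks, just to ground the encode/recover terminology used in the abstract. TriEC's actual contribution, modeling encode-and-send and receive-and-decode operations so they can be offloaded onto SmartNICs via a tripartite graph, is not captured by this host-side toy.

```c
/* Background sketch only: RAID-5-style single XOR parity over K data blocks. */
#include <stdio.h>
#include <string.h>

#define K        3      /* data blocks per stripe */
#define BLK_LEN  8      /* bytes per block        */

/* Encode: parity = d0 XOR d1 XOR ... XOR d(K-1). */
static void ec_encode(unsigned char data[K][BLK_LEN], unsigned char parity[BLK_LEN])
{
    memset(parity, 0, BLK_LEN);
    for (int i = 0; i < K; i++)
        for (int j = 0; j < BLK_LEN; j++)
            parity[j] ^= data[i][j];
}

/* Recover one lost data block by XOR-ing the parity with the surviving blocks. */
static void ec_recover(unsigned char data[K][BLK_LEN],
                       const unsigned char parity[BLK_LEN], int lost)
{
    memcpy(data[lost], parity, BLK_LEN);
    for (int i = 0; i < K; i++)
        if (i != lost)
            for (int j = 0; j < BLK_LEN; j++)
                data[lost][j] ^= data[i][j];
}

int main(void)
{
    unsigned char data[K][BLK_LEN] = { "block-0", "block-1", "block-2" };
    unsigned char parity[BLK_LEN];

    ec_encode(data, parity);
    memset(data[1], 0, BLK_LEN);                 /* lose block 1            */
    ec_recover(data, parity, 1);                 /* rebuild it from parity  */
    printf("recovered: %s\n", (char *)data[1]);  /* prints "block-1"        */
    return 0;
}
```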
A paper has been accepted at HPDC 2019: UMR-EC: A Unified and Multi-Rail Erasure Coding Library for High-Performance Distributed Storage Systems. This year, HPDC accepted only 22 papers out of 106 submissions; 11 of the accepted papers went through shepherding, while this paper was accepted directly. The first author is one of my Ph.D. students, Haiyang Shi. Congratulations to Haiyang and the other co-authors!
Paper Info
[HPDC'19] UMR-EC: A Unified and Multi-Rail Erasure Coding Library for High-Performance Distributed Storage Systems
Haiyang Shi, Xiaoyi Lu, Dipti Shankar, and Dhabaleswar K. Panda.
In Proceedings of the 28th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2019. (Acceptance Rate: 20.7%, 22/106)
Abstract
Distributed storage systems typically need data to be stored redundantly to
guarantee data durability and reliability. While the conventional approach
towards this objective is to store multiple replicas, today’s unprecedented
data growth rates encourage modern distributed storage systems to employ
Erasure Coding (EC) techniques, which can achieve better storage efficiency.
Various hardware-based EC schemes have been proposed in the community to
leverage the advanced compute capabilities of modern data center and cloud
environments. Currently, there is no unified and easy way for distributed
storage systems to fully exploit multiple devices such as CPUs, GPUs, and
network devices (i.e., multi-rail support) to perform EC operations in
parallel, which leads to the under-utilization of the available compute
power. In this paper, we first introduce an analytical model to analyze the
design scope of efficient EC schemes in distributed storage systems. Guided by
the performance model, we propose UMR-EC, a Unified and Multi-Rail Erasure
Coding library that can fully exploit heterogeneous EC coders. Our proposed
interface is complemented by asynchronous semantics, an optimized
metadata-free scheme, and EC rate-aware task scheduling that can enable a
highly-efficient I/O pipeline. To show the benefits and effectiveness of
UMR-EC, we re-design HDFS 3.x write/read pipelines based on the guidelines
observed in the proposed performance model. Our performance evaluations show
that our proposed designs can outperform the write performance of replication
schemes and the default HDFS EC coder by 3.7x–6.1x and 2.4x–3.3x,
respectively, and can improve the performance of read with failure recoveries
up to 5.1x compared with the default HDFS EC coder. Compared with the fastest
available CPU coder (i.e., ISA-L), our proposed designs have an improvement of
up to 66.0% and 19.4% for write and read with failure recoveries, respectively.
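As a loose illustration of the multi-rail idea, the sketch below registers several coder ‘rails’ behind one encode entry point and dispatches stripes across them round-robin. The names and structure are hypothetical and far simpler than UMR-EC's actual interface; real rails would wrap backends such as a CPU coder (e.g., ISA-L), a GPU coder, or an EC-offload-capable NIC, and scheduling would be asynchronous and EC rate-aware rather than round-robin.

```c
/* Hypothetical multi-rail dispatch sketch; both toy rails share one XOR coder. */
#include <stdio.h>
#include <string.h>

#define K    3            /* data blocks per stripe */
#define BLK  8            /* bytes per block        */

typedef void (*ec_encode_fn)(unsigned char data[K][BLK], unsigned char parity[BLK]);

typedef struct { const char *name; ec_encode_fn encode; } ec_rail_t;

/* Toy coder: single XOR parity (a stand-in for a real backend). */
static void xor_encode(unsigned char data[K][BLK], unsigned char parity[BLK])
{
    memset(parity, 0, BLK);
    for (int i = 0; i < K; i++)
        for (int j = 0; j < BLK; j++)
            parity[j] ^= data[i][j];
}

static ec_rail_t rails[] = { { "rail-0 (e.g., CPU)", xor_encode },
                             { "rail-1 (e.g., GPU)", xor_encode } };
static size_t next_rail;

/* Unified entry point: rotate across rails so that, in a real multi-rail
 * setup, successive stripes would be encoded on different devices in parallel. */
static const char *ec_submit(unsigned char data[K][BLK], unsigned char parity[BLK])
{
    ec_rail_t *r = &rails[next_rail++ % (sizeof rails / sizeof rails[0])];
    r->encode(data, parity);
    return r->name;
}

int main(void)
{
    unsigned char stripe[K][BLK] = { "aaaaaaa", "bbbbbbb", "ccccccc" };
    unsigned char parity[BLK];
    for (int s = 0; s < 4; s++)
        printf("stripe %d encoded on %s\n", s, ec_submit(stripe, parity));
    return 0;
}
```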