• [HotI '23]  Performance Characterization of Large Language Models on High-Speed Interconnects

    Hao Qi, Liuyao Dai, Weicong Chen, Zhen Jia, Xiaoyi Lu

    Proceedings of the 30th IEEE Hot Interconnects Symposium, 2023

  • [ModSim '23]  LogGOPSGauger: A Work-In-Progress Tool for Gauging LogGOPS Model with GPU-Aware Communication

    Liuyao Dai, Adam Weingram, Hao Qi, Weicong Chen, Xiaoyi Lu

    Workshop on Modeling & Simulation of Systems and Applications (ModSim), 2023  (Poster Paper)

  • [IPDPS '23]  SBGT: Scaling Bayesian-based Group Testing for Disease Surveillance

    Weicong Chen, Hao Qi, Xiaoyi Lu, Curtis Tatsuoka

    Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2023

    Keywords: Group testing, Bayesian, Lattices, Spark, COVID-19

    Abstract: The COVID-19 pandemic underscored the necessity for disease surveillance using group testing. Novel Bayesian methods using lattice models were proposed, which offer substantial improvements in group testing efficiency by precisely quantifying uncertainty in diagnoses, acknowledging varying individual risk and dilution effects, and guiding optimally convergent sequential pooled test selections using a Bayesian Halving Algorithm. Computationally, however, Bayesian group testing poses considerable challenges, as computational complexity grows exponentially with sample size, making it difficult to reach a desirable scale without practical limitations. We propose a new framework for scaling Bayesian group testing based on Spark: SBGT. We show that SBGT is lightning fast and highly scalable. In particular, SBGT is up to 376x, 1733x, and 1523x faster than the state-of-the-art framework in manipulating lattice models, performing test selections, and conducting statistical analyses, respectively, while achieving up to 97.9%.
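
    Example: To make the halving idea concrete, the sketch below is a minimal, brute-force Python simulation of Bayesian sequential pool selection: it keeps a posterior over all infection states of a tiny population and picks the pooled test whose predicted positive-probability is closest to 1/2. It assumes perfect tests and illustrates only the selection rule, not SBGT's lattice models or its Spark implementation.

        from itertools import combinations, product

        n = 6              # population size (tiny, so brute force over 2^n states is fine)
        prior_p = 0.1      # per-individual prior infection probability

        # Posterior over all 2^n infection states, initialized from the prior.
        states = list(product([0, 1], repeat=n))
        post = {s: prior_p ** sum(s) * (1 - prior_p) ** (n - sum(s)) for s in states}

        def pool_positive_prob(pool):
            # P(pooled test positive) = posterior mass of states with any infected member.
            return sum(w for s, w in post.items() if any(s[i] for i in pool))

        def next_pool(max_size=3):
            # Halving-style rule: pick the pool that best splits the posterior mass.
            cands = [c for k in range(1, max_size + 1) for c in combinations(range(n), k)]
            return min(cands, key=lambda c: abs(pool_positive_prob(c) - 0.5))

        def update(pool, outcome):
            # Bayes update for a perfect pooled test (1 = positive, 0 = negative).
            for s in post:
                if any(s[i] for i in pool) != bool(outcome):
                    post[s] = 0.0
            z = sum(post.values())
            for s in post:
                post[s] /= z

        pool = next_pool()
        print("next pooled test:", pool, "P(positive) = %.3f" % pool_positive_prob(pool))
        update(pool, outcome=1)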

  • [JCST '23]  xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning

    Adam Weingram, Yuke Li, Hao Qi, Darren Ng, Liuyao Dai, Xiaoyi Lu

    Journal of Computer Science and Technology, 2023  (Invited Paper for the Special Issue in Honor of Professor Kai Hwang’s 80th Birthday)

    Keywords: collective; deep learning; distributed training; GPUDirect; RDMA (remote direct memory access)

    Abstract: Machine learning techniques have become ubiquitous both in industry and academic applications. Increasing model sizes and training data volumes necessitate fast and efficient distributed training approaches. Collective communications greatly simplify inter- and intra-node data transfer and are an essential part of the distributed training process as information such as gradients must be shared between processing nodes. In this paper, we survey the current state-of-the-art collective communication libraries (namely xCCL, including NCCL, oneCCL, RCCL, MSCCL, ACCL, and Gloo), with a focus on the industry-led ones for deep learning workloads. We investigate the design features of these xCCLs, discuss their use cases in the industry deep learning workloads, compare their performance with industry-made benchmarks (i.e., NCCL Tests and PARAM), and discuss key take-aways and interesting observations. We believe our survey sheds light on potential research directions of future designs for xCCLs.
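
    Example: The ring algorithm (reduce-scatter followed by allgather) is the collective pattern most commonly associated with these libraries for large-message allreduce. The short Python simulation below illustrates the data movement only; it is a sketch of the generic algorithm, not any xCCL's actual implementation.

        def ring_allreduce(buffers):
            # buffers[r] is rank r's input, split into one chunk per rank.
            p = len(buffers)
            chunks = [list(b) for b in buffers]
            for t in range(p - 1):                    # phase 1: reduce-scatter
                for r in range(p):
                    src, idx = (r - 1) % p, (r - 1 - t) % p
                    chunks[r][idx] += chunks[src][idx]
            for t in range(p - 1):                    # phase 2: allgather
                for r in range(p):
                    src, idx = (r - 1) % p, (r - t) % p
                    chunks[r][idx] = chunks[src][idx]
            return chunks

        # Four ranks, each contributing a 4-element gradient.
        ranks = [[1, 2, 3, 4], [10, 20, 30, 40],
                 [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
        print(ring_allreduce(ranks))   # every rank ends with [1111, 2222, 3333, 4444]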

  • [SC '23]  Early Experience in Characterizing Training Large Language Models on Modern HPC Clusters

    Hao Qi, Liuyao Dai, Weicong Chen, Xiaoyi Lu

    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023  (Research Poster Paper)

    Keywords: LLM, RDMA, Performance

  • [SC '23]  Characterizing One-/Two-sided Designs in OpenSHMEM Collectives

    Yuke Li, Yanfei Guo, Xiaoyi Lu

    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023  (Research Poster Paper)

    Keywords: OpenSHMEM, PGAS, Omni-Path, Performance

  • [ModSim '23]  Early Experiences in Modeling Performance Implications of DPU-Offloaded Computation

    Weicong Chen, Yuke Li, Arjun Kashyap, Xiaoyi Lu

    Workshop on Modeling & Simulation of Systems and Applications (ModSim), 2023

  • [SC '23]  An Early Case Study with Multi-Tenancy Support in SPDK’s NVMe-over-Fabric Designs

    Darren Ng, Charles Parkinson, Andrew Lin, Arjun Kashyap, Xiaoyi Lu

    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023  (Research Poster Paper)

    Keywords: Storage Disaggregation, NVMe-over-Fabric, SPDK

  • [HotI '23]  Characterizing Lossy and Lossless Compression on Emerging BlueField DPU Architectures

    Yuke Li, Arjun Kashyap, Yanfei Guo, Xiaoyi Lu

    Proceedings of the 30th IEEE Hot Interconnects Symposium, 2023

  • [HotInfra '23]  On the Discontinuation of Persistent Memory: Looking Back to Look Forward

    Tianxi Li, Yang Wang, Xiaoyi Lu

    The 1st International Workshop on Hot Topics in System Infrastructure (HotInfra), in conjunction with the International Symposium on Computer Architecture (ISCA), 2023

  • [MIT Press '22]  High-Performance Big Data Computing

    Dhabaleswar K. Panda, Xiaoyi Lu, Dipti Shankar

    The MIT Press, 2022

    Abstract: Over the last decade, the exponential explosion of data known as big data has changed the way we understand and harness the power of data. The emerging field of high-performance big data computing, which brings together high-performance computing (HPC), big data processing, and deep learning, aims to meet the challenges posed by large-scale data processing. This book offers an in-depth overview of high-performance big data computing and the associated technical issues, approaches, and solutions. The book covers basic concepts and necessary background knowledge, including data processing frameworks, storage systems, and hardware capabilities; offers a detailed discussion of technical issues in accelerating big data computing in terms of computation, communication, memory and storage, codesign, workload characterization and benchmarking, and system deployment and management; and surveys benchmarks and workloads for evaluating big data middleware systems. It presents a detailed discussion of big data computing systems and applications with high-performance networking, computing, and storage technologies, including state-of-the-art designs for data processing and storage systems. Finally, the book considers some advanced research topics in high-performance big data computing, including designing high-performance deep learning over big data (DLoBD) stacks and HPC cloud technologies.

  • [VLDB '22]  A Study of Database Performance Sensitivity to Experiment Settings

    Yang Wang, Miao Yu, Yujie Hui, Fang Zhou, Yuyang Huang, Rui Zhu, Xueyuan Ren, Tianxi Li, Xiaoyi Lu

    Proceedings of the VLDB Endowment, 2022

    Abstract: To allow performance comparison across different systems, our community has developed multiple benchmarks, such as TPC-C and YCSB, which are widely used. However, despite such effort, interpreting and comparing performance numbers is still a challenging task, because one can tune benchmark parameters, system features, and hardware settings, which can lead to very different system behaviors. Such tuning creates a long-standing question of whether the conclusion of a work can hold under different settings. This work tries to shed light on this question by reproducing 11 works evaluated under TPC-C and YCSB, measuring their performance under a wider range of settings, and investigating the reasons for the change of performance numbers. By doing so, this paper tries to motivate the discussion about whether and how we should address this problem. While this paper does not give a complete solution (this is beyond the scope of a single paper), it proposes concrete suggestions we can take to improve the state of the art.

  • [arXiv '22]  Arcadia: A Fast and Reliable Persistent Memory Replicated Log

    Shashank Gugnani, Scott Guthridge, Frank Schmuck, Owen Anderson, Deepavali Bhagwat, Xiaoyi Lu

    CoRR, 2022

  • [HPDC '22]  NVMe-oAF: Towards Adaptive NVMe-oF for IO-Intensive Workloads on HPC Cloud

    Arjun Kashyap, Xiaoyi Lu

    Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, 2022

    Keywords: nvme-over-fabrics, shared memory, hpc cloud, spdk

    Abstract: Applications running inside containers or virtual machines traditionally use TCP/IP for communication in HPC clouds and data centers. The TCP/IP path usually becomes a major performance bottleneck for applications performing NVMe-over-Fabrics (NVMe-oF) based I/O operations in disaggregated storage settings. We propose an adaptive communication channel, called NVMe-over-Adaptive-Fabric (NVMe-oAF), that applications can leverage to eliminate the high latency and low bandwidth incurred by remote I/O requests over TCP/IP. NVMe-oAF accelerates I/O-intensive applications using locality awareness along with optimized shared memory and TCP/IP paths. The adaptiveness of the fabric stems from the ability to adaptively select the shared memory or TCP channel and further apply optimizations to the chosen channel. To evaluate NVMe-oAF, we co-design Intel's SPDK library with our designs and show up to 7.1x bandwidth improvement and up to 4.2x latency reduction for various workloads over commodity TCP/IP-based Ethernet networks (e.g., 10Gbps, 25Gbps, and 100Gbps). We achieve similar (or sometimes better) performance when compared to NVMe-over-RDMA by avoiding the cumbersome management of RDMA in HPC cloud environments. Finally, we also co-design NVMe-oAF with H5bench to showcase the benefit it brings to HDF5 applications. Our evaluation indicates up to a 7x bandwidth improvement when compared with the network file system (NFS).
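
    Example: The core of the adaptiveness is a locality check at channel-setup time. The Python sketch below illustrates that dispatch; the class and method names are hypothetical stand-ins, not SPDK or NVMe-oAF APIs (4420 is the conventional NVMe-oF port).

        import socket

        class ShmChannel:
            def submit_io(self, req):
                return f"shm: {req}"               # placeholder for a shared-memory ring transfer

        class TcpChannel:
            def __init__(self, host, port=4420):
                self.addr = (host, port)
            def submit_io(self, req):
                return f"tcp->{self.addr}: {req}"  # placeholder for an NVMe/TCP request

        def open_channel(target_host):
            # Adaptive fabric selection: shared memory if co-located, TCP otherwise.
            local = target_host in ("localhost", "127.0.0.1", socket.gethostname())
            return ShmChannel() if local else TcpChannel(target_host)

        print(open_channel(socket.gethostname()).submit_io("read blk 42"))
        print(open_channel("storage-node-7").submit_io("read blk 42"))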

  • [HiPC '22]  HiBGT: High-Performance Bayesian Group Testing for COVID-19

    Weicong Chen, Xiaoyi Lu, Curtis Tatsuoka

    Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022  (Acceptance Rate: 26%, 35/131)

  • [Biostatistics '22]  Bayesian Group Testing with Dilution Effects

    Curtis Tatsuoka, Weicong Chen, Xiaoyi Lu

    Biostatistics, 2022  (kxac004)

    Abstract: A Bayesian framework for group testing under dilution effects has been developed, using lattice-based models. This work has particular relevance given the pressing public health need to enhance testing capacity for coronavirus disease 2019 and future pandemics, and the need for wide-scale and repeated testing for surveillance under constantly varying conditions. The proposed Bayesian approach allows for dilution effects in group testing and for general test response distributions beyond just binary outcomes. It is shown that even under strong dilution effects, an intuitive group testing selection rule that relies on the model order structure, referred to as the Bayesian halving algorithm, has attractive optimal convergence properties. Analogous look-ahead rules that can reduce the number of stages in classification by selecting several pooled tests at a time are proposed and evaluated as well. Group testing is demonstrated to provide great savings over individual testing in the number of tests needed, even for moderately high prevalence levels. However, there is a trade-off with a higher number of testing stages and increased variability. A web-based calculator is introduced to assist in weighing these factors and to guide decisions on when and how to pool under various conditions. High-performance distributed computing methods have also been implemented for considering larger pool sizes, when savings from group testing can be even more dramatic.
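
    Example: A quick way to see the savings from pooling is classical two-stage Dorfman pooling, a far simpler scheme than the Bayesian lattice approach here, but enough to make the arithmetic concrete: with prevalence p and pool size k, the expected number of tests per person is 1/k + 1 - (1-p)^k.

        def dorfman_tests_per_person(p, k):
            # Stage 1: 1/k pooled tests per person; stage 2: retest all k
            # individually if the pool is positive, i.e., with prob. 1-(1-p)^k.
            return 1.0 / k + 1.0 - (1.0 - p) ** k

        for p in (0.01, 0.05, 0.10):
            best_k = min(range(2, 51), key=lambda k: dorfman_tests_per_person(p, k))
            print(f"prevalence {p:.2f}: pool size {best_k}, "
                  f"{dorfman_tests_per_person(p, best_k):.3f} tests/person")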

  • [TBench '22]  Understanding hot interconnects with an extensive benchmark survey

    Yuke Li, Hao Qi, Gang Lu, Feng Jin, Yanfei Guo, Xiaoyi Lu

    BenchCouncil Transactions on Benchmarks, Standards and Evaluations, 2022

    Keywords: Benchmarks, Interconnects, RDMA

    Abstract: Understanding the designs and performance characterizations of hot interconnects on modern data center and high-performance computing (HPC) clusters has been a fruitful research topic in recent years. The rapid and continuous growth of high-bandwidth and low-latency communication requirements for various types of data center and HPC applications (such as big data, deep learning, and microservices) has been pushing the envelope of advanced interconnect designs. We believe it is high time to investigate the performance characterizations of representative hot interconnects with different benchmarks. Hence, this paper presents an extensive survey of state-of-the-art hot interconnects on data center and HPC clusters and the associated representative benchmarks to help the community better understand modern interconnects. In addition, we characterize these interconnects with the related benchmarks under different application scenarios. We provide our perspectives on benchmarking data center interconnects based on our survey, experiments, and results.
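
    Example: The canonical microbenchmark in this space is the ping-pong latency test (in the style of the OSU microbenchmarks). A minimal mpi4py rendition is sketched below as an illustration, not a replacement for the surveyed suites; run with "mpirun -np 2 python pingpong.py".

        from mpi4py import MPI
        import numpy as np

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        iters, warmup = 1000, 100

        for size in (1, 1024, 65536):
            buf = np.zeros(size, dtype=np.uint8)
            comm.Barrier()
            for i in range(iters + warmup):
                if i == warmup:
                    t0 = MPI.Wtime()
                if rank == 0:
                    comm.Send(buf, dest=1)
                    comm.Recv(buf, source=1)
                else:
                    comm.Recv(buf, source=0)
                    comm.Send(buf, dest=0)
            if rank == 0:
                # one-way latency = half the averaged round-trip time
                print(f"{size:6d} B: {(MPI.Wtime() - t0) / iters / 2 * 1e6:.2f} us")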

  • [Bench '22]  Benchmarking Object Detection Models with Mummy Nuts Dataset

    Darren Ng, Colin Schmierer, Andrew Lin, Zeyu Liu, Falin Yu, Shawn Newsam, Reza Ehsani, Xiaoyi Lu

    Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2022

  • [IPDPS '21]  NVMe-CR: A Scalable Ephemeral Storage Runtime for Checkpoint/Restart with NVMe-over-Fabrics

    Shashank Gugnani, Tianxi Li, Xiaoyi Lu

    Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2021

    Keywords: Checkpoint/Restart, NVMe, NVMf, Exascale

    Abstract: Emerging SSDs with NVMe-over-Fabrics (NVMf) support provide new opportunities to significantly improve the performance of IO-intensive HPC applications. However, state-of-the-art parallel filesystems cannot extract the best possible performance from fast NVMe SSDs and are not designed for latency-critical ephemeral IO tasks, such as checkpoint/restart. In this paper, we propose a powerful abstraction called microfs to peel away unnecessary software layers and eliminate namespace coordination. Building upon this abstraction, we present the design of NVMe-CR, a scalable ephemeral storage runtime for clusters with disaggregated compute and storage. NVMe-CR proposes techniques like metadata provenance, log record coalescing, and logically isolated shared device access, built around the microfs abstraction, to reduce the overhead of writing millions of concurrent checkpoint files. NVMe-CR utilizes high-density all-flash arrays accessible via NVMf to absorb bursty checkpoint IO and increase the progress rates of HPC applications obliviously. Using the ECP CoMD application as a use case, results show that on a local cluster our runtime can achieve near-perfect (> 0.96) efficiency at 448 processes. Moreover, our designs can reduce checkpoint overhead by as much as 2x compared to state-of-the-art storage systems.

  • [WORDS '21]  Towards Offloadable and Migratable Microservices on Disaggregated Architectures: Vision, Challenges, and Research Roadmap

    Xiaoyi Lu, Arjun Kashyap

    The Second Workshop on Resource Disaggregation and Serverless (WORDS '21), co-located with ASPLOS, 2021  (Vision Paper)

    Keywords: Offloadable, Migratable, Microservice, Disaggregated Architectures

  • [HPDC '21]  DStore: A Fast, Tailless, and Quiescent-Free Object Store for PMEM

    Shashank Gugnani, Xiaoyi Lu

    Proceedings of International ACM Symposium on High-Performance Parallel and Distributed Computing, 2021

    Keywords: Tailless, Quiescent-Free, Object Store, PMEM

  • [VLDB '21]  Understanding the Idiosyncrasies of Real Persistent Memory

    Shashank Gugnani, Arjun Kashyap, Xiaoyi Lu

    Proceedings of the VLDB Endowment, 2021

    Abstract: High capacity persistent memory (PMEM) is finally commercially available in the form of Intel’s Optane DC Persistent Memory Module (DCPMM). Researchers have raced to evaluate and understand the performance of DCPMM itself as well as systems and applications designed to leverage PMEM resulting from over a decade of research. Early evaluations of DCPMM show that its behavior is more nuanced and idiosyncratic than previously thought. Several assumptions made about its performance that guided the design of PMEM-enabled systems have been shown to be incorrect. Unfortunately, several peculiar performance characteristics of DCPMM are related to the memory technology (3D-XPoint) used and its internal architecture. It is expected that other technologies (such as STT-RAM, memristor, ReRAM, NVDIMM), with highly variable characteristics, will be commercially shipped as PMEM in the near future. Current evaluation studies fail to understand and categorize the idiosyncratic behavior of PMEM, i.e., how the peculiarities of DCPMM relate to other classes of PMEM. Clearly, there is a need for a study which can guide the design of systems and is agnostic to PMEM technology and internal architecture. In this paper, we first list and categorize the idiosyncratic behavior of PMEM by performing targeted experiments with our proposed PMIdioBench benchmark suite on a real DCPMM platform. Next, we conduct detailed studies to guide the design of storage systems, considering generic PMEM characteristics. The first study guides data placement on NUMA systems with PMEM while the second study guides the design of lock-free data structures, for both eADR- and ADR-enabled PMEM systems. Our results are often counter-intuitive and highlight the challenges of system design with PMEM.

  • [SC '21]  HatRPC: Hint-Accelerated Thrift RPC over RDMA

    Tianxi Li, Haiyang Shi, Xiaoyi Lu

    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021

  • [SEC '21]  Characterizing and Accelerating End-to-End EdgeAI Inference Systems for Object Detection Applications

    Yujie Hui, Jeffrey Lien, Xiaoyi Lu

    IEEE/ACM Symposium on Edge Computing (SEC), 2021

  • [Bench '20]  Impact of Commodity Networks on Storage Disaggregation with NVMe-oF

    Arjun Kashyap, Shashank Gugnani, Xiaoyi Lu

    Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2020

  • [SC '20]  RDMP-KV: Designing Remote Direct Memory Persistence Based Key-Value Stores with PMEM

    Tianxi Li, Dipti Shankar, Shashank Gugnani, Xiaoyi Lu

    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2020

    Keywords: next generation networking, data storage systems, persistent memory, key-value stores, RDMA

    Abstract: Byte-addressable persistent memory (PMEM) can be directly manipulated by Remote Direct Memory Access (RDMA) capable networks. However, existing studies to combine RDMA and PMEM cannot deliver the desired performance due to their PMEM-oblivious communication protocols. In this paper, we propose novel PMEM-aware RDMA-based communication protocols for persistent key-value stores, referred to as Remote Direct Memory Persistence based Key-Value stores (RDMP-KV). RDMP-KV employs a hybrid 'server-reply/server-bypass' approach to 'durably' store individual key-value objects on PMEM-equipped servers. RDMP-KV's runtime can easily adapt to existing (server-assisted durability) and emerging (appliance durability) RDMA-capable interconnects, while ensuring server scalability through a lightweight consistency scheme. Performance evaluations show that RDMP-KV can improve the server-side performance with different persistent key-value storage architectures by up to 22x, as compared with PMEM-oblivious RDMA-'Server-Reply' protocols. Our evaluations also show that RDMP-KV outperforms a distributed PMEM-based filesystem by up to 65%.

  • [JCST '20]  CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance

    Zheng-Hao Jin, Haiyang Shi, Ying-Xin Hu, Li Zha, Xiaoyi Lu

    Journal of Computer Science and Technology, 2020

    Keywords: CirroData, high performance, SQL-on-Hadoop, online analytical processing (OLAP), Big Data

  • [SC '20]  INEC: Fast and Coherent in-Network Erasure Coding

    Haiyang Shi, Xiaoyi Lu

    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2020

    Keywords: erasure coding, fault tolerance, in-network computing, next generation networking

    Abstract: Erasure coding (EC) is a promising fault tolerance scheme that has been applied to many well-known distributed storage systems. The capability of Coherent EC Calculation and Networking on modern SmartNICs has demonstrated that EC will be an essential feature of in-network computing. In this paper, we propose a set of coherent in-network EC primitives, named INEC. Our analyses based on the proposed α-β performance model demonstrate that INEC primitives can enable different kinds of EC schemes to fully leverage the EC offload capability on modern SmartNICs. We implement INEC on commodity RDMA NICs and integrate it into five state-of-the-art EC schemes. Our experiments show that INEC primitives significantly reduce 50th, 95th, and 99th percentile latencies, and accelerate the end-to-end throughput, write, and degraded read performance of the key-value store co-designed with INEC by up to 99.57%.
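
    Example: The paper defines its own α-β performance model; its exact form is in the paper, but it presumably extends the standard latency-bandwidth (Hockney-style) cost model, in which sending an n-byte message costs

        T(n) = \alpha + \beta n,

    where \alpha is the per-message startup latency and \beta the per-byte transfer time. Pipelining a k-stage operation over c chunks then costs roughly

        T \approx (k + c - 1)\,(\alpha + \beta n / c),

    which is why chunked, in-network EC primitives can hide most per-stage latency.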

  • [Bench '19]  Early Experience in Benchmarking Edge AI Processors with Object Detection Workloads.  (Best Paper Award Nomination)

    Yujie Hui, Jeffrey Lien, Xiaoyi Lu

    Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2019

  • [NVMW '19]  Accelerating NVRAM-aware In-Memory Datastore with Remote Direct Memory Persistence

    Dipti Shankar, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of Annual Non-Volatile Memories Workshop, 2019  (Poster Paper)

    Keywords: Non-Volatile Memory, In-Memory Datastore, Remote Direct Memory Persistence

  • [SC '19]  Three-Dimensional Characterization on Edge AI Processors with Object Detection Workloads

    Yujie Hui, Jeffrey Lien, Xiaoyi Lu

    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019  (Poster Paper)

    Keywords: Edge Computing, AI, Object Detection

  • [TPDS '19]  Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast

    Ching-Hsiang Chu, Xiaoyi Lu, Ammar Awan, Hari Subramoni, Bracy Elton, Dhabaleswar K. Panda

    IEEE Transactions on Parallel and Distributed Systems, 2019

    Keywords: broadcast, deep learning, hardware multicast, GPU, GPUDirect RDMA, heterogeneous broadcast, streaming

    Abstract: Broadcast is a widely used operation in many streaming and deep learning applications to disseminate large amounts of data on emerging heterogeneous High-Performance Computing (HPC) systems. However, traditional broadcast schemes do not fully utilize hardware features for Graphics Processing Unit (GPU)-based applications. In this paper, a model-oriented analysis is presented to identify performance bottlenecks of existing broadcast schemes on GPU clusters. Next, streaming-based broadcast schemes are proposed to exploit InfiniBand hardware multicast (IB-MCAST) and NVIDIA GPUDirect technology for efficient message transmission. The proposed designs are evaluated in the context of using Message Passing Interface (MPI) based benchmarks and applications. The experimental results indicate improved scalability and up to 82 percent reduction of latency compared to the state-of-the-art solutions in the benchmark-level evaluation. Furthermore, compared to the state-of-the-art, the proposed design yields consistently higher throughput for a synthetic streaming workload, and 1.3x faster training time for a deep learning framework.
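
    Example: The key idea behind streaming-based broadcast is chunking, so that successive pipeline stages of a large transfer overlap. The mpi4py sketch below shows only that generic chunking effect with plain MPI_Bcast; it does not reproduce the paper's IB-MCAST/GPUDirect design. Run with "mpirun -np 4 python chunked_bcast.py".

        from mpi4py import MPI
        import numpy as np

        comm = MPI.COMM_WORLD
        CHUNK = 1 << 20                       # 1 MiB pipeline unit
        data = np.empty(64 * CHUNK, dtype=np.uint8)
        if comm.Get_rank() == 0:
            data[:] = 7                       # root fills the payload

        t0 = MPI.Wtime()
        for off in range(0, data.size, CHUNK):
            comm.Bcast(data[off:off + CHUNK], root=0)   # chunks flow back-to-back
        if comm.Get_rank() == 0:
            print(f"broadcast {data.nbytes >> 20} MiB in {MPI.Wtime() - t0:.3f} s")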

  • [THPC '19]  Performance analysis of deep learning workloads using roofline trajectories

    M. Haseeb Javed, Khaled Z. Ibrahim, Xiaoyi Lu

    CCF Transactions on High Performance Computing, 2019  (Invited Paper)

    Abstract: Over the last decade, technologies derived from convolutional neural networks (CNNs), called Deep Learning applications, have revolutionized fields as diverse as cancer detection, self-driving cars, virtual assistants, etc. However, many users of such applications are not experts in Machine Learning itself. Consequently, there is limited knowledge among the community to run such applications in an optimized manner. The performance question for Deep Learning applications has typically been addressed by employing bespoke hardware (e.g., GPUs) better suited for such compute-intensive operations. However, such a degree of performance is only accessible at increasingly high financial costs, leaving only big corporations and governments with resources sufficient to employ them at a large scale. As a result, an average user is left with access only to commodity clusters where, in many cases, CPUs are the sole processing element. For such users to make effective use of the resources at their disposal, concerted efforts are necessary to figure out optimal hardware and software configurations. This study is one such step in this direction, as we use the Roofline model to perform a systematic analysis of representative CNN models and identify opportunities for black-box and application-aware optimizations. Using the findings from our study, we are able to obtain up to 3.5x speedup compared to vanilla TensorFlow with default configurations.
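
    Example: The Roofline bound itself is one line of arithmetic: attainable performance is the lesser of peak compute and memory bandwidth times arithmetic intensity (AI). The machine numbers below are made-up placeholders, not measurements from the study.

        def roofline_gflops(ai_flops_per_byte, peak_gflops=3000.0, mem_bw_gbs=100.0):
            # Attainable GFLOP/s = min(peak compute, AI x memory bandwidth)
            return min(peak_gflops, ai_flops_per_byte * mem_bw_gbs)

        ridge = 3000.0 / 100.0    # AI at which a kernel turns compute-bound
        for ai in (1.0, 8.0, ridge, 64.0):
            bound = roofline_gflops(ai)
            kind = "memory" if ai < ridge else "compute"
            print(f"AI = {ai:5.1f} flops/byte -> bound {bound:7.1f} GFLOP/s ({kind}-bound)")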

  • [IISWC '19]  SimdHT-Bench: Characterizing SIMD-Aware Hash Table Designs on Emerging CPU Architectures.  (Best Paper Award Nomination)

    Dipti Shankar, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE International Symposium on Workload Characterization, 2019

  • [SC '19]  TriEC: Tripartite Graph Based Erasure Coding NIC Offload  (Best Student Paper Finalist)

    Haiyang Shi, Xiaoyi Lu

    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019

    Keywords: NIC offload, tripartite, bipartite, erasure coding

    Abstract: Erasure Coding (EC) NIC offload is a promising technology for designing next-generation distributed storage systems. However, this paper has identified three major limitations of current-generation EC NIC offload schemes on modern SmartNICs. Thus, this paper proposes a new EC NIC offload paradigm based on the tripartite graph model, namely TriEC. TriEC supports both encode-and-send and receive-and-decode operations efficiently. Through theorem-based proofs, co-designs with memcached (i.e., TriEC-Cache), and extensive experiments, we show that TriEC is correct and can deliver better performance than the state-of-the-art EC NIC offload schemes (i.e., BiEC). Benchmark evaluations demonstrate that TriEC outperforms BiEC by up to 1.82x and 2.33x for encoding and recovering, respectively. With extended YCSB workloads, TriEC reduces the average write latency by up to 23.2%.
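
    Example: The encode-and-send / receive-and-decode duality can be illustrated with the simplest possible erasure code, single-parity XOR: encoding produces one parity block from k data blocks, and decoding XORs the survivors to rebuild one lost block. TriEC itself targets general EC schemes offloaded to SmartNICs; this is only a toy.

        from functools import reduce

        def xor_blocks(blocks):
            return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

        data = [b"ABCD", b"EFGH", b"IJKL"]     # k = 3 data blocks
        parity = xor_blocks(data)              # encode: one parity block

        lost = 1                               # pretend block 1 ("EFGH") is lost
        survivors = [b for i, b in enumerate(data) if i != lost] + [parity]
        recovered = xor_blocks(survivors)      # decode: XOR of all survivors
        assert recovered == data[lost]
        print("recovered:", recovered)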

  • [HPDC '19]  UMR-EC: A Unified and Multi-Rail Erasure Coding Library for High-Performance Distributed Storage Systems.

    Haiyang Shi, Xiaoyi Lu, Dipti Shankar, Dhabaleswar K. Panda

    Proceedings of International ACM Symposium on High-Performance Parallel and Distributed Computing, 2019

  • [HiPC '19]  SCOR-KV: SIMD-Aware Client-Centric and Optimistic RDMA-Based Key-Value Store for Emerging CPU Architectures.

    Dipti Shankar, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on High Performance Computing, 2019

  • [SC '19]  Designing High-Performance Erasure Coding Schemes for Next-Generation Storage Systems

    Haiyang Shi, Xiaoyi Lu

    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019  (ACM Student Research Competition Poster)

    Keywords: Erasure Coding, Storage Systems, SmartNIC, GPGPU

  • [IPDPS '19]  C-GDR: High-Performance Container-Aware GPUDirect MPI Communication Schemes on RDMA Networks.

    Jie Zhang, Xiaoyi Lu, Ching-Hsiang Chu, Dhabaleswar K. Panda

    Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2019

  • [HiPC '18]  OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training.

    Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on High Performance Computing, 2018

  • [FITEE '18]  Networking and communication challenges for post-exascale systems

    Dhabaleswar Panda, Xiaoyi Lu, Hari Subramoni

    Frontiers of Information Technology & Electronic Engineering, 2018  (Invited Vision Paper)

    Abstract: With the significant advancement in emerging processor, memory, and networking technologies, exascale systems will become available in the next few years (2020–2022). As the exascale systems begin to be deployed and used, there will be a continuous demand to run next-generation applications with finer granularity, finer time-steps, and increased data sizes. Based on historical trends, next-generation applications will require post-exascale systems during 2025–2035. In this study, we focus on the networking and communication challenges for post-exascale systems. Firstly, we present an envisioned architecture for post-exascale systems. Secondly, the challenges are summarized from different perspectives: heterogeneous networking technologies, high-performance communication and synchronization protocols, integrated support with accelerators and field-programmable gate arrays, fault-tolerance and quality-of-service support, energy-aware communication schemes and protocols, software-defined networking, and scalable communication protocols with heterogeneous memory and storage. Thirdly, we present the challenges in designing efficient programming model support for high-performance computing, big data, and deep learning on these systems. Finally, we emphasize the critical need for co-designing runtime with upper layers on these systems to achieve the maximum performance and scalability.

  • [TMSCS '18]  DLoBD: A Comprehensive Study of Deep Learning over Big Data Stacks on HPC Clusters

    Xiaoyi Lu, Haiyang Shi, Rajarshi Biswas, M. Haseeb Javed, Dhabaleswar K. Panda

    IEEE Transactions on Multi-Scale Computing Systems, 2018

    Keywords: machine learning, big data, graphics processing units, training data, parallel processing, deep learning

  • [Bench '18]  HPC AI500: A Benchmark Suite for HPC AI Systems.

    Zihan Jiang, Wanling Gao, Lei Wang, Xingwang Xiong, Yuchen Zhang, Xu Wen, Chunjie Luo, Hainan Ye, Xiaoyi Lu, Yunquan Zhang, Shengzhong Feng, Kenli Li, Weijia Xu, Jianfeng Zhan

    Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2018

  • [BigData '18]  Spark-uDAPL: Cost-Saving Big Data Analytics on Microsoft Azure Cloud with RDMA Networks.

    Xiaoyi Lu, Dipti Shankar, Haiyang Shi, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Big Data, 2018  (Short Paper)

  • [BPOE '18]  Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences

    Rajarshi Biswas, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2018

  • [EuroMPI '18]  Multi-Threading and Lock-Free MPI RMA Based Graph Processing on KNL and POWER Architectures.

    Mingzhe Li, Xiaoyi Lu, Hari Subramoni, Dhabaleswar K. Panda

    Proceedings of International Conference on EuroMPI, 2018

  • [ISC '18]  Can Unified-Memory Support on Pascal and Volta GPUs enable Out-of-Core DNN Training?  (Best Poster Paper Award)

    Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of the International Supercomputing Conference, 2018  (Poster Paper)

    Keywords: Unified Memory, Pascal, Volta, Out-of-Core DNN Training

  • [Bench '18]  A Survey on Deep Learning Benchmarks: Do We Still Need New Ones?

    Qin Zhang, Li Zha, Jian Lin, Dandan Tu, Mingzhe Li, Fan Liang, Ren Wu, Xiaoyi Lu

    Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2018

  • [JPDC '18]  MR-Advisor: A comprehensive tuning, profiling, and prediction tool for MapReduce execution frameworks on HPC clusters

    Md. Wasi-ur-Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Dipti Shankar, Dhabaleswar K. (DK) Panda

    Journal of Parallel and Distributed Computing, 2018

    Keywords: MapReduce, HPC, Tuning, Prediction

    Abstract: MapReduce is the most popular parallel computing framework for big data processing, which allows massive scalability across distributed computing environments. Advanced RDMA-based designs of Hadoop MapReduce have been proposed that alleviate the performance bottlenecks in default Hadoop MapReduce by leveraging the benefits from RDMA. On the other hand, the data processing engine Spark provides fast execution of MapReduce applications through in-memory processing. Performance optimization for these contemporary big data processing frameworks on modern High-Performance Computing (HPC) systems is a formidable task because of the numerous configuration possibilities in each of them. In this paper, we propose MR-Advisor, a comprehensive tuning, profiling, and prediction tool for MapReduce. MR-Advisor is generalized to provide performance optimizations for Hadoop, Spark, and RDMA-enhanced Hadoop MapReduce designs over different file systems such as HDFS, Lustre, and Tachyon. Performance evaluations reveal that, with MR-Advisor’s suggested values, the job execution performance can be enhanced by a maximum of 58%.

  • [UCC '18]  Analyzing, Modeling, and Provisioning QoS for NVMe SSDs.

    Shashank Gugnani, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE/ACM International Conference on Utility and Cloud Computing, 2018

  • [NVMW '18]  Accelerating MapReduce and DAG Execution Frameworks with Non-Volatile Memory and RDMA

    Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dhabaleswar K. Panda

    Proceedings of Annual Non-Volatile Memories Workshop, 2018  (Poster Paper)

    Keywords: Non-Volatile Memory, RDMA, MapReduce, DAG

  • [Bench '18]  EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures.  (Best Paper Award)

    Haiyang Shi, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2018

  • [HiPC '18]  Accelerating TensorFlow with Adaptive RDMA-Based gRPC.

    Rajarshi Biswas, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on High Performance Computing, 2018

  • [SoCC '18]  High-Performance Multi-Rail Erasure Coding Library over Modern Data Center Architectures: Early Experiences.

    Haiyang Shi, Xiaoyi Lu, Dipti Shankar, Dhabaleswar K. Panda

    Proceedings of ACM Symposium on Cloud Computing, 2018  (Poster Paper)

  • [CLUSTER '18]  Cutting the Tail: Designing High Performance Message Brokers to Reduce Tail Latencies in Stream Processing.

    M. Haseeb Javed, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Cluster Computing, 2018

  • [HotI '17]  Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-Capable Networks.

    Xiaoyi Lu, Haiyang Shi, M. Haseeb Javed, Rajarshi Biswas, Dhabaleswar K. Panda

    Proceedings of Annual Symposium on High-Performance Interconnects, 2017

  • [ICDCS '17]  High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads.

    Dipti Shankar, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Distributed Computing Systems, 2017

  • [TPDS '17]  A Comprehensive Study of MapReduce Over Lustre for Intermediate Data Placement and Shuffle Strategies on HPC Clusters

    Md. Wasi-ur-Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Dhabaleswar K. Panda

    IEEE Transactions on Parallel and Distributed Systems, 2017

    Keywords: Big Data, high performance computing, RDMA, MapReduce, Lustre

    Abstract: With high performance interconnects and parallel file systems, running MapReduce over modern High Performance Computing (HPC) clusters has attracted much attention due to its uniqueness of solving data analytics problems with a combination of Big Data and HPC technologies. Since the MapReduce architecture relies heavily on the availability of local storage media, the Lustre-based global storage in HPC clusters poses many new opportunities and challenges. In this paper, we perform a comprehensive study on different MapReduce over Lustre deployments and propose a novel high-performance design of YARN MapReduce on HPC clusters by utilizing Lustre as the additional storage provider for intermediate data. With a deployment architecture where both local disks and Lustre are utilized for intermediate data storage, we propose a novel priority directory selection scheme through which RDMA-enhanced MapReduce can choose the best intermediate storage during runtime by on-line profiling. Our results indicate that we can achieve a 44 percent performance benefit for shuffle-intensive workloads in leadership-class HPC systems. Our priority directory selection scheme can improve the job execution time by 63 percent over default MapReduce while executing multiple concurrent jobs. To the best of our knowledge, this is the first such comprehensive study for YARN MapReduce with Lustre and RDMA.
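
    Example: The priority directory selection idea (profile candidate intermediate-data directories online and route shuffle output to the fastest) can be sketched in a few lines. The paths and probe size below are hypothetical; the paper's scheme runs inside RDMA-enhanced YARN MapReduce, not as standalone code.

        import os
        import tempfile
        import time

        def probe_write_bw(directory, nbytes=16 << 20):
            # Measure write bandwidth (MB/s) with one throwaway 16 MiB file.
            payload = os.urandom(nbytes)
            with tempfile.NamedTemporaryFile(dir=directory) as f:
                t0 = time.perf_counter()
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())
                dt = time.perf_counter() - t0
            return nbytes / dt / 1e6

        candidates = ["/tmp", "."]    # e.g., a local SSD mount vs. a Lustre mount
        best = max(candidates, key=probe_write_bw)
        print("routing intermediate data to:", best)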

  • [BigData '17]  Performance characterization and acceleration of big data workloads on OpenPOWER system.

    Xiaoyi Lu, Haiyang Shi, Dipti Shankar, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Big Data, 2017

  • [HiPC '17]  MPI-LiFE: Designing High-Performance Linear Fascicle Evaluation of Brain Connectome with MPI.

    Shashank Gugnani, Xiaoyi Lu, Franco Pestilli, Cesar F. Caiafa, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on High Performance Computing, 2017

  • [BDCAT '17]  Characterization of Big Data Stream Processing Pipeline: A Case Study using Flink and Kafka.

    M. Haseeb Javed, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE/ACM International Conference on Big Data Computing, Applications, and Technologies, 2017

  • [SC '17]  Scalable reduction collectives with data partitioning-based multi-leader design.

    Mohammadreza Bayatpour, Sourav Chakraborty, Hari Subramoni, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis, 2017

  • [VEE '17]  Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand.

    Jie Zhang, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 2017

  • [CLUSTER '17]  A Scalable Network-Based Performance Analysis Tool for MPI on Large-Scale HPC Systems.

    Hari Subramoni, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Cluster Computing, 2017  (Short Paper)

  • [BigData '17]  NVMD: Non-volatile memory assisted design for accelerating MapReduce and DAG execution frameworks on HPC systems.

    Md. Wasi ur Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Big Data, 2017  (Short Paper)

  • [NVMW '17]  NRCIO: NVM-aware RDMA-based Communication and I/O Schemes for Big Data Analytics

    Xiaoyi Lu, Nusrat Islam, Md. Wasi-ur-Rahman, Dhabaleswar K. Panda

    Proceedings of Annual Non-Volatile Memories Workshop, 2017  (Position Paper)

    Keywords: Non-Volatile Memory, RDMA, Big Data

  • [Chapter '17]  Building Efficient HPC Cloud with SR-IOV-Enabled InfiniBand: The MVAPICH2 Approach.

    Xiaoyi Lu, Jie Zhang, Dhabaleswar K. Panda

    Research Advances in Cloud Computing, 2017

  • [UCC '17]  Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?  (Best Student Paper Award)

    Jie Zhang, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE/ACM International Conference on Utility and Cloud Computing, 2017

  • [BigData '17]  Characterizing and accelerating indexing techniques on distributed ordered tables.

    Shashank Gugnani, Xiaoyi Lu, Houliang Qi, Li Zha, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Big Data, 2017

  • [ICPP '17]  Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning.

    Ching-Hsiang Chu, Xiaoyi Lu, Ammar Ahmad Awan, Hari Subramoni, Jahanzeb Maqbool Hashmi, Bracy Elton, Dhabaleswar K. Panda

    Proceedings of International Conference on Parallel Processing, 2017

  • [TCDE '17]  Scalable and Distributed Key-Value Store-based Data Management Using RDMA-Memcached.

    Xiaoyi Lu, Dipti Shankar, Dhabaleswar K. Panda

    IEEE Data Engineering Bulletin, 2017  (Invited Paper)

  • [CCGrid '17]  Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud.

    Shashank Gugnani, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2017

  • [UCC '17]  HPC Meets Cloud: Building Efficient Clouds for HPC, Big Data, and Deep Learning Middleware and Applications.

    Dhabaleswar K. Panda, Xiaoyi Lu

    Proceedings of IEEE/ACM International Conference on Utility and Cloud Computing, 2017

  • [HiPC '17]  Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand.

    Mingzhe Li, Xiaoyi Lu, Hari Subramoni, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on High Performance Computing, 2017

  • [IPDPS '17]  High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV Enabled InfiniBand Clusters.

    Jie Zhang, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2017

  • [BDCAT '16]  Performance characterization of hadoop workloads on SR-IOV-enabled virtualized InfiniBand clusters.

    Shashank Gugnani, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE/ACM International Conference on Big Data Computing, Applications, and Technologies, 2016

  • [SBAC-PAD '16]  MR-Advisor: A Comprehensive Tuning Tool for Advising HPC Users to Accelerate MapReduce Applications on Supercomputers.

    Md. Wasi ur Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Dipti Shankar, Dhabaleswar K. Panda

    Proceedings of International Symposium on Computer Architecture and High Performance Computing, 2016

  • [BigData '16]  High-performance design of apache spark with RDMA and its benefits on various workloads.

    Xiaoyi Lu, Dipti Shankar, Shashank Gugnani, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Big Data, 2016

  • [IPDPS '16]  High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits.

    Dipti Shankar, Xiaoyi Lu, Nusrat S. Islam, Md. Wasi ur Rahman, Dhabaleswar K. Panda

    Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2016

  • [CloudCom '16]  Designing Virtualization-Aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-Enabled Clouds.

    Shashank Gugnani, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Cloud Computing Technology and Science, 2016

  • [Euro-Par '16]  Slurm-V: Extending Slurm for Building Efficient HPC Cloud with SR-IOV and IVShmem.

    Jie Zhang, Xiaoyi Lu, Sourav Chakraborty, Dhabaleswar K. Panda

    Proceedings of International European Conference on Parallel Processing, 2016

  • [ICPP '16]  High Performance MPI Library for Container-Based HPC Cloud on InfiniBand Clusters.

    Jie Zhang, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of International Conference on Parallel Processing, 2016

  • [SC '16]  Designing MPI library with on-demand paging (ODP) of infiniband: challenges and benefits.

    Mingzhe Li, Khaled Hamidouche, Xiaoyi Lu, Hari Subramoni, Jie Zhang, Dhabaleswar K. Panda

    Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis, 2016

  • [ICS '16]  High Performance Design for HDFS with Byte-Addressability of NVM and RDMA.

    Nusrat Sharmin Islam, Md. Wasi ur Rahman, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of International Conference on Supercomputing, 2016

  • [JSC '16]  Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters

    Dipti Shankar, Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dhabaleswar K. Panda

    The Journal of Supercomputing, 2016

    Abstract: With the emergence of high-performance data analytics, the Hadoop platform is being increasingly used to process data stored on high-performance computing clusters. While there is immense scope for improving the performance of Hadoop MapReduce (including the network-intensive shuffle phase) over these modern clusters, which are equipped with high-speed interconnects such as InfiniBand and 10/40 GigE and storage systems such as SSDs and Lustre, it is essential to study the MapReduce component in an isolated manner. In this paper, we study popular MapReduce workloads, obtained from well-accepted, comprehensive benchmark suites, to identify common shuffle data distribution patterns. We determine different environmental and workload-specific factors that affect the performance of the MapReduce job. Based on these characterization studies, we propose a micro-benchmark suite that can be used to evaluate the performance of stand-alone Hadoop MapReduce, and demonstrate its ease-of-use with different networks/protocols, Hadoop distributions, and storage architectures. Performance evaluations with our proposed micro-benchmarks show that stand-alone Hadoop MapReduce over IPoIB performs better than 10 GigE by about 13–15%.

  • [BPOE '16]  Characterizing Cloudera Impala Workloads with BigDataBench on InfiniBand Clusters

    Kunal Kulkarni, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of the 7th International Workshop on Big Data Benchmarks, Performance, Optimization, and Emerging Hardware (BPOE-7), 2016

  • [BigData '16]  Boldio: A hybrid and resilient burst-buffer over lustre for accelerating big data I/O.

    Dipti Shankar, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Big Data, 2016  (Short Paper)

  • [CloudCom '16]  Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase.

    Xiaoyi Lu, Dipti Shankar, Shashank Gugnani, Hari Subramoni, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Cloud Computing Technology and Science, 2016

  • [XSEDE '16]  Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet.

    Mahidhar Tatineni, Xiaoyi Lu, Dong Ju Choi, Amitava Majumdar, Dhabaleswar K. Panda

    Proceedings of Annual Conference on Extreme Science and Engineering Discovery Environment, 2016

  • [BigData '16]  Efficient data access strategies for Hadoop and Spark on HPC cluster with heterogeneous storage.

    Nusrat Sharmin Islam, Md. Wasi ur Rahman, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Big Data, 2016

  • [Chapter '16]  Accelerating Big Data Processing on Modern HPC Clusters

    Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dipti Shankar, Dhabaleswar K. (DK) Panda

    Conquering Big Data with High Performance Computing, 2016

    Abstract: Modern HPC systems and the associated middleware (such as MPI and parallel file systems) have been exploiting the advances in HPC technologies (multi-/many-core architecture, RDMA-enabled networking, and SSD) for many years. However, Big Data processing and management middleware have not fully taken advantage of such technologies. These disparities are taking HPC and Big Data processing into divergent trajectories. This chapter provides an overview of popular Big Data processing middleware, high-performance interconnects and storage architectures, and discusses the challenges in accelerating Big Data processing middleware by leveraging emerging technologies on modern HPC clusters. This chapter presents case studies of advanced designs based on RDMA and heterogeneous storage architecture that were proposed to address these challenges for multiple components of Hadoop (HDFS and MapReduce) and Spark. The advanced designs presented in the case studies are publicly available as a part of the High-Performance Big Data (HiBD) project. An overview of the HiBD project is also provided in this chapter. All these works aim to bring HPC and Big Data processing into a convergent trajectory.

  • [IPDPS Workshops '16]  Performance Characterization of Hypervisor-and Container-Based Virtualization for HPC on SR-IOV Enabled InfiniBand Clusters.

    Jie Zhang, Xiaoyi Lu, Dhabaleswar K. Panda

    Workshop Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2016

  • [ISC '16]  INAM2: InfiniBand Network Analysis and Monitoring with MPI.

    Hari Subramoni, Albert Mathews Augustine, Mark Daniel Arnold, Jonathan L. Perkins, Xiaoyi Lu, Khaled Hamidouche, Dhabaleswar K. Panda

    Proceedings of the International Supercomputing Conference, 2016

  • [HiPC '16]  Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA.

    Mingzhe Li, Xiaoyi Lu, Khaled Hamidouche, Jie Zhang, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on High Performance Computing, 2016

  • [PDSW-DISCS '16]  Can Non-volatile Memory Benefit MapReduce Applications on HPC Clusters?

    Md. Wasi ur Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, 2016

  • [CCGRID '15]  MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds.

    Jie Zhang, Xiaoyi Lu, Mark Daniel Arnold, Dhabaleswar K. Panda

    Proceedings of IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2015

  • [CLUSTER '15]  High Performance MPI Datatype Support with User-Mode Memory Registration: Challenges, Designs, and Benefits.

    Mingzhe Li, Hari Subramoni, Khaled Hamidouche, Xiaoyi Lu, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Cluster Computing, 2015

  • [CCGRID '15]  Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture.

    Nusrat Sharmin Islam, Xiaoyi Lu, Md. Wasi ur Rahman, Dipti Shankar, Dhabaleswar K. Panda

    Proceedings of IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2015

  • [BigData '15]  Benchmarking key-value stores on high-performance storage and interconnects for web-scale workloads.

    Dipti Shankar, Xiaoyi Lu, Md. Wasi ur Rahman, Nusrat S. Islam, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Big Data, 2015  (Short Paper)

  • [ICDCS '15]  Accelerating Apache Hive with MPI for Data Warehouse Systems.

    Lu Chao, Chundian Li, Fan Liang, Xiaoyi Lu, Zhiwei Xu

    Proceedings of IEEE International Conference on Distributed Computing Systems, 2015

  • [BigDataService '15]  Modeling and Designing Fault-Tolerance Mechanisms for MPI-Based MapReduce Data Computing Framework.

    Jian Lin, Fan Liang, Xiaoyi Lu, Li Zha, Zhiwei Xu

    Proceedings of IEEE International Conference on Big Data Computing Service and Applications, 2015  (Short Paper)

  • [JCST '15]  Accelerating Iterative Big Data Computing Through MPI

    Fan Liang, Xiaoyi Lu

    Journal of Computer Science and Technology, 2015

    Abstract: Current popular systems, Hadoop and Spark, cannot achieve satisfactory performance when running iterative big data applications, because of the inefficient overlapping of computation and communication. The pipeline of computing, data movement, and data management plays a key role in current distributed data computing systems. In this paper, we first analyze the overhead of the shuffle operation in Hadoop and Spark when running the PageRank workload, and then propose DataMPI-Iteration, an MPI-based library for iterative big data computing, with an event-driven pipeline and in-memory shuffle design that better overlaps computation and communication. Our performance evaluation shows DataMPI-Iteration can achieve 9X∼21X speedup over Apache Hadoop, and 2X∼3X speedup over Apache Spark for PageRank and K-means.

  • [IPDPS Workshops '15]  High-Performance Coarray Fortran Support with MVAPICH2-X: Initial Experience and Evaluation.

    Jian Lin, Khaled Hamidouche, Xiaoyi Lu, Mingzhe Li, Dhabaleswar K. Panda

    Workshop Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2015

  • [OpenSHMEM '15]  Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM.

    Jian Lin, Khaled Hamidouche, Jie Zhang, Xiaoyi Lu, Abhinav Vishnu, Dhabaleswar K. Panda

    Proceedings of Workshop on OpenSHMEM and Related Technologies, 2015

  • [Euro-Par '15]  High-Performance and Scalable Design of MPI-3 RMA on Xeon Phi Clusters.

    Mingzhe Li, Khaled Hamidouche, Xiaoyi Lu, Jian Lin, Dhabaleswar K. Panda

    Proceedings of International European Conference on Parallel Processing, 2015

  • [HiPC '15]  High Performance OpenSHMEM Strided Communication Support with InfiniBand UMR.

    Mingzhe Li, Khaled Hamidouche, Xiaoyi Lu, Jie Zhang, Jian Lin, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on High Performance Computing, 2015

  • [BPOE '15]  A Plugin-Based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS.

    Adithya Bhat, Nusrat Sharmin Islam, Xiaoyi Lu, Md. Wasi ur Rahman, Dipti Shankar, Dhabaleswar K. Panda

    Proceedings of International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2015

  • [IPDPS '15]  High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA.

    Md. Wasi ur Rahman, Xiaoyi Lu, Nusrat Sharmin Islam, Raghunath Rajachandrasekar, Dhabaleswar K. Panda

    Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2015

  • [ICPP '15]  Accelerating I/O Performance of Big Data Analytics on HPC Clusters through RDMA-Based Key-Value Store

    Nusrat Sharmin Islam, Dipti Shankar, Xiaoyi Lu, Md. Wasi ur Rahman, Dhabaleswar K. Panda

    Proceedings of International Conference on Parallel Processing, 2015

  • [ISPASS '15]  Can RDMA benefit online data processing workloads on memcached and MySQL?

    Dipti Shankar, Xiaoyi Lu, Jithin Jose, Md. Wasi ur Rahman, Nusrat S. Islam, Dhabaleswar K. Panda

    Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software, 2015  (Poster Paper)

  • [BigData '15]  Performance characterization and acceleration of in-memory file systems for Hadoop and Spark applications on HPC clusters

    Nusrat Sharmin Islam, Md. Wasi ur Rahman, Xiaoyi Lu, Dipti Shankar, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Big Data, 2015

  • [BigData '14]  In-memory I/O and replication for HDFS with Memcached: Early experiences

    Nusrat Sharmin Islam, Xiaoyi Lu, Md. Wasi ur Rahman, Raghunath Rajachandrasekar, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Big Data, 2014  (Short Paper)

  • [PGAS '14]  Scalable MiniMD Design with Hybrid MPI and OpenSHMEM

    Mingzhe Li, Jian Lin, Xiaoyi Lu, Khaled Hamidouche, Karen Tomko, Dhabaleswar K. Panda

    Proceedings of International Conference on Partitioned Global Address Space Programming Models, 2014

  • [PPoPP '14]  Initial study of multi-endpoint runtime for MPI+OpenMP hybrid programming model on multi-core systems

    Miao Luo, Xiaoyi Lu, Khaled Hamidouche, Krishna Chaitanya Kandalla, Dhabaleswar K. Panda

    Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014  (Poster Paper)

  • [Euro-Par '14]  Can Inter-VM Shmem Benefit MPI Applications on SR-IOV Based Virtualized InfiniBand Clusters?

    Jie Zhang, Xiaoyi Lu, Jithin Jose, Rong Shi, Dhabaleswar K. Panda

    Proceedings of International European Conference on Parallel Processing, 2014

  • [HiPC '14]  High performance MPI library over SR-IOV enabled InfiniBand clusters

    Jie Zhang, Xiaoyi Lu, Jithin Jose, Mingzhe Li, Rong Shi, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on High Performance Computing, 2014

  • [NAS '14]  Performance Characterization of Hadoop and DataMPI Based on Amdahl's Second Law

    Fan Liang, Chen Feng, Xiaoyi Lu, Zhiwei Xu

    Proceedings of IEEE International Conference on Networking, Architecture, and Storage, 2014

  • [HotI '14]  Accelerating Spark with RDMA for Big Data Processing: Early Experiences

    Xiaoyi Lu, Md. Wasi ur Rahman, Nusrat S. Islam, Dipti Shankar, Dhabaleswar K. Panda

    Proceedings of Annual Symposium on High-Performance Interconnects, 2014

  • [CLUSTER '14]  High performance OpenSHMEM for Xeon Phi clusters: Extensions, runtime designs and application co-design  (Best Paper Award Nomination)

    Jithin Jose, Khaled Hamidouche, Xiaoyi Lu, Sreeram Potluri, Jie Zhang, Karen Tomko, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Cluster Computing, 2014

  • [HPDC '14]  SOR-HDFS: a SEDA-based approach to maximize overlapping in RDMA-enhanced HDFS

    Nusrat S. Islam, Xiaoyi Lu, Md. Wasi ur Rahman, Dhabaleswar K. Panda

    Proceedings of International ACM Symposium on High Performance and Distributed Computing, 2014  (Short Paper)

  • [ICPP '14]  HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters

    Rong Shi, Xiaoyi Lu, Sreeram Potluri, Khaled Hamidouche, Jie Zhang, Dhabaleswar K. Panda

    Proceedings of International Conference on Parallel Processing, 2014

  • [BPOE '14]  On Big Data Benchmarking

    Rui Han, Xiaoyi Lu, Jiangtao Xu

    Proceedings of International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2014

    Abstract: Big data systems address the challenges of capturing, storing, managing, analyzing, and visualizing big data. Within this context, developing benchmarks to evaluate and compare big data systems has become an active topic in both the research and industry communities. To date, most state-of-the-art big data benchmarks are designed for specific types of systems. Based on our experience, however, we argue that given the complexity, diversity, and rapid evolution of big data systems, big data benchmarks must, for the sake of fairness, cover a diversity of data and workloads. With this motivation, we first propose the key requirements and challenges in developing big data benchmarks, from the perspectives of generating data with the 4 V properties of big data (i.e., volume, velocity, variety, and veracity) and of generating tests with comprehensive workloads for big data systems. We then present a benchmarking methodology designed to address these challenges. Finally, the state of the art is summarized and compared, followed by our vision for future research directions.
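
    Illustrative sketch: the 4 V parameterization above can be made concrete with a toy generator. Every knob below is hypothetical and only shows the shape of such an interface; it is not the paper's generator:

        # Hypothetical toy: a data generator parameterized by the 4 Vs.
        import random
        import time

        def generate(volume, velocity, schemas, noise):
            """Yield `volume` records (volume) at roughly `velocity`
            records/s, mixing `schemas` (variety) and nulling one field
            with probability `noise` (veracity)."""
            for _ in range(volume):
                schema = random.choice(schemas)         # variety
                rec = {field: random.random() for field in schema}
                if random.random() < noise:             # veracity
                    rec[random.choice(schema)] = None   # dirty record
                yield rec
                time.sleep(1.0 / velocity)              # velocity throttle

        # Example: 1000 rows at 100 rows/s, two schemas, 5% dirty records.
        for rec in generate(1000, 100, [("a", "b"), ("a", "b", "c")], 0.05):
            pass                                        # feed the system under test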

  • [BPOE '14]  Performance Benefits of DataMPI: A Case Study with BigDataBench

    Fan Liang, Chen Feng, Xiaoyi Lu, Zhiwei Xu

    Proceedings of International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2014

  • [ICPP '14]  Performance Modeling for RDMA-Enhanced Hadoop MapReduce

    Md. Wasi ur Rahman, Xiaoyi Lu, Nusrat Sharmin Islam, Dhabaleswar K. Panda

    Proceedings of International Conference on Parallel Processing, 2014

  • [CLUSTER '14]  Scalable Graph500 design with MPI-3 RMA

    Mingzhe Li, Xiaoyi Lu, Sreeram Potluri, Khaled Hamidouche, Jithin Jose, Karen Tomko, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Cluster Computing, 2014

  • [Euro-Par '14]  MapReduce over Lustre: Can RDMA-Based Approach Benefit?

    Md. Wasi ur Rahman, Xiaoyi Lu, Nusrat Sharmin Islam, Raghunath Rajachandrasekar, Dhabaleswar K. Panda

    Proceedings of International European Conference on Parallel Processing, 2014

  • [IPDPS '14]  DataMPI: Extending MPI to Hadoop-Like Big Data Computing

    Xiaoyi Lu, Fan Liang, Bing Wang, Li Zha, Zhiwei Xu

    Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2014

  • [ICS '14]  HOMR: a hybrid approach to exploit maximum overlapping in MapReduce over high performance interconnects

    Md. Wasi ur Rahman, Xiaoyi Lu, Nusrat Sharmin Islam, Dhabaleswar K. Panda

    Proceedings of International Conference on Supercomputing, 2014

  • [BPOE '14]  A Micro-benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks

    Dipti Shankar, Xiaoyi Lu, Md. Wasi ur Rahman, Nusrat S. Islam, Dhabaleswar K. Panda

    Proceedings of International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2014

  • [PGAS '14]  Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models

    Jithin Jose, Sreeram Potluri, Hari Subramoni, Xiaoyi Lu, Khaled Hamidouche, Karl W. Schulz, Hari Sundar, Dhabaleswar K. Panda

    Proceedings of International Conference on Partitioned Global Address Space Programming Models, 2014

  • [WBDB '13]  A Micro-benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks

    Xiaoyi Lu, Md. Wasi ur Rahman, Nusrat Sharmin Islam, Dhabaleswar K. Panda

    Proceedings of the 3rd Workshop on Big Data Benchmarking, 2013

  • [IPDPS Workshops '13]  High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand

    Md. Wasi ur Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Jithin Jose, Hari Subramoni, Hao Wang, Dhabaleswar K. Panda

    Workshop Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2013

  • [SoCC '13]  Does RDMA-based enhanced Hadoop MapReduce need a new performance model?

    Md. Wasi ur Rahman, Xiaoyi Lu, Nusrat S. Islam, Dhabaleswar K. Panda

    Proceedings of ACM Symposium on Cloud Computing, 2013  (Poster Paper)

  • [HotI '13]  Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?

    Nusrat S. Islam, Xiaoyi Lu, Md. Wasi ur Rahman, Dhabaleswar K. Panda

    Proceedings of Annual Symposium on High-Performance Interconnects, 2013  (Short Paper)

  • [ICPP '13]  High-Performance Design of Hadoop RPC with RDMA over InfiniBand

    Xiaoyi Lu, Nusrat S. Islam, Md. Wasi ur Rahman, Jithin Jose, Hari Subramoni, Hao Wang, Dhabaleswar K. Panda

    Proceedings of International Conference on Parallel Processing, 2013

  • [CCGRID '13]  SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience  (Best Presentation Award Nomination)

    Jithin Jose, Mingzhe Li, Xiaoyi Lu, Krishna Chaitanya Kandalla, Mark Daniel Arnold, Dhabaleswar K. Panda

    Proceedings of IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2013

  • [CLUSTER '13]  A scalable and portable approach to accelerate hybrid HPL on heterogeneous CPU-GPU clusters  (Best Student Paper Award)

    Rong Shi, Sreeram Potluri, Khaled Hamidouche, Xiaoyi Lu, Karen Tomko, Dhabaleswar K. Panda

    Proceedings of IEEE International Conference on Cluster Computing, 2013

  • [WBDB '12]  A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters

    Nusrat Sharmin Islam, Xiaoyi Lu, Md. Wasi ur Rahman, Jithin Jose, Dhabaleswar K. Panda

    Proceedings of the Workshop on Big Data Benchmarking, 2012

  • [ISPA '11]  Vega LingCloud: A Resource Single Leasing Point System to Support Heterogeneous Application Modes on Shared Infrastructure  (Best Paper Award)

    Xiaoyi Lu, Jian Lin, Li Zha, Zhiwei Xu

    Proceedings of IEEE International Symposium on Parallel and Distributed Processing with Applications, 2011

  • [ICPP Workshops '11]  Can MPI Benefit Hadoop and MapReduce Applications?

    Xiaoyi Lu, Bing Wang, Li Zha, Zhiwei Xu

    Workshop Proceedings of International Conference on Parallel Processing, 2011

  • [NAS '10]  VegaWarden: A Uniform User Management System for Cloud Applications

    Jian Lin, Xiaoyi Lu, Lin Yu, Yongqiang Zou, Li Zha

    Proceedings of IEEE International Conference on Networking, Architecture, and Storage, 2010

  • [SERVICES '10]  Investigating, Modeling, and Ranking Interface Complexity of Web Services on the World Wide Web

    Xiaoyi Lu, Jian Lin, Yongqiang Zou, Juan Peng, Xingwu Liu, Li Zha

    Proceedings of World Congress on Services, 2010

  • [NPC '10]  JAMILA: A Usable Batch Job Management System to Coordinate Heterogeneous Clusters and Diverse Applications over Grid or Cloud Infrastructure

    Juan Peng, Xiaoyi Lu, Boqun Cheng, Li Zha

    Proceedings of IFIP International Conference on Network and Parallel Computing, 2010

  • [SERVICES '09]  A Model of Message-Based Debugging Facilities for Web or Grid Services

    Qiang Yue, Xiaoyi Lu, Zhiguang Shan, Zhiwei Xu, Haiyan Yu, Li Zha

    Proceedings of World Congress on Services, 2009

  • [PDCAT '09]  ICOMC: Invocation Complexity Of Multi-Language Clients for Classified Web Services and its Impact on Large Scale SOA Applications

    Xiaoyi Lu, Yongqiang Zou, Fei Xiong, Jian Lin, Li Zha

    Proceedings of International Conference on Parallel and Distributed Computing, Applications, and Technologies, 2009

  • [PDCAT '08]  An Experimental Analysis for Memory Usage of GOS Core

    Xiaoyi Lu, Qiang Yue, Yongqiang Zou, Xiaoning Wang

    Proceedings of International Conference on Parallel and Distributed Computing, Applications, and Technologies, 2008  (Short Paper)
