Proceedings of the 32nd IEEE International Conference on Network Protocols (ICNP), 2024 (Acceptance Rate: 50/205=24%)
Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2024
Abstract: With the ever-increasing computing power of supercomputers and the growing scale of scientific applications, the efficiency of MPI collective communications turns out to be a critical bottleneck in large-scale distributed and parallel processing. The large message size in MPI collectives is particularly concerning because it can significantly degrade the overall parallel performance. To address this issue, prior research simply applies off-the-shelf fixed-rate lossy compressors in the MPI collectives, leading to suboptimal performance, limited generalizability, and unbounded errors. In this paper, we propose a novel solution, called C-Coll, which leverages error-bounded lossy compression to significantly reduce the message size, resulting in a substantial reduction in communication cost. The key contributions are three-fold. (1) We develop two general, optimized lossy-compression-based frameworks for both types of MPI collectives (collective data movement as well as collective computation), based on their particular characteristics. Our frameworks not only reduce communication cost but also preserve data accuracy. (2) We customize SZx, an ultra-fast error-bounded lossy compressor, to meet the specific needs of collective communication. (3) We integrate C-Coll into multiple collectives, such as MPI_Allreduce, MPI_Scatter, and MPI_Bcast, and perform a comprehensive evaluation based on real-world scientific datasets. Experiments show that our solution outperforms the original MPI collectives as well as multiple baselines and related efforts by 1.8-2.7X.
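Illustrative note (not from the paper): the abstract contrasts fixed-rate compression, whose error is unbounded, with error-bounded compression. The sketch below shows the error-bounded idea with a plain uniform quantizer. It is not SZx and not the C-Coll implementation; the function names `compress` and `decompress` and the sample values are assumptions made only for this example.

```python
def compress(values, error_bound):
    """Map each float to an integer bin index; bins are 2*error_bound wide."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def decompress(codes, error_bound):
    """Reconstruct each value as the center of its bin."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


if __name__ == "__main__":
    data = [0.0, 1.2345, -3.14159, 100.001]  # hypothetical sample values
    eb = 1e-2                                # user-chosen absolute error bound
    restored = decompress(compress(data, eb), eb)
    worst = max(abs(a - b) for a, b in zip(data, restored))
    # Rounding to the nearest bin center keeps every error within half a bin,
    # i.e. within the error bound (up to floating-point rounding).
    print("max abs error:", worst, "bound:", eb)
```

Because each value is rounded to the nearest bin center, the reconstruction error never exceeds half the bin width, which is the property a fixed-rate scheme cannot guarantee.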
IEEE Micro, 2024
Keywords: training;parallel processing;data models;computational modeling;decoding;tcpip;synchronization;high-speed networks;large language models
Abstract: Large language models (LLMs) like Generative Pre-trained Transformer, Bidirectional Encoder Representations from Transformers, and T5 are pivotal in natural language processing. Their distributed training is influenced by high-speed interconnects. This article characterizes their training performance across various interconnects and communication protocols: TCP/IP, Internet Protocol over InfiniBand (IPoIB), and Remote Direct Memory Access (RDMA), using data and model parallelism. RDMA-100 Gbps outperforms IPoIB-100 Gbps and TCP/IP-10 Gbps, with average gains of 2.5x and 4.8x in data parallelism, while in model parallelism, the gains were 1.1x and 1.2x. RDMA achieves the highest interconnect utilization (up to 60 Gbps), compared to IPoIB with up to 20 Gbps and TCP/IP with up to 9 Gbps. Larger models demand increased communication bandwidth, with AllReduce in data parallelism consuming up to 91
IEEE Micro, 2024
Keywords: engines;task analysis;data compression;throughput;memory management;hardware acceleration;distributed databases;data processing;smart devices;system-on-chip
Abstract: A data processing unit (DPU) with programmable smart network interface card containing system-on-chip (SoC) cores is now a valuable addition to the host CPU, finding use in high-performance computing (HPC) and data center clusters for its advanced features, notably, a hardware-based data compression engine (C-engine). With the convergence of big data, HPC, and machine learning, data volumes burden communication and storage, making efficient compression vital. This positions DPUs as tools to accelerate compression workloads and enhance data-intensive applications. This article characterizes lossy (e.g., SZ3) and lossless (e.g., DEFLATE, lz4, and zlib) compression algorithms using seven real-world datasets on Nvidia BlueField-2/-3 DPUs. We explore the potential opportunities for offloading these compression workloads from the host. Our findings demonstrate that the C-engine within the DPU can achieve up to 26.8x speedup compared to its SoC core. We also provide insights on harnessing BlueField for compression, presenting seven crucial takeaways to steer future compression research with DPUs.
Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2024 (Best Paper Award Nomination)
Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 2024
Keywords: Code Transformation, Compiler, Datapath Protection, High-Performance Computing (HPC), Reliability, Soft Errors
Abstract: With the ongoing reduction in technology sizes and voltage levels, modern microprocessors are increasingly susceptible to soft errors, corrupting datapath units during program execution. While these error types have received considerable attention recently, existing solutions either confine themselves to limited scopes or incur massive overheads in performance and power consumption, hindering practical usage. In this work, we propose CONDA, a novel error detection technique based on code transformation and static program analysis, achieving versatile datapath protection at low cost. At compile time, CONDA analyzes program characteristics and transforms the original program code without complicating its control-flow and memory access patterns. At runtime, CONDA detects datapath errors with low overhead and latency. The evaluation of 38 benchmarks and a parallel HPC simulation reveals that CONDA only incurs 57.79
Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 2024
Keywords: Collective Communication, Distributed Computing, Homomorphic Compression, Parallel Algorithm
Abstract: As network bandwidth struggles to keep up with rapidly growing computing capabilities, the efficiency of collective communication has become a critical challenge for exa-scale distributed and parallel applications. Traditional approaches directly utilize error-bounded lossy compression to accelerate collective computation operations, exposing unsatisfying performance due to the expensive decompression-operation-compression (DOC) workflow. To address this issue, we present a first-ever homomorphic compression-communication co-design, hZCCL, which enables operations to be performed directly on compressed data, saving the cost of time-consuming decompression and recompression. In addition to the co-design framework, we build a light-weight compressor, optimized specifically for multi-core CPU platforms. We also present a homomorphic compressor with a run-time heuristic to dynamically select efficient compression pipelines for reducing the cost of DOC handling. We evaluate hZCCL with up to 512 nodes and across five application datasets. The experimental results demonstrate that our homomorphic compressor achieves a CPU throughput of up to 379.08 GB/s, surpassing the conventional DOC workflow by up to 36.53×. Moreover, our hZCCL-accelerated collectives outperform two state-of-the-art baselines, delivering speedups of up to 2.12× and 6.77× compared to original MPI collectives in single-thread and multi-thread modes, respectively, while maintaining data accuracy.
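Illustrative note (not from the paper): operating directly on compressed data, instead of running the decompression-operation-compression (DOC) workflow, can be pictured with a toy uniform quantizer whose integer codes can be summed as they are. This is only a sketch of the general idea under that assumption; it is not the hZCCL compressor, and `quantize`, `dequantize`, and `add_compressed` are hypothetical names.

```python
def quantize(values, error_bound):
    """Toy compressor: integer bin index per value, bins 2*error_bound wide."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def dequantize(codes, error_bound):
    """Toy decompressor: bin index back to the bin center."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


def add_compressed(codes_a, codes_b):
    """Element-wise sum performed on the integer codes, with no decompression."""
    return [a + b for a, b in zip(codes_a, codes_b)]


if __name__ == "__main__":
    eb = 1e-2
    a = [1.234, -0.5, 7.0]    # hypothetical contribution of one process
    b = [0.111, 2.0, -7.25]   # hypothetical contribution of another process
    summed = dequantize(add_compressed(quantize(a, eb), quantize(b, eb)), eb)
    exact = [x + y for x, y in zip(a, b)]
    # Each operand carries at most eb of error, so the sum carries at most 2*eb.
    print(max(abs(s - e) for s, e in zip(summed, exact)))
```

With this quantizer the summed codes are themselves valid codes, so a reduction can stay in the compressed domain; the error bound grows with the number of operands, which is why such schemes need to track error accumulation.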
Proceedings of the 38th International Conference on Supercomputing (ICS), 2024
Abstract: GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. A traditional approach is to directly integrate lossy compression into GPU-aware collectives, which can lead to serious performance issues such as underutilized GPU devices and uncontrolled data distortion. In order to address these issues, in this paper, we propose gZCCL, a first-ever general framework that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation. To validate our framework, we evaluate the performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our gZCCL-accelerated collectives, including both collective computation (Allreduce) and collective data movement (Scatter), can outperform NCCL as well as Cray MPI by up to 4.5X and 28.7X, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.
Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2024
Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2024
Proceedings of the 30th IEEE Hot Interconnects Symposium, 2023
2023 Workshop on Modeling & Simulation of Systems and Applications (ModSim), 2023 (Poster Paper)
Machine Learning for Systems 2023 (MlSys Workshop at NeurIPS 2023), 2023
Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2023
Keywords: Group testing, Bayesian, Lattices, Spark, COVID-19
Abstract: The COVID-19 pandemic underscored the necessity for disease surveillance using group testing. Novel Bayesian methods using lattice models were proposed, which offer substantial improvements in group testing efficiency by precisely quantifying uncertainty in diagnoses, acknowledging varying individual risk and dilution effects, and guiding optimally convergent sequential pooled test selections using a Bayesian Halving Algorithm. Computationally, however, Bayesian group testing poses considerable challenges as computational complexity grows exponentially with sample size. This can lead to shortcomings in reaching a desirable scale without practical limitations. We propose a new framework for scaling Bayesian group testing based on Spark: SBGT. We show that SBGT is lightning fast and highly scalable. In particular, SBGT is up to 376x, 1733x, and 1523x faster than the state-of-the-art framework in manipulating lattice models, performing test selections, and conducting statistical analyses, respectively, while achieving up to 97.9
Journal of Computer Science and Technology, 2023 (Invited Paper for the Special Issue in Honor of Professor Kai Hwang’s 80th Birthday)
Keywords: collective; deep learning; distributed training; GPUDirect; RDMA (remote direct memory access)
Abstract: Machine learning techniques have become ubiquitous both in industry and academic applications. Increasing model sizes and training data volumes necessitate fast and efficient distributed training approaches. Collective communications greatly simplify inter- and intra-node data transfer and are an essential part of the distributed training process as information such as gradients must be shared between processing nodes. In this paper, we survey the current state-of-the-art collective communication libraries (namely xCCL, including NCCL, oneCCL, RCCL, MSCCL, ACCL, and Gloo), with a focus on the industry-led ones for deep learning workloads. We investigate the design features of these xCCLs, discuss their use cases in the industry deep learning workloads, compare their performance with industry-made benchmarks (i.e., NCCL Tests and PARAM), and discuss key take-aways and interesting observations. We believe our survey sheds light on potential research directions of future designs for xCCLs.
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023 (Research Poster Paper)
Keywords: LLM, RDMA, Performance
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023 (Research Poster Paper)
Keywords: OpenSHMEM, PGAS, Omni-Path, Performance
2023 Workshop on Modeling & Simulation of Systems and Applications (ModSim), 2023
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023 (Research Poster Paper)
Keywords: Storage Disaggregation, NVMe-over-Fabric, SPDK
Proceedings of the 30th IEEE Hot Interconnects Symposium, 2023
The 1st International Workshop on Hot Topics in System Infrastructure (HotInfra), in conjunction with International Symposium on Computer Architecture (ISCA), 2023
The MIT Press, 2022
Abstract: Over the last decade, the exponential explosion of data known as big data has changed the way we understand and harness the power of data. The emerging field of high-performance big data computing, which brings together high-performance computing (HPC), big data processing, and deep learning, aims to meet the challenges posed by large-scale data processing. This book offers an in-depth overview of high-performance big data computing and the associated technical issues, approaches, and solutions. The book covers basic concepts and necessary background knowledge, including data processing frameworks, storage systems, and hardware capabilities; offers a detailed discussion of technical issues in accelerating big data computing in terms of computation, communication, memory and storage, codesign, workload characterization and benchmarking, and system deployment and management; and surveys benchmarks and workloads for evaluating big data middleware systems. It presents a detailed discussion of big data computing systems and applications with high-performance networking, computing, and storage technologies, including state-of-the-art designs for data processing and storage systems. Finally, the book considers some advanced research topics in high-performance big data computing, including designing high-performance deep learning over big data (DLoBD) stacks and HPC cloud technologies.
Proc. VLDB Endow., 2022
Abstract: To allow performance comparison across different systems, our community has developed multiple benchmarks, such as TPC-C and YCSB, which are widely used. However, despite such effort, interpreting and comparing performance numbers is still a challenging task, because one can tune benchmark parameters, system features, and hardware settings, which can lead to very different system behaviors. Such tuning creates a long-standing question of whether the conclusion of a work can hold under different settings. This work tries to shed light on this question by reproducing 11 works evaluated under TPC-C and YCSB, measuring their performance under a wider range of settings, and investigating the reasons for the change of performance numbers. By doing so, this paper tries to motivate the discussion about whether and how we should address this problem. While this paper does not give a complete solution, which is beyond the scope of a single paper, it proposes concrete suggestions we can take to improve the state of the art.
Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, 2022
Keywords: nvme-over-fabrics, shared memory, hpc cloud, spdk
Abstract: Applications running inside containers or virtual machines traditionally use TCP/IP for communication in HPC clouds and data centers. The TCP/IP path usually becomes a major performance bottleneck for applications performing NVMe-over-Fabrics (NVMe-oF) based I/O operations in disaggregated storage settings. We propose an adaptive communication channel, called NVMe-over-Adaptive-Fabric (NVMe-oAF), that applications could leverage to eliminate the high latency and low bandwidth incurred by remote I/O requests over TCP/IP. NVMe-oAF accelerates I/O intensive applications using locality awareness along with optimized shared memory and TCP/IP paths. The adaptiveness of the fabric stems from the ability to adaptively select the shared memory or TCP channel and further apply optimizations for the chosen channel. To evaluate NVMe-oAF, we co-design Intel's SPDK library with our designs and show up to 7.1x bandwidth improvement and up to 4.2x latency reduction for various workloads over commodity TCP/IP-based Ethernet networks (e.g., 10Gbps, 25Gbps, and 100Gbps). We achieve similar (or sometimes better) performance when compared to NVMe-over-RDMA by avoiding the cumbersome management of RDMA in HPC cloud environments. Finally, we also co-design NVMe-oAF with H5bench to showcase the benefit it brings to HDF5 applications. Our evaluation indicates up to a 7x bandwidth improvement when compared with the network file system (NFS).
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022 (Acceptance Rate: 26%, 35/131)
Biostatistics, 2022 (kxac004)
Abstract: A Bayesian framework for group testing under dilution effects has been developed, using lattice-based models. This work has particular relevance given the pressing public health need to enhance testing capacity for coronavirus disease 2019 and future pandemics, and the need for wide-scale and repeated testing for surveillance under constantly varying conditions. The proposed Bayesian approach allows for dilution effects in group testing and for general test response distributions beyond just binary outcomes. It is shown that even under strong dilution effects, an intuitive group testing selection rule that relies on the model order structure, referred to as the Bayesian halving algorithm, has attractive optimal convergence properties. Analogous look-ahead rules that can reduce the number of stages in classification by selecting several pooled tests at a time are proposed and evaluated as well. Group testing is demonstrated to provide great savings over individual testing in the number of tests needed, even for moderately high prevalence levels. However, there is a trade-off with a higher number of testing stages and increased variability. A web-based calculator is introduced to assist in weighing these factors and to guide decisions on when and how to pool under various conditions. High-performance distributed computing methods have also been implemented for considering larger pool sizes, when savings from group testing can be even more dramatic.
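Illustrative note (not from the paper): the claim that pooling saves tests at low prevalence can be checked with the classic two-stage Dorfman scheme, in which a pool is tested once and every member is retested only if the pool is positive. This is a much simpler rule than the Bayesian halving algorithm described above, and it assumes a perfect test with no dilution effects; the function name and the prevalence values are chosen only for this example.

```python
def dorfman_tests_per_person(prevalence, pool_size):
    """Expected tests per individual under two-stage Dorfman pooling.

    One pooled test is shared by pool_size people; if the pool is positive
    (probability 1 - (1 - p)**k), each member is retested individually.
    Assumes independent infections and an error-free, dilution-free test.
    """
    p_pool_positive = 1.0 - (1.0 - prevalence) ** pool_size
    return 1.0 / pool_size + p_pool_positive


if __name__ == "__main__":
    for p in (0.01, 0.05, 0.5):  # hypothetical prevalence levels
        print(p, round(dorfman_tests_per_person(p, 10), 4))
```

A value below 1 means pooling needs fewer tests than testing everyone individually; as prevalence rises, most pools turn positive and the value climbs above 1, so pooling stops paying off.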
BenchCouncil Transactions on Benchmarks, Standards and Evaluations, 2022
Keywords: Benchmarks, Interconnects, RDMA
Abstract: Understanding the designs and performance characterizations of hot interconnects on modern data center and high-performance computing (HPC) clusters has been a fruitful research topic in recent years. The rapid and continuous growth of high-bandwidth and low-latency communication requirements for various types of data center and HPC applications (such as big data, deep learning, and microservices) has been pushing the envelope of advanced interconnect designs. We believe this is high time to investigate the performance characterizations of representative hot interconnects with different benchmarks. Hence, this paper presents an extensive survey of state-of-the-art hot interconnects on data center and HPC clusters and the associated representative benchmarks to help the community to better understand modern interconnects. In addition, we characterize these interconnects by the related benchmarks under different application scenarios. We provide our perspectives on benchmarking data center interconnects based on our survey, experiments, and results.
Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2022
Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2021
Keywords: Checkpoint/Restart, NVMe, NVMf, Exascale
Abstract: Emerging SSDs with NVMe-over-Fabrics (NVMf) support provide new opportunities to significantly improve the performance of IO-intensive HPC applications. However, state-of-the-art parallel filesystems cannot extract the best possible performance from fast NVMe SSDs and are not designed for latency-critical ephemeral IO tasks, such as checkpoint/restart. In this paper, we propose a powerful abstraction called microfs to peel away unnecessary software layers and eliminate namespace coordination. Building upon this abstraction, we present the design of NVMe-CR, a scalable ephemeral storage runtime for clusters with disaggregated compute and storage. NVMe-CR proposes techniques like metadata provenance, log record coalescing, and logically isolated shared device access, built around the microfs abstraction, to reduce the overhead of writing millions of concurrent checkpoint files. NVMe-CR utilizes high-density all-flash arrays accessible via NVMf to absorb bursty checkpoint IO and increase the progress rates of HPC applications obliviously. Using the ECP CoMD application as a use case, results show that on a local cluster our runtime can achieve near perfect (> 0.96) efficiency at 448 processes. Moreover, our designs can reduce checkpoint overhead by as much as 2x compared to state-of-the-art storage systems.
The Second Workshop On Resource Disaggregation and Serverless (WORDS’21), co-located with ASPLOS 2021, 2021 (Vision Paper)
Keywords: Offloadable, Migratable, Microservice, Disaggregated Architectures
Proceedings of International ACM Symposium on High Performance and Distributed Computing, 2021
Keywords: Tailless, Quiescent-Free, Object Store, PMEM
Proceedings of the VLDB Endowment, 2021
Abstract: High capacity persistent memory (PMEM) is finally commercially available in the form of Intel’s Optane DC Persistent Memory Module (DCPMM). Researchers have raced to evaluate and understand the performance of DCPMM itself as well as systems and applications designed to leverage PMEM resulting from over a decade of research. Early evaluations of DCPMM show that its behavior is more nuanced and idiosyncratic than previously thought. Several assumptions made about its performance that guided the design of PMEM-enabled systems have been shown to be incorrect. Unfortunately, several peculiar performance characteristics of DCPMM are related to the memory technology (3D-XPoint) used and its internal architecture. It is expected that other technologies (such as STT-RAM, memristor, ReRAM, NVDIMM), with highly variable characteristics, will be commercially shipped as PMEM in the near future. Current evaluation studies fail to understand and categorize the idiosyncratic behavior of PMEM; i.e., how the peculiarities of DCPMM relate to other classes of PMEM. Clearly, there is a need for a study which can guide the design of systems and is agnostic to PMEM technology and internal architecture. In this paper, we first list and categorize the idiosyncratic behavior of PMEM by performing targeted experiments with our proposed PMIdioBench benchmark suite on a real DCPMM platform. Next, we conduct detailed studies to guide the design of storage systems, considering generic PMEM characteristics. The first study guides data placement on NUMA systems with PMEM while the second study guides the design of lock-free data structures, for both eADR- and ADR-enabled PMEM systems. Our results are often counter-intuitive and highlight the challenges of system design with PMEM.
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021
2021 IEEE/ACM Symposium on Edge Computing (SEC), 2021
Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2020
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2020
Keywords: next generation networking, data storage systems, persistent memory, key-value stores, RDMA
Abstract: Byte-addressable persistent memory (PMEM) can be directly manipulated by Remote Direct Memory Access (RDMA) capable networks. However, existing studies to combine RDMA and PMEM cannot deliver the desired performance due to their PMEM-oblivious communication protocols. In this paper, we propose novel PMEM-aware RDMA-based communication protocols for persistent key-value stores, referred to as Remote Direct Memory Persistence based Key-Value stores (RDMP-KV). RDMP-KV employs a hybrid 'server-reply/server-bypass' approach to 'durably' store individual key-value objects on PMEM-equipped servers. RDMP-KV's runtime can easily adapt to existing (server-assisted durability) and emerging (appliance durability) RDMA-capable interconnects, while ensuring server scalability through a lightweight consistency scheme. Performance evaluations show that RDMP-KV can improve the server-side performance with different persistent key-value storage architectures by up to 22x, as compared with PMEM-oblivious RDMA-'Server-Reply' protocols. Our evaluations also show that RDMP-KV outperforms a distributed PMEM-based filesystem by up to 65
Journal of Computer Science and Technology, 2020
Keywords: CirroData;high performance;SQL-on-Hadoop;online analytical processing (OLAP);Big Data
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2020
Keywords: erasure coding, fault tolerance, in-network computing, next generation networking
Abstract: Erasure coding (EC) is a promising fault tolerance scheme that has been applied to many well-known distributed storage systems. The capability of Coherent EC Calculation and Networking on modern SmartNICs has demonstrated that EC will be an essential feature of in-network computing. In this paper, we propose a set of coherent in-network EC primitives, named INEC. Our analyses based on the proposed α-β performance model demonstrate that INEC primitives can enable different kinds of EC schemes to fully leverage the EC offload capability on modern SmartNICs. We implement INEC on commodity RDMA NICs and integrate it into five state-of-the-art EC schemes. Our experiments show that INEC primitives significantly reduce 50th, 95th, and 99th percentile latencies, and accelerate the end-to-end throughput, write, and degraded read performance of the key-value store co-designed with INEC by up to 99.57
Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2019
Proceedings of Annual Non-Volatile Memories Workshop, 2019 (Poster Paper)
Keywords: Non-Volatile Memory, In-Memory Datastore, Remote Direct Memory Persistence
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019 (Poster Paper)
Keywords: Edge Computing, AI, Object Detection
IEEE Transactions on Parallel and Distributed Systems, 2019
Keywords: graphics processing units;learning (artificial intelligence);message passing;neural nets;parallel processing;performance evaluation;model-oriented analysis;GPU clusters;streaming-based broadcast schemes;InfiniBand hardware multicast;IB-MCAST;NVIDIA GPUDirect technology;streaming learning applications;message transmission;deep learning;high-performance computing;graphics processing unit-based applications;message passing interface;remote direct memory access technology;GPUDirect RDMA technology;HPC clusters;performance evaluation;Graphics processing units;Hardware;Analytical models;Machine learning;Clustering algorithms;Scalability;Bandwidth;Broadcast;deep learning;hardware multicast;GPU;GPUDirect RDMA;heterogeneous broadcast;streaming
Abstract: Broadcast is a widely used operation in many streaming and deep learning applications to disseminate large amounts of data on emerging heterogeneous High-Performance Computing (HPC) systems. However, traditional broadcast schemes do not fully utilize hardware features for Graphics Processing Unit (GPU)-based applications. In this paper, a model-oriented analysis is presented to identify performance bottlenecks of existing broadcast schemes on GPU clusters. Next, streaming-based broadcast schemes are proposed to exploit InfiniBand hardware multicast (IB-MCAST) and NVIDIA GPUDirect technology for efficient message transmission. The proposed designs are evaluated in the context of using Message Passing Interface (MPI) based benchmarks and applications. The experimental results indicate improved scalability and up to 82 percent reduction of latency compared to the state-of-the-art solutions in the benchmark-level evaluation. Furthermore, compared to the state-of-the-art, the proposed design yields stable higher throughput for a synthetic streaming workload, and 1.3x faster training time for a deep learning framework.
CCF Transactions on High Performance Computing, 2019 (Invited Paper)
Abstract: Over the last decade, technologies derived from convolutional neural networks (CNNs), called Deep Learning applications, have revolutionized fields as diverse as cancer detection, self-driving cars, virtual assistants, etc. However, many users of such applications are not experts in Machine Learning itself. Consequently, there is limited knowledge among the community to run such applications in an optimized manner. The performance question for Deep Learning applications has typically been addressed by employing bespoke hardware (e.g., GPUs) better suited for such compute-intensive operations. However, such a degree of performance is only accessible at increasingly high financial costs, leaving only big corporations and governments with resources sufficient to employ them at a large scale. As a result, an average user is only left with access to commodity clusters with, in many cases, only CPUs as the sole processing element. For such users to make effective use of resources at their disposal, concerted efforts are necessary to figure out optimal hardware and software configurations. This study is one such step in this direction as we use the Roofline model to perform a systematic analysis of representative CNN models and identify opportunities for black box and application-aware optimizations. Using the findings from our study, we are able to obtain up to 3.5× speedup compared to vanilla TensorFlow with default configurations.
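Illustrative note (not from the paper): the Roofline model mentioned in the abstract bounds attainable performance by the smaller of the machine's peak compute rate and its memory bandwidth multiplied by the kernel's arithmetic intensity. The sketch below encodes that standard formula; the machine numbers are hypothetical and are not the CPUs or CNN layers measured in the study.

```python
def attainable_gflops(peak_gflops, mem_bw_gbs, intensity_flops_per_byte):
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, mem_bw_gbs * intensity_flops_per_byte)


def ridge_point(peak_gflops, mem_bw_gbs):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / mem_bw_gbs


if __name__ == "__main__":
    peak, bw = 1000.0, 100.0  # hypothetical machine: 1000 GFLOP/s, 100 GB/s
    print("ridge point (FLOP/byte):", ridge_point(peak, bw))
    for intensity in (2.0, 50.0):  # hypothetical kernels
        bound = attainable_gflops(peak, bw, intensity)
        kind = "memory-bound" if intensity < ridge_point(peak, bw) else "compute-bound"
        print(intensity, bound, kind)
```

Kernels to the left of the ridge point benefit from optimizations that cut memory traffic, while kernels to the right benefit from better use of the compute units, which is how the model separates candidate optimizations.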
Proceedings of IEEE International Symposium on Workload Characterization, 2019
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019
Keywords: NIC offload, tripartite, bipartite, erasure coding
Abstract: Erasure Coding (EC) NIC offload is a promising technology for designing next-generation distributed storage systems. However, this paper has identified three major limitations of current-generation EC NIC offload schemes on modern SmartNICs. Thus, this paper proposes a new EC NIC offload paradigm based on the tripartite graph model, namely TriEC. TriEC supports both encode-and-send and receive-and-decode operations efficiently. Through theorem-based proofs, co-designs with memcached (i.e., TriEC-Cache), and extensive experiments, we show that TriEC is correct and can deliver better performance than the state-of-the-art EC NIC offload schemes (i.e., BiEC). Benchmark evaluations demonstrate that TriEC outperforms BiEC by up to 1.82x and 2.33x for encoding and recovering, respectively. With extended YCSB workloads, TriEC reduces the average write latency by up to 23.2
Proceedings of International ACM Symposium on High Performance and Distributed Computing, 2019
Proceedings of IEEE International Conference on High Performance Computing, 2019
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019 (ACM Student Research Competition Poster)
Keywords: Erasure Coding, Storage Systems, SmartNIC, GPGPU
Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2019
Proceedings of IEEE International Conference on High Performance Computing, 2018
Frontiers of Information Technology & Electronic Engineering, 2018 (Invited Vision Paper)
Abstract: With the significant advancement in emerging processor, memory, and networking technologies, exascale systems will become available in the next few years (2020–2022). As the exascale systems begin to be deployed and used, there will be a continuous demand to run next-generation applications with finer granularity, finer time-steps, and increased data sizes. Based on historical trends, next-generation applications will require post-exascale systems during 2025–2035. In this study, we focus on the networking and communication challenges for post-exascale systems. Firstly, we present an envisioned architecture for post-exascale systems. Secondly, the challenges are summarized from different perspectives: heterogeneous networking technologies, high-performance communication and synchronization protocols, integrated support with accelerators and field-programmable gate arrays, fault-tolerance and quality-of-service support, energy-aware communication schemes and protocols, software-defined networking, and scalable communication protocols with heterogeneous memory and storage. Thirdly, we present the challenges in designing efficient programming model support for high-performance computing, big data, and deep learning on these systems. Finally, we emphasize the critical need for co-designing runtime with upper layers on these systems to achieve the maximum performance and scalability.
IEEE Transactions on Multi-Scale Computing Systems, 2018
Keywords: machine learning;big data;graphics processing units;training data;parallel processing;deep learning
Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2018
Keywords: dblp
Proceedings of IEEE International Conference on Big Data, 2018 (Short Paper)
Keywords: dblp
Proceedings of International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2018
Proceedings of International Conference on EuroMPI, 2018
Keywords: dblp
Proceedings of the International Supercomputing Conference, 2018 (Poster Paper)
Keywords: Unified Memory, Pascal, Volta, Out-of-Core DNN Training
Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2018
Keywords: dblp
Journal of Parallel and Distributed Computing, 2018
Keywords: MapReduce, HPC, Tuning, Prediction
Abstract: MapReduce is the most popular parallel computing framework for big data processing, allowing massive scalability across distributed computing environments. An advanced RDMA-based design of Hadoop MapReduce has been proposed that alleviates the performance bottlenecks in default Hadoop MapReduce by leveraging the benefits of RDMA. On the other hand, the data processing engine Spark provides fast execution of MapReduce applications through in-memory processing. Performance optimization for these contemporary big data processing frameworks on modern High-Performance Computing (HPC) systems is a formidable task because of the numerous configuration possibilities in each of them. In this paper, we propose MR-Advisor, a comprehensive tuning, profiling, and prediction tool for MapReduce. MR-Advisor is generalized to provide performance optimizations for Hadoop, Spark, and RDMA-enhanced Hadoop MapReduce designs over different file systems such as HDFS, Lustre, and Tachyon. Performance evaluations reveal that, with MR-Advisor’s suggested values, the job execution performance can be enhanced by a maximum of 58
Proceedings of IEEE/ACM International Conference on Utility and Cloud Computing, 2018
Keywords: dblp
Proceedings of Annual Non-Volatile Memories Workshop, 2018 (Poster Paper)
Keywords: Non-Volatile Memory, RDMA, MapReduce, DAG
Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2018
Keywords: dblp
Proceedings of IEEE International Conference on High Performance Computing, 2018
Keywords: dblp
Proceedings of ACM Symposium on Cloud Computing, 2018 (Poster Paper)
Keywords: dblp
Proceedings of IEEE International Conference on Cluster Computing, 2018
Keywords: dblp
Proceedings of Annual Symposium on High-Performance Interconnects, 2017
Keywords: dblp
Proceedings of IEEE International Conference on Distributed Computing Systems, 2017
Keywords: dblp
IEEE Transactions on Parallel and Distributed Systems, 2017
Keywords: Big Data;distributed databases;network operating systems;parallel architectures;resource allocation;storage management;storage media;workstation clusters;intermediate data placement;HPC clusters;high performance interconnects;parallel file systems;high performance computing clusters;data analytics;Big Data;HPC technologies;local storage media;Lustre-based global storage;MapReduce over Lustre deployments;high-performance YARN MapReduce design;storage provider;intermediate data storage;priority directory selection;RDMA-enhanced MapReduce;shuffle-intensive workloads;leadership-class HPC systems;job execution;Computer architecture;Servers;High performance computing;Data analysis;Big data;Memory;Big data;high performance computing;RDMA;MapReduce;lustre
Abstract: With high performance interconnects and parallel file systems, running MapReduce over modern High Performance Computing (HPC) clusters has attracted much attention due to its uniqueness of solving data analytics problems with a combination of Big Data and HPC technologies. Since the MapReduce architecture relies heavily on the availability of local storage media, the Lustre-based global storage in HPC clusters poses many new opportunities and challenges. In this paper, we perform a comprehensive study on different MapReduce over Lustre deployments and propose a novel high-performance design of YARN MapReduce on HPC clusters by utilizing Lustre as the additional storage provider for intermediate data. With a deployment architecture where both local disks and Lustre are utilized for intermediate data storage, we propose a novel priority directory selection scheme through which RDMA-enhanced MapReduce can choose the best intermediate storage during runtime by on-line profiling. Our results indicate that we can achieve 44 percent performance benefit for shuffle-intensive workloads in leadership-class HPC systems. Our priority directory selection scheme can improve the job execution time by 63 percent over default MapReduce while executing multiple concurrent jobs. To the best of our knowledge, this is the first such comprehensive study for YARN MapReduce with Lustre and RDMA.
Proceedings of IEEE International Conference on Big Data, 2017
Keywords: dblp
Proceedings of IEEE International Conference on High Performance Computing, 2017
Keywords: dblp
Proceedings of IEEE/ACM International Conference on Big Data Computing, Applications, and Technologies, 2017
Keywords: dblp
Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis, 2017
Keywords: dblp
Proceedings of ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 2017
Keywords: dblp
Proceedings of IEEE International Conference on Cluster Computing, 2017 (Short Paper)
Keywords: dblp
Proceedings of IEEE International Conference on Big Data, 2017 (Short Paper)
Keywords: dblp
Proceedings of Annual Non-Volatile Memories Workshop, 2017 (Position Paper)
Keywords: Non-Volatile Memory, RDMA, Big Data
Research Advances in Cloud Computing, 2017
Keywords: dblp
Proceedings of IEEE/ACM International Conference on Utility and Cloud Computing, 2017
Keywords: dblp
Proceedings of IEEE International Conference on Big Data, 2017
Keywords: dblp
Proceedings of International Conference on Parallel Processing, 2017
Keywords: dblp
IEEE Data Eng. Bull., Bulletin of the Technical Committee on Data Engineering, 2017 (Invited Paper)
Keywords: dblp
Proceedings of IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2017
Keywords: dblp
Proceedings of IEEE/ACM International Conference on Utility and Cloud Computing, 2017
Keywords: dblp
Proceedings of IEEE International Conference on High Performance Computing, 2017
Keywords: dblp
Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2017
Keywords: dblp
Proceedings of IEEE/ACM International Conference on Big Data Computing, Applications, and Technologies, 2016
Keywords: dblp
Proceedings of International Symposium on Computer Architecture and High Performance Computing, 2016
Keywords: dblp
Proceedings of IEEE International Conference on Big Data, 2016
Keywords: dblp
Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2016
Keywords: dblp
Proceedings of IEEE International Conference on Cloud Computing Technology and Science, 2016
Keywords: dblp
Proceedings of International European Conference on Parallel Processing, 2016
Keywords: dblp
Proceedings of International Conference on Parallel Processing, 2016
Keywords: dblp
Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis, 2016
Keywords: dblp
Proceedings of International Conference on Supercomputing, 2016
Keywords: dblp
The Journal of Supercomputing, 2016
Abstract: With the emergence of high-performance data analytics, the Hadoop platform is being increasingly used to process data stored on high-performance computing clusters. While there is immense scope for improving the performance of Hadoop MapReduce (including the network-intensive shuffle phase) over these modern clusters, that are equipped with high-speed interconnects such as InfiniBand and 10/40 GigE, and storage systems such as SSDs and Lustre, it is essential to study the MapReduce component in an isolated manner. In this paper, we study popular MapReduce workloads, obtained from well-accepted, comprehensive benchmark suites, to identify common shuffle data distribution patterns. We determine different environmental and workload-specific factors that affect the performance of the MapReduce job. Based on these characterization studies, we propose a micro-benchmark suite that can be used to evaluate the performance of stand-alone Hadoop MapReduce, and demonstrate its ease-of-use with different networks/protocols, Hadoop distributions, and storage architectures. Performance evaluations with our proposed micro-benchmarks show that stand-alone Hadoop MapReduce over IPoIB performs better than 10 GigE by about 13–15
Proceedings of The 7th International Workshop on Big Data Benchmarks, Performance, Optimization, and Emerging Hardware (BPOE-7), 2016
Proceedings of IEEE International Conference on Big Data, 2016 (Short Paper)
Keywords: dblp
Proceedings of IEEE International Conference on Cloud Computing Technology and Science, 2016
Keywords: dblp
Proceedings of Annual Conference on Extreme Science and Engineering Discovery Environment, 2016
Keywords: dblp
Proceedings of IEEE International Conference on Big Data, 2016
Keywords: dblp
Conquering Big Data with High Performance Computing, 2016
Abstract: Modern HPC systems and the associated middleware (such as MPI and parallel file systems) have been exploiting the advances in HPC technologies (multi-/many-core architecture, RDMA-enabled networking, and SSD) for many years. However, Big Data processing and management middleware have not fully taken advantage of such technologies. These disparities are taking HPC and Big Data processing into divergent trajectories. This chapter provides an overview of popular Big Data processing middleware, high-performance interconnects and storage architectures, and discusses the challenges in accelerating Big Data processing middleware by leveraging emerging technologies on modern HPC clusters. This chapter presents case studies of advanced designs based on RDMA and heterogeneous storage architecture, that were proposed to address these challenges for multiple components of Hadoop (HDFS and MapReduce) and Spark. The advanced designs presented in the case studies are publicly available as a part of the High-Performance Big Data (HiBD) project. An overview of the HiBD project is also provided in this chapter. All these works aim to bring HPC and Big Data processing into a convergent trajectory.
Workshop Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2016
Keywords: dblp
Proceedings of the International Supercomputing Conference, 2016
Keywords: dblp
Proceedings of IEEE International Conference on High Performance Computing, 2016
Keywords: dblp
Proceedings of Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, 2016
Keywords: dblp
Proceedings of IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2015
Keywords: dblp
Proceedings of IEEE International Conference on Cluster Computing, 2015
Keywords: dblp
Proceedings of IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2015
Keywords: dblp
Proceedings of IEEE International Conference on Big Data, 2015 (Short Paper)
Keywords: dblp
Proceedings of IEEE International Conference on Distributed Computing Systems, 2015
Keywords: dblp
Proceedings of IEEE International Conference on Big Data Computing Service and Applications, 2015 (Short Paper)
Keywords: dblp
Journal of Computer Science and Technology, 2015
Abstract: Current popular systems, Hadoop and Spark, cannot achieve satisfactory performance because of the inefficient overlapping of computation and communication when running iterative big data applications. The pipeline of computing, data movement, and data management plays a key role for current distributed data computing systems. In this paper, we first analyze the overhead of the shuffle operation in Hadoop and Spark when running the PageRank workload, and then propose an event-driven pipeline and in-memory shuffle design with better overlapping of computation and communication as DataMPI-Iteration, an MPI-based library, for iterative big data computing. Our performance evaluation shows DataMPI-Iteration can achieve 9X∼21X speedup over Apache Hadoop, and 2X∼3X speedup over Apache Spark for PageRank and K-means.
Workshop Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2015
Keywords: dblp
Proceedings of Workshop on OpenSHMEM and Related Technologies, 2015
Keywords: dblp
Proceedings of International European Conference on Parallel Processing, 2015
Keywords: dblp
Proceedings of IEEE International Conference on High Performance Computing, 2015
Keywords: dblp
Proceedings of International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2015
Keywords: dblp
Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2015
Keywords: dblp
Proceedings of International Conference on Parallel Processing, 2015
Keywords: dblp
Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software, 2015 (Poster Paper)
Keywords: dblp
Proceedings of IEEE International Conference on Big Data, 2015
Keywords: dblp
Proceedings of IEEE International Conference on Big Data, 2014 (Short Paper)
Keywords: dblp
Proceedings of International Conference on Partitioned Global Address Space Programming Models, 2014
Keywords: dblp
Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014 (Poster Paper)
Keywords: dblp
Proceedings of International European Conference on Parallel Processing, 2014
Keywords: dblp
Proceedings of IEEE International Conference on High Performance Computing, 2014
Keywords: dblp
Proceedings of IEEE International Conference on Networking, Architecture, and Storage, 2014
Keywords: dblp
Proceedings of Annual Symposium on High-Performance Interconnects, 2014
Keywords: dblp
Proceedings of IEEE International Conference on Cluster Computing, 2014
Keywords: dblp
Proceedings of International ACM Symposium on High Performance and Distributed Computing, 2014 (Short Paper)
Keywords: dblp
Proceedings of International Conference on Parallel Processing, 2014
Keywords: dblp
Proceedings of Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2014
Abstract: Big data systems address the challenges of capturing, storing, managing, analyzing, and visualizing big data. Within this context, developing benchmarks to evaluate and compare big data systems has become an active topic for both research and industry communities. To date, most of the state-of-the-art big data benchmarks are designed for specific types of systems. Based on our experience, however, we argue that considering the complexity, diversity, and rapid evolution of big data systems, for the sake of fairness, big data benchmarks must include diversity of data and workloads. Given this motivation, in this paper, we first propose the key requirements and challenges in developing big data benchmarks from the perspectives of generating data with 4 V properties (i.e. volume, velocity, variety and veracity) of big data, as well as generating tests with comprehensive workloads for big data systems. We then present the methodology on big data benchmarking designed to address these challenges. Next, the state of the art is summarized and compared, followed by our vision for future research directions.
Proceedings of International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2014
Keywords: dblp
Proceedings of International Conference on Parallel Processing, 2014
Keywords: dblp
Proceedings of IEEE International Conference on Cluster Computing, 2014
Keywords: dblp
Proceedings of International European Conference on Parallel Processing, 2014
Keywords: dblp
Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2014
Keywords: dblp
Proceedings of International Conference on Supercomputing, 2014
Keywords: dblp
Proceedings of International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2014
Keywords: dblp
Proceedings of International Conference on Partitioned Global Address Space Programming Models, 2014
Keywords: dblp
Proceedings of the 3rd Workshop on Big Data Benchmarking, 2013
Keywords: dblp
Workshop Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2013
Keywords: dblp
Proceedings of ACM Symposium on Cloud Computing, 2013 (Poster Paper)
Keywords: dblp
Proceedings of Annual Symposium on High-Performance Interconnects, 2013 (Short Paper)
Keywords: dblp
Proceedings of International Conference on Parallel Processing, 2013
Keywords: dblp
Proceedings of IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2013
Keywords: dblp
Proceedings of IEEE International Conference on Cluster Computing, 2013
Keywords: dblp
Proceedings of the 3rd Workshop on Big Data Benchmarking, 2012
Keywords: dblp
Proceedings of IEEE International Symposium on Parallel and Distributed Processing with Applications, 2011
Keywords: dblp
Workshop Proceedings of International Conference on Parallel Processing, 2011
Keywords: dblp
Proceedings of IEEE International Conference on Networking, Architecture, and Storage, 2010
Keywords: dblp
Proceedings of World Congress on Services, 2010
Keywords: dblp
Proceedings of IFIP International Conference on Network and Parallel Computing, 2010
Keywords: dblp
Proceedings of World Congress on Services, 2009
Keywords: dblp
Proceedings of International Conference on Parallel and Distributed Computing, Applications, and Technologies, 2009
Keywords: dblp
Proceedings of International Conference on Parallel and Distributed Computing, Applications, and Technologies, 2008 (Short Paper)
Keywords: dblp