PADSYS Lab | Publications

Conference Proceedings
[ICNP '24] KSpeed: Beating I/O Bottlenecks of Data Provisioning for RDMA Training Clusters
Jianbo Dong, Hao Qi, Tianjing Xu, Xiaoli Liu, Chen Wei, Rongyao Wang, Xiaoyi Lu, Zheng Cao, Binzhang Fu

Proceedings of The 32nd IEEE International Conference on Network Protocols (ICNP), 2024 (Acceptance Rate: 50/205=24%)
Cite "KSpeed: Beating I/O Bottlenecks of Data Provisioning for RDMA Training Clusters"
- Plain text
- BibTeX
Jianbo Dong, Hao Qi, Tianjing Xu, Xiaoli Liu, Chen Wei, Rongyao Wang, Xiaoyi Lu, Zheng Cao, and Binzhang Fu. KSpeed: Beating I/O Bottlenecks of Data Provisioning for RDMA Training Clusters. In Proceedings of The 32nd IEEE International Conference on Network Protocols (ICNP), ICNP '24. Charleroi, Belgium, October 28-31 2024. IEEE Computer Society. Acceptance Rate: 50/205=24%.
@inproceedings{conf-icnp24-jianbo, title="{KSpeed: Beating I/O Bottlenecks of Data Provisioning for RDMA Training Clusters}", author={Dong, Jianbo and Qi, Hao and Xu, Tianjing and Liu, Xiaoli and Wei, Chen and Wang, Rongyao and Lu, Xiaoyi and Cao, Zheng and Fu, Binzhang}, booktitle={Proceedings of The 32nd IEEE International Conference on Network Protocols (ICNP)}, publisher = {IEEE Computer Society}, series = {ICNP '24}, year={2024}, address={Charleroi, Belgium}, note={Acceptance Rate: 50/205=24\%}, month={October 28-31} }
Conference Proceedings
[IPDPS '24] An Optimized Error-controlled MPI Collective Framework Integrated with Lossy Compression
Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Zhaorui Zhang, Jinyang Liu, Xiaoyi Lu, Ken Raffenetti, Hui Zhou, Kai Zhao, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur

Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2024
Abstract: With the ever-increasing computing power of supercomputers and the growing scale of scientific applications, the efficiency of MPI collective communications turns out to be a critical bottleneck in large-scale distributed and parallel processing. The large message size in MPI collectives is particularly concerning because it can significantly degrade the overall parallel performance. To address this issue, prior research simply applies the off-the-shelf fix-rate lossy compressors in the MPI collectives, leading to suboptimal performance, limited generalizability, and unbounded errors. In this paper, we propose a novel solution, called C-Coll, which leverages error-bounded lossy compression to significantly reduce the message size, resulting in a substantial reduction in communication cost. The key contributions are three-fold. (1) We develop two general, optimized lossy-compression-based frameworks for both types of MPI collectives (collective data movement as well as collective computation), based on their particular characteristics. Our framework not only reduces communication cost but also preserves data accuracy. (2) We customize SZx, an ultra-fast error-bounded lossy compressor, to meet the specific needs of collective communication. (3) We integrate C-Coll into multiple collectives, such as MPI_Allreduce, MPI_Scatter, and MPI_Bcast, and perform a comprehensive evaluation based on real-world scientific datasets. Experiments show that our solution outperforms the original MPI collectives as well as multiple baselines and related efforts by 1.8-2.7X.
Cite "An Optimized Error-controlled MPI Collective Framework Integrated with Lossy Compression"
- Plain text
- BibTeX
Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Zhaorui Zhang, Jinyang Liu, Xiaoyi Lu, Ken Raffenetti, Hui Zhou, Kai Zhao, Zizhong Chen, Franck Cappello, Yanfei Guo, and Rajeev Thakur. An Optimized Error-controlled MPI Collective Framework Integrated with Lossy Compression. In Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS), IPDPS '24. San Francisco, California, USA, May 27-31 2024. IEEE Computer Society.
@inproceedings{conf-ipdps-jiajun24, title="{An Optimized Error-controlled MPI Collective Framework Integrated with Lossy Compression}", author={Huang, Jiajun and Di, Sheng and Yu, Xiaodong and Zhai, Yujia and Zhang, Zhaorui and Liu, Jinyang and Lu, Xiaoyi and Raffenetti, Ken and Zhou, Hui and Zhao, Kai and Chen, Zizhong and Cappello, Franck and Guo, Yanfei and Thakur, Rajeev}, booktitle={Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS)}, publisher = {IEEE Computer Society}, series = {IPDPS '24}, abstract = {With the ever-increasing computing power of supercomputers and the growing scale of scientific applications, the efficiency of MPI collective communications turns out to be a critical bottleneck in large-scale distributed and parallel processing. The large message size in MPI collectives is particularly concerning because it can significantly degrade the overall parallel performance. To address this issue, prior research simply applies the off-the-shelf fix-rate lossy compressors in the MPI collectives, leading to suboptimal performance, limited generalizability, and unbounded errors. In this paper, we propose a novel solution, called C-Coll, which leverages error-bounded lossy compression to significantly reduce the message size, resulting in a substantial reduction in communication cost. The key contributions are three-fold. (1) We develop two general, optimized lossy-compression-based frameworks for both types of MPI collectives (collective data movement as well as collective computation), based on their particular characteristics. Our framework not only reduces communication cost but also preserves data accuracy. (2) We customize SZx, an ultra-fast error-bounded lossy compressor, to meet the specific needs of collective communication. 
(3) We integrate C-Coll into multiple collectives, such as MPI_Allreduce, MPI_Scatter, and MPI_Bcast, and perform a comprehensive evaluation based on real-world scientific datasets. Experiments show that our solution outperforms the original MPI collectives as well as multiple baselines and related efforts by 1.8-2.7X.}, year={2024}, address={San Francisco, California, USA}, month={May 27-31} }

[IEEE Micro'24] High-Speed Data Communication with Advanced Networks in Large Language Model Training

Liuyao Dai, Hao Qi, Weicong Chen, Xiaoyi Lu

IEEE Micro, 2024

Keywords: training;parallel processing;data models;computational modeling;decoding;tcpip;synchronization;high-speed networks;large language models

Abstract: Large language models (LLMs) like Generative Pre-trained Transformer, Bidirectional Encoder Representations from Transformers, and T5 are pivotal in natural language processing. Their distributed training is influenced by high-speed interconnects. This article characterizes their training performance across various interconnects and communication protocols: TCP/IP, Internet Protocol over InfiniBand (IPoIB), and Remote Direct Memory Access (RDMA), using data and model parallelism. RDMA-100 Gbps outperforms IPoIB-100 Gbps and TCP/IP-10 Gbps, with average gains of 2.5x and 4.8x in data parallelism, while in model parallelism, the gains were 1.1x and 1.2x. RDMA achieves the highest interconnect utilization (up to 60 Gbps), compared to IPoIB with up to 20 Gbps and TCP/IP with up to 9 Gbps. Larger models demand increased communication bandwidth, with AllReduce in data parallelism consuming up to 91

Journal Article
[IEEE Micro'24] Compression Analysis for BlueField-2/3 Data Processing Units: Lossy and Lossless Perspectives
Yuke Li, Arjun Kashyap, Yanfei Guo, Xiaoyi Lu

IEEE Micro, 2024
Keywords: engines;task analysis;data compression;throughput;memory management;hardware acceleration;distributed databases;data processing;smart devices;system-on-chip

Abstract: A data processing unit (DPU) with programmable smart network interface card containing system-on-chip (SoC) cores is now a valuable addition to the host CPU, finding use in high-performance computing (HPC) and data center clusters for its advanced features, notably, a hardware-based data compression engine (C-engine). With the convergence of big data, HPC, and machine learning, data volumes burden communication and storage, making efficient compression vital. This positions DPUs as tools to accelerate compression workloads and enhance data-intensive applications. This article characterizes lossy (e.g., SZ3) and lossless (e.g., DEFLATE, lz4, and zlib) compression algorithms using seven real-world datasets on Nvidia BlueField-2/-3 DPUs. We explore the potential opportunities for offloading these compression workloads from the host. Our findings demonstrate that the C-engine within the DPU can achieve up to 26.8x speedup compared to its SoC core. We also provide insights on harnessing BlueField for compression, presenting seven crucial takeaways to steer future compression research with DPUs.
Cite "Compression Analysis for BlueField-2/3 Data Processing Units: Lossy and Lossless Perspectives"
- Plain text
- BibTeX
Yuke Li, Arjun Kashyap, Yanfei Guo, and Xiaoyi Lu. Compression Analysis for BlueField-2/3 Data Processing Units: Lossy and Lossless Perspectives. IEEE Micro, 44(02):8-19, March 2024. doi:10.1109/MM.2023.3343636.
@article{journals-micro24-yuke, author = {Li, Yuke and Kashyap, Arjun and Guo, Yanfei and Lu, Xiaoyi}, title = "{Compression Analysis for BlueField-2/3 Data Processing Units: Lossy and Lossless Perspectives}", journal = {IEEE Micro}, year = {2024}, volume = {44}, number = {02}, issn = {1937-4143}, pages = {8-19}, abstract = {A data processing unit (DPU) with programmable smart network interface card containing system-on-chip (SoC) cores is now a valuable addition to the host CPU, finding use in high-performance computing (HPC) and data center clusters for its advanced features, notably, a hardware-based data compression engine (C-engine). With the convergence of big data, HPC, and machine learning, data volumes burden communication and storage, making efficient compression vital. This positions DPUs as tools to accelerate compression workloads and enhance data-intensive applications. This article characterizes lossy (e.g., SZ3) and lossless (e.g., DEFLATE, lz4, and zlib) compression algorithms using seven real-world datasets on Nvidia BlueField-2/-3 DPUs. We explore the potential opportunities for offloading these compression workloads from the host. Our findings demonstrate that the C-engine within the DPU can achieve up to 26.8x speedup compared to its SoC core. We also provide insights on harnessing BlueField for compression, presenting seven crucial takeaways to steer future compression research with DPUs.}, keywords = {engines;task analysis;data compression;throughput;memory management;hardware acceleration;distributed databases;data processing;smart devices;system-on-chip}, doi = {10.1109/MM.2023.3343636}, publisher = {IEEE Computer Society}, address = {Los Alamitos, CA, USA}, series = {IEEE Micro'24}, month = {mar} }
Conference Proceedings
[IPDPS '24] Accelerating Lossy and Lossless Compression on Emerging BlueField DPU Architectures
Yuke Li, Arjun Kashyap, Weicong Chen, Yanfei Guo, Xiaoyi Lu

Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2024 (Best Paper Award Nomination)
Cite "Accelerating Lossy and Lossless Compression on Emerging BlueField DPU Architectures"
- Plain text
- BibTeX
Yuke Li, Arjun Kashyap, Weicong Chen, Yanfei Guo, and Xiaoyi Lu. Accelerating Lossy and Lossless Compression on Emerging BlueField DPU Architectures. In Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS), IPDPS '24. San Francisco, California, USA, May 27-31 2024. IEEE Computer Society. Best Paper Award Nomination.
@inproceedings{conf-ipdps24-yuke, title="{Accelerating Lossy and Lossless Compression on Emerging BlueField DPU Architectures}", author={Li, Yuke and Kashyap, Arjun and Chen, Weicong and Guo, Yanfei and Lu, Xiaoyi}, booktitle={Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS)}, publisher = {IEEE Computer Society}, series = {IPDPS '24}, year={2024}, address={San Francisco, California, USA}, note={Best Paper Award Nomination}, month={May 27-31} }
Conference Proceedings
[SC '24] Versatile Datapath Soft Error Detection on the Cheap for HPC Applications
Yafan Huang, Sheng Di, Zhaorui Zhang, Xiaoyi Lu, Guanpeng Li

Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 2024
Keywords: Code Transformation, Compiler, Datapath Protection, High-Performance Computing (HPC), Reliability, Soft Errors

Abstract: With the ongoing reduction in technology sizes and voltage levels, modern microprocessors are increasingly susceptible to soft errors, corrupting datapath units during program execution. While these error types have received considerable attention recently, existing solutions either confine themselves to limited scopes or incur massive overheads in performance and power consumption, hindering practical usage. In this work, we propose CONDA, a novel error detection technique based on code transformation and static program analysis, achieving versatile datapath protection at low cost. At compile time, CONDA analyzes program characteristics and transforms the original program code without complicating its control-flow and memory access patterns. At runtime, CONDA detects datapath errors with low overhead and latency. The evaluation of 38 benchmarks and a parallel HPC simulation reveals that ConDa only incurs 57.79% runtime overhead, which is 41.84% faster than existing state-of-the-art, with the same level of error detection effectiveness and low detection latency.
Cite "Versatile Datapath Soft Error Detection on the Cheap for HPC Applications"
- Plain text
- BibTeX
Yafan Huang, Sheng Di, Zhaorui Zhang, Xiaoyi Lu, and Guanpeng Li. Versatile Datapath Soft Error Detection on the Cheap for HPC Applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC '24. IEEE Press, 2024. URL: https://doi.org/10.1109/SC41406.2024.00061, doi:10.1109/SC41406.2024.00061.
@inproceedings{conf-sc24-yafan, author = {Huang, Yafan and Di, Sheng and Zhang, Zhaorui and Lu, Xiaoyi and Li, Guanpeng}, title = "{Versatile Datapath Soft Error Detection on the Cheap for HPC Applications}", year = {2024}, isbn = {9798350352917}, publisher = {IEEE Press}, url = {https://doi.org/10.1109/SC41406.2024.00061}, doi = {10.1109/SC41406.2024.00061}, abstract = {With the ongoing reduction in technology sizes and voltage levels, modern microprocessors are increasingly susceptible to soft errors, corrupting datapath units during program execution. While these error types have received considerable attention recently, existing solutions either confine themselves to limited scopes or incur massive overheads in performance and power consumption, hindering practical usage. In this work, we propose CONDA, a novel error detection technique based on code transformation and static program analysis, achieving versatile datapath protection at low cost. At compile time, CONDA analyzes program characteristics and transforms the original program code without complicating its control-flow and memory access patterns. At runtime, CONDA detects datapath errors with low overhead and latency. The evaluation of 38 benchmarks and a parallel HPC simulation reveals that ConDa only incurs 57.79\% runtime overhead, which is 41.84\% faster than existing state-of-the-art, with the same level of error detection effectiveness and low detection latency.}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis}, articleno = {55}, numpages = {15}, keywords = {Code Transformation, Compiler, Datapath Protection, High-Performance Computing (HPC), Reliability, Soft Errors}, location = {Atlanta, GA, USA}, series = {SC '24} }
Conference Proceedings
[SC '24] hZCCL: Accelerating Collective Communication with Co-Designed Homomorphic Compression
Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Zizhe Jian, Xin Liang, Kai Zhao, Xiaoyi Lu, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur

Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 2024
Keywords: Collective Communication, Distributed Computing, Homomorphic Compression, Parallel Algorithm

Abstract: As network bandwidth struggles to keep up with rapidly growing computing capabilities, the efficiency of collective communication has become a critical challenge for exa-scale distributed and parallel applications. Traditional approaches directly utilize error-bounded lossy compression to accelerate collective computation operations, exposing unsatisfying performance due to the expensive decompression-operation-compression (DOC) workflow. To address this issue, we present a first-ever homomorphic compression-communication co-design, hZCCL, which enables operations to be performed directly on compressed data, saving the cost of time-consuming decompression and recompression. In addition to the co-design framework, we build a light-weight compressor, optimized specifically for multi-core CPU platforms. We also present a homomorphic compressor with a run-time heuristic to dynamically select efficient compression pipelines for reducing the cost of DOC handling. We evaluate hZCCL with up to 512 nodes and across five application datasets. The experimental results demonstrate that our homomorphic compressor achieves a CPU throughput of up to 379.08 GB/s, surpassing the conventional DOC workflow by up to 36.53×. Moreover, our hZCCL-accelerated collectives outperform two state-of-the-art baselines, delivering speedups of up to 2.12× and 6.77× compared to original MPI collectives in single-thread and multi-thread modes, respectively, while maintaining data accuracy.
Cite "hZCCL: Accelerating Collective Communication with Co-Designed Homomorphic Compression"
- Plain text
- BibTeX
Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Zizhe Jian, Xin Liang, Kai Zhao, Xiaoyi Lu, Zizhong Chen, Franck Cappello, Yanfei Guo, and Rajeev Thakur. hZCCL: Accelerating Collective Communication with Co-Designed Homomorphic Compression. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC '24. IEEE Press, 2024. URL: https://doi.org/10.1109/SC41406.2024.00110, doi:10.1109/SC41406.2024.00110.
@inproceedings{conf-sc24-jiajun, author = {Huang, Jiajun and Di, Sheng and Yu, Xiaodong and Zhai, Yujia and Liu, Jinyang and Jian, Zizhe and Liang, Xin and Zhao, Kai and Lu, Xiaoyi and Chen, Zizhong and Cappello, Franck and Guo, Yanfei and Thakur, Rajeev}, title = "{hZCCL: Accelerating Collective Communication with Co-Designed Homomorphic Compression}", year = {2024}, isbn = {9798350352917}, publisher = {IEEE Press}, url = {https://doi.org/10.1109/SC41406.2024.00110}, doi = {10.1109/SC41406.2024.00110}, abstract = {As network bandwidth struggles to keep up with rapidly growing computing capabilities, the efficiency of collective communication has become a critical challenge for exa-scale distributed and parallel applications. Traditional approaches directly utilize error-bounded lossy compression to accelerate collective computation operations, exposing unsatisfying performance due to the expensive decompression-operation-compression (DOC) workflow. To address this issue, we present a first-ever homomorphic compression-communication co-design, hZCCL, which enables operations to be performed directly on compressed data, saving the cost of time-consuming decompression and recompression. In addition to the co-design framework, we build a light-weight compressor, optimized specifically for multi-core CPU platforms. We also present a homomorphic compressor with a run-time heuristic to dynamically select efficient compression pipelines for reducing the cost of DOC handling. We evaluate hZCCL with up to 512 nodes and across five application datasets. The experimental results demonstrate that our homomorphic compressor achieves a CPU throughput of up to 379.08 GB/s, surpassing the conventional DOC workflow by up to 36.53\texttimes{}. 
Moreover, our hZCCL-accelerated collectives outperform two state-of-the-art baselines, delivering speedups of up to 2.12\texttimes{} and 6.77\texttimes{} compared to original MPI collectives in single-thread and multi-thread modes, respectively, while maintaining data accuracy.}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis}, articleno = {104}, numpages = {15}, keywords = {Collective Communication, Distributed Computing, Homomorphic Compression, Parallel Algorithm}, location = {Atlanta, GA, USA}, series = {SC '24} }
Conference Proceedings
[ICS '24] gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Yafan Huang, Ken Raffenetti, Hui Zhou, Kai Zhao, Xiaoyi Lu, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur

Proceedings of the 38th International Conference on Supercomputing (ICS), 2024
Abstract: GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. A traditional approach is to directly integrate lossy compression into GPU-aware collectives, which can lead to serious performance issues such as underutilized GPU devices and uncontrolled data distortion. In order to address these issues, in this paper, we propose gZCCL, a first-ever general framework that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation. To validate our framework, we evaluate the performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our gZCCL-accelerated collectives, including both collective computation (Allreduce) and collective data movement (Scatter), can outperform NCCL as well as Cray MPI by up to 4.5X and 28.7X, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.
Cite "gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters"
- Plain text
- BibTeX
Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Yafan Huang, Ken Raffenetti, Hui Zhou, Kai Zhao, Xiaoyi Lu, Zizhong Chen, Franck Cappello, Yanfei Guo, and Rajeev Thakur. gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters. In Proceedings of the 38th International Conference on Supercomputing (ICS), ICS '24. Kyoto University, Kyoto, Japan, June 2024.
@inproceedings{conf-ics24-jiajun-gzccl, author = {Huang, Jiajun and Di, Sheng and Yu, Xiaodong and Zhai, Yujia and Liu, Jinyang and Huang, Yafan and Raffenetti, Ken and Zhou, Hui and Zhao, Kai and Lu, Xiaoyi and Chen, Zizhong and Cappello, Franck and Guo, Yanfei and Thakur, Rajeev}, title = "{gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters}", booktitle = {Proceedings of the 38th International Conference on Supercomputing (ICS)}, year = {2024}, address = {Kyoto University, Kyoto, Japan}, month = {June}, series = {ICS '24}, abstract = {GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. A traditional approach is to directly integrate lossy compression into GPU-aware collectives, which can lead to serious performance issues such as underutilized GPU devices and uncontrolled data distortion. In order to address these issues, in this paper, we propose gZCCL, a first-ever general framework that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation. To validate our framework, we evaluate the performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our gZCCL-accelerated collectives, including both collective computation (Allreduce) and collective data movement (Scatter), can outperform NCCL as well as Cray MPI by up to 4.5X and 28.7X, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.} }
Conference Proceedings
[IPDPS '24] DRUTO: Upper-Bounding Silent Data Corruption Vulnerability in GPU Applications
Md Hasanur Rahman, Sheng Di, Shengjian Guo, Xiaoyi Lu, Guanpeng Li, Franck Cappello

Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2024
Cite "DRUTO: Upper-Bounding Silent Data Corruption Vulnerability in GPU Applications"
- Plain text
- BibTeX
Md Hasanur Rahman, Sheng Di, Shengjian Guo, Xiaoyi Lu, Guanpeng Li, and Franck Cappello. DRUTO: Upper-Bounding Silent Data Corruption Vulnerability in GPU Applications. In Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS), IPDPS '24. San Francisco, California, USA, May 27-31 2024. IEEE Computer Society.
@inproceedings{conf-ipdps-jasanur24, title="{DRUTO: Upper-Bounding Silent Data Corruption Vulnerability in GPU Applications}", author={Rahman, Md Hasanur and Di, Sheng and Guo, Shengjian and Lu, Xiaoyi and Li, Guanpeng and Cappello, Franck}, booktitle={Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS)}, publisher = {IEEE Computer Society}, series = {IPDPS '24}, year={2024}, address={San Francisco, California, USA}, month={May 27-31} }
Conference Proceedings
[IPDPS '24] NVMe-oPF: Designing Efficient Priority Schemes for NVMe-over-Fabrics with Multi-Tenancy Support
Darren Ng, Andrew Lin, Arjun Kashyap, Guanpeng Li, Xiaoyi Lu

Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2024
Cite "NVMe-oPF: Designing Efficient Priority Schemes for NVMe-over-Fabrics with Multi-Tenancy Support"
- Plain text
- BibTeX
Darren Ng, Andrew Lin, Arjun Kashyap, Guanpeng Li, and Xiaoyi Lu. NVMe-oPF: Designing Efficient Priority Schemes for NVMe-over-Fabrics with Multi-Tenancy Support. In Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS), IPDPS '24. San Francisco, California, USA, May 27-31 2024. IEEE Computer Society.
@inproceedings{conf-ipdps24-darren, title="{NVMe-oPF: Designing Efficient Priority Schemes for NVMe-over-Fabrics with Multi-Tenancy Support}", author={Ng, Darren and Lin, Andrew and Kashyap, Arjun and Li, Guanpeng and Lu, Xiaoyi}, booktitle={Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS)}, publisher = {IEEE Computer Society}, series = {IPDPS '24}, year={2024}, address={San Francisco, California, USA}, month={May 27-31} }
Conference Proceedings
[HotI '23] Performance Characterization of Large Language Models on High-Speed Interconnects
Hao Qi, Liuyao Dai, Weicong Chen, Zhen Jia, Xiaoyi Lu

Proceedings of the 30th IEEE Hot Interconnects Symposium, 2023
Cite "Performance Characterization of Large Language Models on High-Speed Interconnects"
- Plain text
- BibTeX
Hao Qi, Liuyao Dai, Weicong Chen, Zhen Jia, and Xiaoyi Lu. Performance Characterization of Large Language Models on High-Speed Interconnects. In Proceedings of the 30th IEEE Hot Interconnects Symposium, HotI '23. IEEE Computer Society, August 2023.
@inproceedings{conf-hoti-llm-hao23, author={Qi, Hao and Dai, Liuyao and Chen, Weicong and Jia, Zhen and Lu, Xiaoyi}, title="{Performance Characterization of Large Language Models on High-Speed Interconnects}", location={ONLINE}, booktitle = {Proceedings of the 30th IEEE Hot Interconnects Symposium}, publisher = {IEEE Computer Society}, series = {HotI '23}, month={August}, year = 2023 }
Conference Proceedings
[ModSim '23] LogGOPSGauger: A Work-In-Progress Tool for Gauging LogGOPS Model with GPU-Aware Communication
Liuyao Dai, Adam Weingram, Hao Qi, Weicong Chen, Xiaoyi Lu

2023 Workshop on Modeling & Simulation of Systems and Applications (ModSim), 2023 (Poster Paper)
Cite "LogGOPSGauger: A Work-In-Progress Tool for Gauging LogGOPS Model with GPU-Aware Communication"
- Plain text
- BibTeX
Liuyao Dai, Adam Weingram, Hao Qi, Weicong Chen, and Xiaoyi Lu. LogGOPSGauger: A Work-In-Progress Tool for Gauging LogGOPS Model with GPU-Aware Communication. In 2023 Workshop on Modeling & Simulation of Systems and Applications (ModSim), ModSim '23. August 2023. Poster Paper.
@inproceedings{conf-modsim-liuyao23, author={Dai, Liuyao and Weingram, Adam and Qi, Hao and Chen, Weicong and Lu, Xiaoyi}, title={{LogGOPSGauger: A Work-In-Progress Tool for Gauging LogGOPS Model with GPU-Aware Communication}}, booktitle={2023 Workshop on Modeling \& Simulation of Systems and Applications (ModSim)}, location={Seattle, WA, USA}, month={August}, series = {ModSim '23}, note = {Poster Paper}, year={2023} }
Conference Proceedings
[MlSys Workshop'23] Learning Distributed Protocols with Zero Knowledge
Yujie Hui, Drew Ripberger, Xiaoyi Lu, Yang Wang

Machine Learning for Systems 2023 (MlSys Workshop at NeurIPS 2023), 2023
Cite "Learning Distributed Protocols with Zero Knowledge"
- Plain text
- BibTeX
Yujie Hui, Drew Ripberger, Xiaoyi Lu, and Yang Wang. Learning Distributed Protocols with Zero Knowledge. In Machine Learning for Systems 2023 (MlSys Workshop at NeurIPS 2023), MlSys Workshop'23. 2023. URL: https://openreview.net/forum?id=u0Ncut8ru5.
@inproceedings{conf-mlsysw-yujie23, title="{Learning Distributed Protocols with Zero Knowledge}", author={Hui, Yujie and Ripberger, Drew and Lu, Xiaoyi and Wang, Yang}, booktitle={Machine Learning for Systems 2023 (MlSys Workshop at NeurIPS 2023)}, year={2023}, series = {MlSys Workshop'23}, url={https://openreview.net/forum?id=u0Ncut8ru5} }

[IPDPS '23] SBGT: Scaling Bayesian-based Group Testing for Disease Surveillance

Weicong Chen, Hao Qi, Xiaoyi Lu, Curtis Tatsuoka

Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2023

Keywords: Group testing, Bayesian, Lattices, Spark, COVID-19

Abstract: The COVID-19 pandemic underscored the necessity for disease surveillance using group testing. Novel Bayesian methods using lattice models were proposed, which offer substantial improvements in group testing efficiency by precisely quantifying uncertainty in diagnoses, acknowledging varying individual risk and dilution effects, and guiding optimally convergent sequential pooled test selections using a Bayesian Halving Algorithm. Computationally, however, Bayesian group testing poses considerable challenges as computational complexity grows exponentially with sample size. This can lead to shortcomings in reaching a desirable scale without practical limitations. We propose a new framework for scaling Bayesian group testing based on Spark: SBGT. We show that SBGT is lightning fast and highly scalable. In particular, SBGT is up to 376x, 1733x, and 1523x faster than the state-of-the-art framework in manipulating lattice models, performing test selections, and conducting statistical analyses, respectively, while achieving up to 97.9

Journal Article
[JCST '23] xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning
Adam Weingram, Yuke Li, Hao Qi, Darren Ng, Liuyao Dai, Xiaoyi Lu

Journal of Computer Science and Technology, 2023 (Invited Paper for the Special Issue in Honor of Professor Kai Hwang’s 80th Birthday)
Keywords: collective; deep learning; distributed training; GPUDirect; RDMA (remote direct memory access)

Abstract: Machine learning techniques have become ubiquitous both in industry and academic applications. Increasing model sizes and training data volumes necessitate fast and efficient distributed training approaches. Collective communications greatly simplify inter- and intra-node data transfer and are an essential part of the distributed training process as information such as gradients must be shared between processing nodes. In this paper, we survey the current state-of-the-art collective communication libraries (namely xCCL, including NCCL, oneCCL, RCCL, MSCCL, ACCL, and Gloo), with a focus on the industry-led ones for deep learning workloads. We investigate the design features of these xCCLs, discuss their use cases in the industry deep learning workloads, compare their performance with industry-made benchmarks (i.e., NCCL Tests and PARAM), and discuss key take-aways and interesting observations. We believe our survey sheds light on potential research directions of future designs for xCCLs.
Cite "xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning"
- Plain text
- BibTeX
Adam Weingram, Yuke Li, Hao Qi, Darren Ng, Liuyao Dai, and Xiaoyi Lu. xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning. Journal of Computer Science and Technology, 38(1):166, 2023. Invited Paper for the Special Issue in Honor of Professor Kai Hwang’s 80th Birthday. URL: https://jcst.ict.ac.cn/EN/abstract/article_2965.shtml, doi:10.1007/s11390-023-2894-6.
@article{journals-jcst-Weingram23, author = {Weingram, Adam and Li, Yuke and Qi, Hao and Ng, Darren and Dai, Liuyao and Lu, Xiaoyi}, title = {xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning}, publisher = {Journal of Computer Science and Technology}, year = {2023}, journal = {Journal of Computer Science and Technology}, volume = {38}, number = {1}, eid = {166}, numpages = {29}, pages = {166}, keywords = {collective; deep learning; distributed training; GPUDirect; RDMA (remote direct memory access)}, url = {https://jcst.ict.ac.cn/EN/abstract/article_2965.shtml}, doi = {10.1007/s11390-023-2894-6}, abstract = {Machine learning techniques have become ubiquitous both in industry and academic applications. Increasing model sizes and training data volumes necessitate fast and efficient distributed training approaches. Collective communications greatly simplify inter- and intra-node data transfer and are an essential part of the distributed training process as information such as gradients must be shared between processing nodes. In this paper, we survey the current state-of-the-art collective communication libraries (namely xCCL, including NCCL, oneCCL, RCCL, MSCCL, ACCL, and Gloo), with a focus on the industry-led ones for deep learning workloads. We investigate the design features of these xCCLs, discuss their use cases in the industry deep learning workloads, compare their performance with industry-made benchmarks (i.e., NCCL Tests and PARAM), and discuss key take-aways and interesting observations. We believe our survey sheds light on potential research directions of future designs for xCCLs.}, note = {Invited Paper for the Special Issue in Honor of Professor Kai Hwang’s 80th Birthday}, series = {JCST '23} }
Conference Proceedings
[SC '23] Early Experience in Characterizing Training Large Language Models on Modern HPC Clusters
Hao Qi, Liuyao Dai, Weicong Chen, Xiaoyi Lu

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023 (Research Poster Paper)
- Keywords
- Citation
Keywords: LLM, RDMA, Performance
Cite "Early Experience in Characterizing Training Large Language Models on Modern HPC Clusters"
- Plain text
- BibTeX
Hao Qi, Liuyao Dai, Weicong Chen, and Xiaoyi Lu. Early Experience in Characterizing Training Large Language Models on Modern HPC Clusters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '23. New York, NY, USA, Nov. 2023. Association for Computing Machinery. Research Poster Paper.
@inproceedings{conf-sc-hao23-poster, author = {Qi, Hao and Dai, Liuyao and Chen, Weicong and Lu, Xiaoyi}, title = "{Early Experience in Characterizing Training Large Language Models on Modern HPC Clusters}", month = {Nov.}, date = {12-17}, year = {2023}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, keywords = {LLM, RDMA, Performance}, location = {Denver, Colorado}, note = {Research Poster Paper}, series = {SC '23} }
Conference Proceedings
[SC '23] Characterizing One-/Two-sided Designs in OpenSHMEM Collectives
Yuke Li, Yanfei Guo, Xiaoyi Lu

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023 (Research Poster Paper)
- Keywords
- Citation
Keywords: OpenSHMEM, PGAS, Omni-Path, Performance
Cite "Characterizing One-/Two-sided Designs in OpenSHMEM Collectives"
- Plain text
- BibTeX
Yuke Li, Yanfei Guo, and Xiaoyi Lu. Characterizing One-/Two-sided Designs in OpenSHMEM Collectives. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '23. New York, NY, USA, Nov. 2023. Association for Computing Machinery. Research Poster Paper.
@inproceedings{conf-sc-yuke23-poster, author = {Li, Yuke and Guo, Yanfei and Lu, Xiaoyi}, title = "{Characterizing One-/Two-sided Designs in OpenSHMEM Collectives}", month = {Nov.}, date = {12-17}, year = {2023}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, keywords = {OpenSHMEM, PGAS, Omni-Path, Performance}, location = {Denver, Colorado}, note = {Research Poster Paper}, series = {SC '23} }
Conference Proceedings
[ModSim '23] Early Experiences in Modeling Performance Implications of DPU-Offloaded Computation
Weicong Chen, Yuke Li, Arjun Kashyap, Xiaoyi Lu

2023 Workshop on Modeling & Simulation of Systems and Applications (ModSim), 2023
- Citation
Cite "Early Experiences in Modeling Performance Implications of DPU-Offloaded Computation"
- Plain text
- BibTeX
Weicong Chen, Yuke Li, Arjun Kashyap, and Xiaoyi Lu. Early Experiences in Modeling Performance Implications of DPU-Offloaded Computation. In 2023 Workshop on Modeling & Simulation of Systems and Applications (ModSim), ModSim '23. August 2023.
@inproceedings{conf-modsim-ben23, title={{Early Experiences in Modeling Performance Implications of DPU-Offloaded Computation}}, author={Chen, Weicong and Li, Yuke and Kashyap, Arjun and Lu, Xiaoyi}, booktitle={2023 Workshop on Modeling \& Simulation of Systems and Applications (ModSim)}, location={Seattle, WA, USA}, month={August}, series = {ModSim '23}, year={2023} }
Conference Proceedings
[SC '23] An Early Case Study with Multi-Tenancy Support in SPDK’s NVMe-over-Fabric Designs
Darren Ng, Charles Parkinson, Andrew Lin, Arjun Kashyap, Xiaoyi Lu

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023 (Research Poster Paper)
- Keywords
- Citation
Keywords: Storage Disaggregation, NVMe-over-Fabric, SPDK
Cite "An Early Case Study with Multi-Tenancy Support in SPDK’s NVMe-over-Fabric Designs"
- Plain text
- BibTeX
Darren Ng, Charles Parkinson, Andrew Lin, Arjun Kashyap, and Xiaoyi Lu. An Early Case Study with Multi-Tenancy Support in SPDK’s NVMe-over-Fabric Designs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '23. New York, NY, USA, Nov. 2023. Association for Computing Machinery. Research Poster Paper.
@inproceedings{conf-sc-darren23-poster, author = {Ng, Darren and Parkinson, Charles and Lin, Andrew and Kashyap, Arjun and Lu, Xiaoyi}, title = "{An Early Case Study with Multi-Tenancy Support in SPDK’s NVMe-over-Fabric Designs}", month = {Nov.}, date = {12-17}, year = {2023}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, keywords = {Storage Disaggregation, NVMe-over-Fabric, SPDK}, location = {Denver, Colorado}, note = {Research Poster Paper}, series = {SC '23} }
Conference Proceedings
[HotI '23] Characterizing Lossy and Lossless Compression on Emerging BlueField DPU Architectures
Yuke Li, Arjun Kashyap, Yanfei Guo, Xiaoyi Lu

Proceedings of the 30th IEEE Hot Interconnects Symposium, 2023
- Citation
Cite "Characterizing Lossy and Lossless Compression on Emerging BlueField DPU Architectures"
- Plain text
- BibTeX
Yuke Li, Arjun Kashyap, Yanfei Guo, and Xiaoyi Lu. Characterizing Lossy and Lossless Compression on Emerging BlueField DPU Architectures. In Proceedings of the 30th IEEE Hot Interconnects Symposium, HotI '23. IEEE Computer Society, August 2023.
@inproceedings{conf-hoti-dpu-yuke23, author={Li, Yuke and Kashyap, Arjun and Guo, Yanfei and Lu, Xiaoyi}, title="{Characterizing Lossy and Lossless Compression on Emerging BlueField DPU Architectures}", location={Online}, booktitle = {Proceedings of the 30th IEEE Hot Interconnects Symposium}, publisher = {IEEE Computer Society}, series = {HotI '23}, month={August}, year = 2023 }
Conference Proceedings
[HotInfra '23] On the Discontinuation of Persistent Memory: Looking Back to Look Forward
Tianxi Li, Yang Wang, Xiaoyi Lu

The 1st International Workshop on Hot Topics in System Infrastructure (HotInfra), in conjunction with International Symposium on Computer Architecture (ISCA), 2023
- Citation
- Full text
Cite "On the Discontinuation of Persistent Memory: Looking Back to Look Forward"
- Plain text
- BibTeX
Tianxi Li, Yang Wang, and Xiaoyi Lu. On the Discontinuation of Persistent Memory: Looking Back to Look Forward. In The 1st International Workshop on Hot Topics in System Infrastructure (HotInfra), in conjunction with International Symposium on Computer Architecture (ISCA), HotInfra '23. June 2023. URL: https://hotinfra23.github.io/papers/hotinfra23-paper9.pdf.
@inproceedings{conf-hotinfra-tianxi23, title={{On the Discontinuation of Persistent Memory: Looking Back to Look Forward}}, author={Li, Tianxi and Wang, Yang and Lu, Xiaoyi}, booktitle={The 1st International Workshop on Hot Topics in System Infrastructure (HotInfra), in conjunction with International Symposium on Computer Architecture (ISCA)}, location={Orlando, FL, USA}, month={June}, date={18}, url={https://hotinfra23.github.io/papers/hotinfra23-paper9.pdf}, series = {HotInfra '23}, year={2023} }
Book
[MIT Press '22] High-Performance Big Data Computing
Dhabaleswar K. Panda, Xiaoyi Lu, Dipti Shankar

The MIT Press, 2022
Abstract: Over the last decade, the exponential explosion of data known as big data has changed the way we understand and harness the power of data. The emerging field of high-performance big data computing, which brings together high-performance computing (HPC), big data processing, and deep learning, aims to meet the challenges posed by large-scale data processing. This book offers an in-depth overview of high-performance big data computing and the associated technical issues, approaches, and solutions. The book covers basic concepts and necessary background knowledge, including data processing frameworks, storage systems, and hardware capabilities; offers a detailed discussion of technical issues in accelerating big data computing in terms of computation, communication, memory and storage, codesign, workload characterization and benchmarking, and system deployment and management; and surveys benchmarks and workloads for evaluating big data middleware systems. It presents a detailed discussion of big data computing systems and applications with high-performance networking, computing, and storage technologies, including state-of-the-art designs for data processing and storage systems. Finally, the book considers some advanced research topics in high-performance big data computing, including designing high-performance deep learning over big data (DLoBD) stacks and HPC cloud technologies.
Cite "High-Performance Big Data Computing"
- Plain text
- BibTeX
Dhabaleswar K. Panda, Xiaoyi Lu, and Dipti Shankar. High-Performance Big Data Computing. MIT Press '22. The MIT Press, 2022. ISBN 9780262046855. URL: https://mitpress.mit.edu/9780262046855/high-performance-big-data-computing/.
@book{book-hpbdc-mit22, place = {Cambridge, MA}, title = "{High-Performance Big Data Computing}", publisher = {The MIT Press}, author = {Panda, Dhabaleswar K. and Lu, Xiaoyi and Shankar, Dipti}, year = {2022}, abstract = {Over the last decade, the exponential explosion of data known as big data has changed the way we understand and harness the power of data. The emerging field of high-performance big data computing, which brings together high-performance computing (HPC), big data processing, and deep learning, aims to meet the challenges posed by large-scale data processing. This book offers an in-depth overview of high-performance big data computing and the associated technical issues, approaches, and solutions. The book covers basic concepts and necessary background knowledge, including data processing frameworks, storage systems, and hardware capabilities; offers a detailed discussion of technical issues in accelerating big data computing in terms of computation, communication, memory and storage, codesign, workload characterization and benchmarking, and system deployment and management; and surveys benchmarks and workloads for evaluating big data middleware systems. It presents a detailed discussion of big data computing systems and applications with high-performance networking, computing, and storage technologies, including state-of-the-art designs for data processing and storage systems. Finally, the book considers some advanced research topics in high-performance big data computing, including designing high-performance deep learning over big data (DLoBD) stacks and HPC cloud technologies.}, isbn = {9780262046855}, url="https://mitpress.mit.edu/9780262046855/high-performance-big-data-computing/", series="MIT Press '22" }
Journal Article
[VLDB '22] A Study of Database Performance Sensitivity to Experiment Settings
Yang Wang, Miao Yu, Yujie Hui, Fang Zhou, Yuyang Huang, Rui Zhu, Xueyuan Ren, Tianxi Li, Xiaoyi Lu

Proc. VLDB Endow., 2022
- Abstract
- Citation
- PDF
- Full text
Abstract: To allow performance comparison across different systems, our community has developed multiple benchmarks, such as TPC-C and YCSB, which are widely used. However, despite such effort, interpreting and comparing performance numbers is still a challenging task, because one can tune benchmark parameters, system features, and hardware settings, which can lead to very different system behaviors. Such tuning creates a long-standing question of whether the conclusion of a work can hold under different settings. This work tries to shed light on this question by reproducing 11 works evaluated under TPC-C and YCSB, measuring their performance under a wider range of settings, and investigating the reasons for the change of performance numbers. By doing so, this paper tries to motivate the discussion about whether and how we should address this problem. While this paper does not give a complete solution (this is beyond the scope of a single paper), it proposes concrete suggestions we can take to improve the state of the art.
Cite "A Study of Database Performance Sensitivity to Experiment Settings"
- Plain text
- BibTeX
Yang Wang, Miao Yu, Yujie Hui, Fang Zhou, Yuyang Huang, Rui Zhu, Xueyuan Ren, Tianxi Li, and Xiaoyi Lu. A Study of Database Performance Sensitivity to Experiment Settings. Proc. VLDB Endow., 15(7):1439–1452, March 2022. URL: https://doi.org/10.14778/3523210.3523221, doi:10.14778/3523210.3523221.
@article{conf-vldb-yang22, author = {Wang, Yang and Yu, Miao and Hui, Yujie and Zhou, Fang and Huang, Yuyang and Zhu, Rui and Ren, Xueyuan and Li, Tianxi and Lu, Xiaoyi}, title = "{A Study of Database Performance Sensitivity to Experiment Settings}", year = {2022}, issue_date = {March 2022}, publisher = {VLDB Endowment}, volume = {15}, number = {7}, issn = {2150-8097}, url = {https://doi.org/10.14778/3523210.3523221}, doi = {10.14778/3523210.3523221}, abstract = {To allow performance comparison across different systems, our community has developed multiple benchmarks, such as TPC-C and YCSB, which are widely used. However, despite such effort, interpreting and comparing performance numbers is still a challenging task, because one can tune benchmark parameters, system features, and hardware settings, which can lead to very different system behaviors. Such tuning creates a long-standing question of whether the conclusion of a work can hold under different settings. This work tries to shed light on this question by reproducing 11 works evaluated under TPC-C and YCSB, measuring their performance under a wider range of settings, and investigating the reasons for the change of performance numbers. By doing so, this paper tries to motivate the discussion about whether and how we should address this problem. While this paper does not give a complete solution---this is beyond the scope of a single paper, it proposes concrete suggestions we can take to improve the state of the art.}, journal = {Proc. VLDB Endow.}, month = {mar}, pages = {1439--1452}, numpages = {14}, series = {VLDB '22}, pdf = {https://www.vldb.org/pvldb/vol15/p1439-wang.pdf} }

[arXiv '22] Arcadia: A Fast and Reliable Persistent Memory Replicated Log

Shashank Gugnani, Scott Guthridge, Frank Schmuck, Owen Anderson, Deepavali Bhagwat, Xiaoyi Lu

CoRR, 2022

Conference Proceedings
[HPDC '22] NVMe-oAF: Towards Adaptive NVMe-oF for IO-Intensive Workloads on HPC Cloud
Arjun Kashyap, Xiaoyi Lu

Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, 2022
- Abstract
- Keywords
- Citation
- PDF
- Full text
Keywords: nvme-over-fabrics, shared memory, hpc cloud, spdk

Abstract: Applications running inside containers or virtual machines traditionally use TCP/IP for communication in HPC clouds and data centers. The TCP/IP path usually becomes a major performance bottleneck for applications performing NVMe-over-Fabrics (NVMe-oF) based I/O operations in disaggregated storage settings. We propose an adaptive communication channel, called NVMe-over-Adaptive-Fabric (NVMe-oAF), that applications could leverage to eliminate the high latency and low bandwidth incurred by remote I/O requests over TCP/IP. NVMe-oAF accelerates I/O intensive applications using locality awareness along with optimized shared memory and TCP/IP paths. The adaptiveness of the fabric stems from the ability to adaptively select the shared memory or TCP channel and further apply optimizations for the chosen channel. To evaluate NVMe-oAF, we co-design Intel's SPDK library with our designs and show up to 7.1x bandwidth improvement and up to 4.2x latency reduction for various workloads over commodity TCP/IP-based Ethernet networks (e.g., 10Gbps, 25Gbps, and 100Gbps). We achieve similar (or sometimes better) performance when compared to NVMe-over-RDMA by avoiding the cumbersome management of RDMA in HPC cloud environments. Finally, we also co-design NVMe-oAF with H5bench to showcase the benefit it brings to HDF5 applications. Our evaluation indicates up to a 7x bandwidth improvement when compared with the network file system (NFS).
Cite "NVMe-oAF: Towards Adaptive NVMe-oF for IO-Intensive Workloads on HPC Cloud"
- Plain text
- BibTeX
Arjun Kashyap and Xiaoyi Lu. NVMe-oAF: Towards Adaptive NVMe-oF for IO-Intensive Workloads on HPC Cloud. In Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, HPDC '22, 56–70. New York, NY, USA, 2022. Association for Computing Machinery. URL: https://doi.org/10.1145/3502181.3531476, doi:10.1145/3502181.3531476.
@inproceedings{conf-hpdc-arjun22, author = {Kashyap, Arjun and Lu, Xiaoyi}, title = {NVMe-oAF: Towards Adaptive NVMe-oF for IO-Intensive Workloads on HPC Cloud}, year = {2022}, isbn = {9781450391993}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3502181.3531476}, doi = {10.1145/3502181.3531476}, abstract = {Applications running inside containers or virtual machines, traditionally use TCP/IP for communication in HPC clouds and data centers. The TCP/IP path usually becomes a major performance bottleneck for applications performing NVMe-over-Fabrics (NVMe-oF) based I/O operations in disaggregated storage settings. We propose an adaptive communication channel, called NVMe-over-Adaptive-Fabric (NVMe-oAF), that applications could leverage to eliminate the high-latency and low-bandwidth incurred by remote I/O requests over TCP/IP. NVMe-oAF accelerates I/O intensive applications using locality awareness along with optimized shared memory and TCP/IP paths. The adaptiveness of the fabric stems from the ability to adaptively select shared memory or TCP channel and further applying optimizations for the chosen channel. To evaluate NVMe-oAF, we co-design Intel's SPDK library with our designs and show up to 7.1x bandwidth improvement and up to 4.2x latency reduction for various workloads over commodity TCP/IP-based Ethernet networks (e.g., 10Gbps, 25Gbps, and 100Gbps). We achieve similar (or sometimes better) performance when compared to NVMe-over-RDMA by avoiding the cumbersome management of RDMA in HPC cloud environments. Finally, we also co-design NVMe-oAF with H5bench to showcase the benefit it brings to HDF5 applications. 
Our evaluation indicates up to a 7x bandwidth improvement when compared with the network file system (NFS).}, booktitle = {Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing}, pages = {56--70}, numpages = {15}, keywords = {nvme-over-fabrics, shared memory, hpc cloud, spdk}, location = {Minneapolis, MN, USA}, pdf = {https://dl.acm.org/doi/pdf/10.1145/3502181.3531476}, series = {HPDC '22} }
Conference Proceedings
[HiPC '22] HiBGT: High-Performance Bayesian Group Testing for COVID-19
Weicong Chen, Xiaoyi Lu, Curtis Tatsuoka

Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022 (Acceptance Rate: 26%, 35/131)
- Citation
- PDF
Cite "HiBGT: High-Performance Bayesian Group Testing for COVID-19"
- Plain text
- BibTeX
Weicong Chen, Xiaoyi Lu, and Curtis Tatsuoka. HiBGT: High-Performance Bayesian Group Testing for COVID-19. In Proceedings of the 29th IEEE International Conference on High Performance Computing, HiPC '22. IEEE, 2022. Acceptance Rate: 26%, 35/131.
@inproceedings{conf-hipc-Chen22, author = {Chen, Weicong and Lu, Xiaoyi and Tatsuoka, Curtis}, booktitle = {Proceedings of the 29th IEEE International Conference on High Performance Computing}, publisher = {IEEE}, title = "{HiBGT: High-Performance Bayesian Group Testing for COVID-19}", series = {HiPC '22}, note = {Acceptance Rate: 26\%, 35/131}, pdf = {https://sites.ucmerced.edu/files/luxi/files/hipc22-cr.pdf}, year = 2022 }
Journal Article
[Biostatistics '22] Bayesian Group Testing with Dilution Effects
Curtis Tatsuoka, Weicong Chen, Xiaoyi Lu

Biostatistics, 2022 (kxac004)
Abstract: A Bayesian framework for group testing under dilution effects has been developed, using lattice-based models. This work has particular relevance given the pressing public health need to enhance testing capacity for coronavirus disease 2019 and future pandemics, and the need for wide-scale and repeated testing for surveillance under constantly varying conditions. The proposed Bayesian approach allows for dilution effects in group testing and for general test response distributions beyond just binary outcomes. It is shown that even under strong dilution effects, an intuitive group testing selection rule that relies on the model order structure, referred to as the Bayesian halving algorithm, has attractive optimal convergence properties. Analogous look-ahead rules that can reduce the number of stages in classification by selecting several pooled tests at a time are proposed and evaluated as well. Group testing is demonstrated to provide great savings over individual testing in the number of tests needed, even for moderately high prevalence levels. However, there is a trade-off with higher number of testing stages, and increased variability. A web-based calculator is introduced to assist in weighing these factors and to guide decisions on when and how to pool under various conditions. High-performance distributed computing methods have also been implemented for considering larger pool sizes, when savings from group testing can be even more dramatic.
Cite "Bayesian Group Testing with Dilution Effects"
- Plain text
- BibTeX
Curtis Tatsuoka, Weicong Chen, and Xiaoyi Lu. Bayesian Group Testing with Dilution Effects. Biostatistics, 04 2022. kxac004. URL: https://doi.org/10.1093/biostatistics/kxac004, eprint: https://academic.oup.com/biostatistics/advance-article-pdf/doi/10.1093/biostatistics/kxac004/43355708/kxac004.pdf, doi:10.1093/biostatistics/kxac004.
@article{journals-biostatistics-ben22, author = {Tatsuoka, Curtis and Chen, Weicong and Lu, Xiaoyi}, title = "{Bayesian Group Testing with Dilution Effects}", journal = {Biostatistics}, year = {2022}, month = {04}, abstract = "{A Bayesian framework for group testing under dilution effects has been developed, using lattice-based models. This work has particular relevance given the pressing public health need to enhance testing capacity for coronavirus disease 2019 and future pandemics, and the need for wide-scale and repeated testing for surveillance under constantly varying conditions. The proposed Bayesian approach allows for dilution effects in group testing and for general test response distributions beyond just binary outcomes. It is shown that even under strong dilution effects, an intuitive group testing selection rule that relies on the model order structure, referred to as the Bayesian halving algorithm, has attractive optimal convergence properties. Analogous look-ahead rules that can reduce the number of stages in classification by selecting several pooled tests at a time are proposed and evaluated as well. Group testing is demonstrated to provide great savings over individual testing in the number of tests needed, even for moderately high prevalence levels. However, there is a trade-off with higher number of testing stages, and increased variability. A web-based calculator is introduced to assist in weighing these factors and to guide decisions on when and how to pool under various conditions. High-performance distributed computing methods have also been implemented for considering larger pool sizes, when savings from group testing can be even more dramatic.}", issn = {1465-4644}, doi = {10.1093/biostatistics/kxac004}, url = {https://doi.org/10.1093/biostatistics/kxac004}, note = {kxac004}, eprint = {https://academic.oup.com/biostatistics/advance-article-pdf/doi/10.1093/biostatistics/kxac004/43355708/kxac004.pdf}, series = {Biostatistics '22} }
Journal Article
[TBench '22] Understanding hot interconnects with an extensive benchmark survey
Yuke Li, Hao Qi, Gang Lu, Feng Jin, Yanfei Guo, Xiaoyi Lu

BenchCouncil Transactions on Benchmarks, Standards and Evaluations, 2022
- Abstract
- Keywords
- Citation
- Full text
Keywords: Benchmarks, Interconnects, RDMA

Abstract: Understanding the designs and performance characterizations of hot interconnects on modern data center and high-performance computing (HPC) clusters is a fruitful research topic in recent years. The rapid and continuous growth of high-bandwidth and low-latency communication requirements for various types of data center and HPC applications (such as big data, deep learning, and microservices) has been pushing the envelope of advanced interconnect designs. We believe this is high time to investigate the performance characterizations of representative hot interconnects with different benchmarks. Hence, this paper presents an extensive survey of state-of-the-art hot interconnects on data center and HPC clusters and the associated representative benchmarks to help the community to better understand modern interconnects. In addition, we characterize these interconnects by the related benchmarks under different application scenarios. We provide our perspectives on benchmarking data center interconnects based on our survey, experiments, and results.
Cite "Understanding hot interconnects with an extensive benchmark survey"
- Plain text
- BibTeX
Yuke Li, Hao Qi, Gang Lu, Feng Jin, Yanfei Guo, and Xiaoyi Lu. Understanding hot interconnects with an extensive benchmark survey. BenchCouncil Transactions on Benchmarks, Standards and Evaluations, 2(3):100074, 2022. URL: https://www.sciencedirect.com/science/article/pii/S2772485922000618, doi:10.1016/j.tbench.2022.100074.
@article{journals-tbench-yukeli22, title = {Understanding hot interconnects with an extensive benchmark survey}, journal = {BenchCouncil Transactions on Benchmarks, Standards and Evaluations}, volume = {2}, number = {3}, pages = {100074}, year = {2022}, issn = {2772-4859}, doi = {10.1016/j.tbench.2022.100074}, url = {https://www.sciencedirect.com/science/article/pii/S2772485922000618}, author = {Yuke Li and Hao Qi and Gang Lu and Feng Jin and Yanfei Guo and Xiaoyi Lu}, series = {TBench '22}, keywords = {Benchmarks, Interconnects, RDMA}, abstract = {Understanding the designs and performance characterizations of hot interconnects on modern data center and high-performance computing (HPC) clusters is a fruitful research topic in recent years. The rapid and continuous growth of high-bandwidth and low-latency communication requirements for various types of data center and HPC applications (such as big data, deep learning, and microservices) has been pushing the envelope of advanced interconnect designs. We believe this is high time to investigate the performance characterizations of representative hot interconnects with different benchmarks. Hence, this paper presents an extensive survey of state-of-the-art hot interconnects on data center and HPC clusters and the associated representative benchmarks to help the community to better understand modern interconnects. In addition, we characterize these interconnects by the related benchmarks under different application scenarios. We provide our perspectives on benchmarking data center interconnects based on our survey, experiments, and results.} }
Conference Proceedings
[Bench '22] Benchmarking Object Detection Models with Mummy Nuts Dataset
Darren Ng, Colin Schmierer, Andrew Lin, Zeyu Liu, Falin Yu, Shawn Newsam, Reza Ehsani, Xiaoyi Lu

Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2022
- Citation
Cite "Benchmarking Object Detection Models with Mummy Nuts Dataset"
- Plain text
- BibTeX
Darren Ng, Colin Schmierer, Andrew Lin, Zeyu Liu, Falin Yu, Shawn Newsam, Reza Ehsani, and Xiaoyi Lu. Benchmarking Object Detection Models with Mummy Nuts Dataset. In Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, Bench '22. Online, 2022. Springer International Publishing.
@inproceedings{conf-bench-darren22, Address = {Online}, Author = {Ng, Darren and Schmierer, Colin and Lin, Andrew and Liu, Zeyu and Yu, Falin and Newsam, Shawn and Ehsani, Reza and Lu, Xiaoyi}, Booktitle = {Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing}, Publisher = {Springer International Publishing}, Title = "{Benchmarking Object Detection Models with Mummy Nuts Dataset}", series = {Bench '22}, video = {https://www.youtube.com/watch?v=tjaZa7lZq3Q&ab_channel=BenchCouncil}, Year = {2022} }
Conference Proceedings
[IPDPS '21] NVMe-CR: A Scalable Ephemeral Storage Runtime for Checkpoint/Restart with NVMe-over-Fabrics
Shashank Gugnani, Tianxi Li, Xiaoyi Lu

Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2021
Keywords: Checkpoint/Restart, NVMe, NVMf, Exascale

Abstract: Emerging SSDs with NVMe-over-Fabrics (NVMf) support provide new opportunities to significantly improve the performance of IO-intensive HPC applications. However, state-of-the-art parallel filesystems cannot extract the best possible performance from fast NVMe SSDs and are not designed for latency-critical ephemeral IO tasks, such as checkpoint/restart. In this paper, we propose a powerful abstraction called microfs to peel away unnecessary software layers and eliminate namespace coordination. Building upon this abstraction, we present the design of NVMe-CR, a scalable ephemeral storage runtime for clusters with disaggregated compute and storage. NVMe-CR proposes techniques like metadata provenance, log record coalescing, and logically isolated shared device access, built around the microfs abstraction, to reduce the overhead of writing millions of concurrent checkpoint files. NVMe-CR utilizes high-density all-flash arrays accessible via NVMf to absorb bursty checkpoint IO and increase the progress rates of HPC applications obliviously. Using the ECP CoMD application as a use case, results show that on a local cluster our runtime can achieve near perfect (> 0.96) efficiency at 448 processes. Moreover, our designs can reduce checkpoint overhead by as much as 2x compared to state-of-the-art storage systems.
Cite "NVMe-CR: A Scalable Ephemeral Storage Runtime for Checkpoint/Restart with NVMe-over-Fabrics"
- Plain text
- BibTeX
Shashank Gugnani, Tianxi Li, and Xiaoyi Lu. Nvme-cr: a scalable ephemeral storage runtime for checkpoint/restart with nvme-over-fabrics. In Proceedings of IEEE International Parallel and Distributed Processing Symposium, IPDPS '21. IEEE Computer Society, 2021.
@inproceedings{conf-ipdps-shashank21, author = {Gugnani, Shashank and Li, Tianxi and Lu, Xiaoyi}, Date-Added = {2021-02-22 19:42:43 +0000}, Date-Modified = {2021-02-03 04:02:36 +0000}, booktitle = {Proceedings of IEEE International Parallel and Distributed Processing Symposium}, crossref = {conf/ipdps/2021}, keywords = {Checkpoint/Restart, NVMe, NVMf, Exascale}, publisher = {IEEE Computer Society}, title = {NVMe-CR: A Scalable Ephemeral Storage Runtime for Checkpoint/Restart with NVMe-over-Fabrics}, abstract = {Emerging SSDs with NVMe-over-Fabrics (NVMf) support provide new opportunities to significantly improve the performance of IO-intensive HPC applications. However, state-of-the-art parallel filesystems can not extract the best possible performance from fast NVMe SSDs and are not designed for latency-critical ephemeral IO tasks, such as checkpoint/restart. In this paper, we propose a powerful abstraction called microfs to peel away unnecessary software layers and eliminate namespace coordination. Building upon this abstraction, we present the design of NVMe-CR, a scalable ephemeral storage runtime for clusters with disaggregated compute and storage. NVMe-CR proposes techniques like metadata provenance, log record coalescing, and logically isolated shared device access, built around the microfs abstraction, to reduce the overhead of writing millions of concurrent checkpoint files. NVMe-CR utilizes high-density all-flash arrays accessible via NVMf to absorb bursty checkpoint IO and increase the progress rates of HPC applications obliviously. Using the ECP CoMD application as a use case, results show that on a local cluster our runtime can achieve near perfect (> 0.96) efficiency at 448 processes. Moreover, our designs can reduce checkpoint overhead by as much as 2x compared to state-of-the-art storage systems.}, series = {IPDPS '21}, year = {2021} }
Conference Proceedings
[WORDS '21] Towards Offloadable and Migratable Microservices on Disaggregated Architectures: Vision, Challenges, and Research Roadmap
Xiaoyi Lu, Arjun Kashyap

The Second Workshop On Resource Disaggregation and Serverless (WORDS’21), co-located with ASPLOS 2021, 2021 (Vision Paper)
Keywords: Offloadable, Migratable, Microservice, Disaggregated Architectures
Cite "Towards Offloadable and Migratable Microservices on Disaggregated Architectures: Vision, Challenges, and Research Roadmap"
- Plain text
- BibTeX
Xiaoyi Lu and Arjun Kashyap. Towards offloadable and migratable microservices on disaggregated architectures: vision, challenges, and research roadmap. In The Second Workshop On Resource Disaggregation and Serverless (WORDS’21), co-located with ASPLOS 2021, WORDS '21. ACM, 2021. Vision Paper. URL: https://wuklab.github.io/words/words21-lu.pdf.
@inproceedings{conf-words-ms, added-at = {2021-04-16T00:00:00.000+0200}, author = {Lu, Xiaoyi and Kashyap, Arjun}, booktitle = {The Second Workshop On Resource Disaggregation and Serverless (WORDS’21), co-located with ASPLOS 2021}, crossref = {conf/words/2021}, keywords = {Offloadable, Migratable, Microservice, Disaggregated Architectures}, publisher = {ACM}, timestamp = {2021-06-25T11:40:11.000+0200}, title = {Towards Offloadable and Migratable Microservices on Disaggregated Architectures: Vision, Challenges, and Research Roadmap}, url = {https://wuklab.github.io/words/words21-lu.pdf}, note = {Vision Paper}, series = {WORDS '21}, year = 2021 }
Conference Proceedings
[HPDC '21] DStore: A Fast, Tailless, and Quiescent-Free Object Store for PMEM
Shashank Gugnani, Xiaoyi Lu

Proceedings of International ACM Symposium on High Performance and Distributed Computing, 2021
- Keywords
- Citation
Keywords: Tailless, Quiescent-Free, Object Store, PMEM
Cite "DStore: A Fast, Tailless, and Quiescent-Free Object Store for PMEM"
- Plain text
- BibTeX
Shashank Gugnani and Xiaoyi Lu. Dstore: a fast, tailless, and quiescent-free object store for pmem. In Proceedings of International ACM Symposium on High Performance and Distributed Computing, HPDC '21. ACM, 2021.
@inproceedings{conf-hpdc-dstore, added-at = {2021-06-25T00:00:00.000+0200}, author = {Gugnani, Shashank and Lu, Xiaoyi}, booktitle = {Proceedings of International ACM Symposium on High Performance and Distributed Computing}, crossref = {conf/hpdc/2021}, keywords = {Tailless, Quiescent-Free, Object Store, PMEM}, publisher = {ACM}, timestamp = {2021-06-25T11:40:11.000+0200}, title = {DStore: A Fast, Tailless, and Quiescent-Free Object Store for PMEM}, series = {HPDC '21}, year = 2021 }

[VLDB '21] Understanding the Idiosyncrasies of Real Persistent Memory

Shashank Gugnani, Arjun Kashyap, Xiaoyi Lu

Proceedings of the VLDB Endowment, 2021

Abstract: High capacity persistent memory (PMEM) is finally commercially available in the form of Intel’s Optane DC Persistent Memory Module (DCPMM). Researchers have raced to evaluate and understand the performance of DCPMM itself as well as systems and applications designed to leverage PMEM resulting from over a decade of research. Early evaluations of DCPMM show that its behavior is more nuanced and idiosyncratic than previously thought. Several assumptions made about its performance that guided the design of PMEM-enabled systems have been shown to be incorrect. Unfortunately, several peculiar performance characteristics of DCPMM are related to the memory technology (3D-XPoint) used and its internal architecture. It is expected that other technologies (such as STT-RAM, memristor, ReRAM, NVDIMM), with highly variable characteristics, will be commercially shipped as PMEM in the near future. Current evaluation studies fail to understand and categorize the idiosyncratic behavior of PMEM; i.e., how do the peculiarities of DCPMM relate to other classes of PMEM. Clearly, there is a need for a study which can guide the design of systems and is agnostic to PMEM technology and internal architecture. In this paper, we first list and categorize the idiosyncratic behavior of PMEM by performing targeted experiments with our proposed PMIdioBench benchmark suite on a real DCPMM platform. Next, we conduct detailed studies to guide the design of storage systems, considering generic PMEM characteristics. The first study guides data placement on NUMA systems with PMEM while the second study guides the design of lock-free data structures, for both eADR- and ADR-enabled PMEM systems. Our results are often counter-intuitive and highlight the challenges of system design with PMEM.

Conference Proceedings
[SC '21] HatRPC: Hint-Accelerated Thrift RPC over RDMA
Tianxi Li, Haiyang Shi, Xiaoyi Lu

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021
- Citation
Cite "HatRPC: Hint-Accelerated Thrift RPC over RDMA"
- Plain text
- BibTeX
Tianxi Li, Haiyang Shi, and Xiaoyi Lu. Hatrpc: hint-accelerated thrift rpc over rdma. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '21. New York, NY, USA, 2021. Association for Computing Machinery.
@inproceedings{conf-sc-tianxi21, Author = {Li, Tianxi and Shi, Haiyang and Lu, Xiaoyi}, Booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, Location = {St. Louis, Missouri}, Publisher = {Association for Computing Machinery}, Address = {New York, NY, USA}, Series = {SC '21}, Title = {HatRPC: Hint-Accelerated Thrift RPC over RDMA}, Year = {2021} }
Conference Proceedings
[SEC '21] Characterizing and Accelerating End-to-End EdgeAI Inference Systems for Object Detection Applications
Yujie Hui, Jeffrey Lien, Xiaoyi Lu

2021 IEEE/ACM Symposium on Edge Computing (SEC), 2021
- Citation
Cite "Characterizing and Accelerating End-to-End EdgeAI Inference Systems for Object Detection Applications"
- Plain text
- BibTeX
Yujie Hui, Jeffrey Lien, and Xiaoyi Lu. Characterizing and accelerating end-to-end edgeai inference systems for object detection applications. In 2021 IEEE/ACM Symposium on Edge Computing (SEC), SEC '21, 01-12. 2021. doi:10.1145/3453142.3491294.
@INPROCEEDINGS{conf-sec-yujie21, author={Hui, Yujie and Lien, Jeffrey and Lu, Xiaoyi}, booktitle={2021 IEEE/ACM Symposium on Edge Computing (SEC)}, title={Characterizing and Accelerating End-to-End EdgeAI Inference Systems for Object Detection Applications}, year={2021}, volume={}, number={}, pages={01-12}, doi={10.1145/3453142.3491294}, series = {SEC '21} }
Conference Proceedings
[Bench '20] Impact of Commodity Networks on Storage Disaggregation with NVMe-oF
Arjun Kashyap, Shashank Gugnani, Xiaoyi Lu

Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2020
- Citation
Cite "Impact of Commodity Networks on Storage Disaggregation with NVMe-oF"
- Plain text
- BibTeX
Arjun Kashyap, Shashank Gugnani, and Xiaoyi Lu. Impact of commodity networks on storage disaggregation with nvme-of. In Wanling Gao, Jianfeng Zhan, and Felix Wolf, editors, Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, Bench '20. Cham, 2020. Springer International Publishing.
@inproceedings{conf-bench-arjun20, Address = {Cham}, Author = {Kashyap, Arjun and Gugnani, Shashank and Lu, Xiaoyi}, Booktitle = {Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing}, Da = {2020//}, Date-Added = {2021-01-26 19:45:38 +0000}, Date-Modified = {2021-02-03 04:03:02 +0000}, Editor = {Gao, Wanling and Zhan, Jianfeng and Wolf, Felix}, Publisher = {Springer International Publishing}, Title = {Impact of Commodity Networks on Storage Disaggregation with NVMe-oF}, Ty = {CONF}, series = {Bench '20}, Year = {2020}}
Conference Proceedings
[SC '20] RDMP-KV: Designing Remote Direct Memory Persistence Based Key-Value Stores with PMEM
Tianxi Li, Dipti Shankar, Shashank Gugnani, Xiaoyi Lu

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2020
Keywords: next generation networking, data storage systems, persistent memory, key-value stores, RDMA

Abstract: Byte-addressable persistent memory (PMEM) can be directly manipulated by Remote Direct Memory Access (RDMA) capable networks. However, existing studies to combine RDMA and PMEM can not deliver the desired performance due to their PMEM-oblivious communication protocols. In this paper, we propose novel PMEM-aware RDMA-based communication protocols for persistent key-value stores, referred to as Remote Direct Memory Persistence based Key-Value stores (RDMP-KV). RDMP-KV employs a hybrid 'server-reply/server-bypass' approach to 'durably' store individual key-value objects on PMEM-equipped servers. RDMP-KV's runtime can easily adapt to existing (server-assisted durability) and emerging (appliance durability) RDMA-capable interconnects, while ensuring server scalability through a lightweight consistency scheme. Performance evaluations show that RDMP-KV can improve the server-side performance with different persistent key-value storage architectures by up to 22x, as compared with PMEM-oblivious RDMA-'Server-Reply' protocols. Our evaluations also show that RDMP-KV outperforms a distributed PMEM-based filesystem by up to 65% and a recent RDMA-to-PMEM framework by up to 71%.
Cite "RDMP-KV: Designing Remote Direct Memory Persistence Based Key-Value Stores with PMEM"
- Plain text
- BibTeX
Tianxi Li, Dipti Shankar, Shashank Gugnani, and Xiaoyi Lu. Rdmp-kv: designing remote direct memory persistence based key-value stores with pmem. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '20. IEEE Press, 2020.
@inproceedings{conf-sc-tianxi20, Abstract = {Byte-addressable persistent memory (PMEM) can be directly manipulated by Remote Direct Memory Access (RDMA) capable networks. However, existing studies to combine RDMA and PMEM can not deliver the desired performance due to their PMEM-oblivious communication protocols. In this paper, we propose novel PMEM-aware RDMA-based communication protocols for persistent key-value stores, referred to as Remote Direct Memory Persistence based Key-Value stores (RDMP-KV). RDMP-KV employs a hybrid 'server-reply/server-bypass' approach to 'durably' store individual key-value objects on PMEM-equipped servers. RDMP-KV's runtime can easily adapt to existing (server-assisted durability) and emerging (appliance durability) RDMA-capable interconnects, while ensuring server scalability through a lightweight consistency scheme. Performance evaluations show that RDMP-KV can improve the server-side performance with different persistent key-value storage architectures by up to 22x, as compared with PMEM-oblivious RDMA-'Server-Reply' protocols. Our evaluations also show that RDMP-KV outperforms a distributed PMEM-based filesystem by up to 65% and a recent RDMA-to-PMEM framework by up to 71%.}, Articleno = {52}, Author = {Li, Tianxi and Shankar, Dipti and Gugnani, Shashank and Lu, Xiaoyi}, Booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, Date-Added = {2021-01-26 19:35:59 +0000}, Date-Modified = {2021-02-03 04:03:50 +0000}, Isbn = {9781728199986}, Keywords = {next generation networking, data storage systems, persistent memory, key-value stores, RDMA}, Location = {Atlanta, Georgia}, Publisher = {IEEE Press}, Series = {SC '20}, Title = {RDMP-KV: Designing Remote Direct Memory Persistence Based Key-Value Stores with PMEM}, Year = {2020}}
Journal Article
[JCST '20] CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance
Zheng-Hao Jin, Haiyang Shi, Ying-Xin Hu, Li Zha, Xiaoyi Lu

Journal of Computer Science and Technology, 2020
Keywords: CirroData;high performance;SQL-on-Hadoop;online analytical processing (OLAP);Big Data
Cite "CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance"
- Plain text
- BibTeX
Zheng-Hao Jin, Haiyang Shi, Ying-Xin Hu, Li Zha, and Xiaoyi Lu. Cirrodata: yet another sql-on-hadoop data analytics engine with high performance. Journal of Computer Science and Technology, 35(1):194, 2020. URL: http://jcst.ict.ac.cn/EN/abstract/article_2598.shtml, doi:10.1007/s11390-020-9536-z.
@article{journals-jcst-JinSHZL20, author = {Jin, Zheng-Hao and Shi, Haiyang and Hu, Ying-Xin and Zha, Li and Lu, Xiaoyi}, title = {CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance}, publisher = {Journal of Computer Science and Technology}, year = {2020}, journal = {Journal of Computer Science and Technology}, volume = {35}, number = {1}, eid = {194}, numpages = {14}, pages = {194}, keywords = {CirroData;high performance;SQL-on-Hadoop;online analytical processing (OLAP);Big Data}, url = {http://jcst.ict.ac.cn/EN/abstract/article_2598.shtml}, series = {JCST '20}, doi = {10.1007/s11390-020-9536-z} }
Conference Proceedings
[SC '20] INEC: Fast and Coherent in-Network Erasure Coding
Haiyang Shi, Xiaoyi Lu

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2020
Keywords: erasure coding, fault tolerance, in-network computing, next generation networking

Abstract: Erasure coding (EC) is a promising fault tolerance scheme that has been applied to many well-known distributed storage systems. The capability of Coherent EC Calculation and Networking on modern SmartNICs has demonstrated that EC will be an essential feature of in-network computing. In this paper, we propose a set of coherent in-network EC primitives, named INEC. Our analyses based on the proposed α-β performance model demonstrate that INEC primitives can enable different kinds of EC schemes to fully leverage the EC offload capability on modern SmartNICs. We implement INEC on commodity RDMA NICs and integrate it into five state-of-the-art EC schemes. Our experiments show that INEC primitives significantly reduce 50th, 95th, and 99th percentile latencies, and accelerate the end-to-end throughput, write, and degraded read performance of the key-value store co-designed with INEC by up to 99.57%, 47.30%, and 49.55%, respectively.
Cite "INEC: Fast and Coherent in-Network Erasure Coding"
- Plain text
- BibTeX
Haiyang Shi and Xiaoyi Lu. Inec: fast and coherent in-network erasure coding. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '20. IEEE Press, 2020.
@inproceedings{conf-sc-haiyang20, author = {Shi, Haiyang and Lu, Xiaoyi}, title = {INEC: Fast and Coherent in-Network Erasure Coding}, year = {2020}, isbn = {9781728199986}, publisher = {IEEE Press}, abstract = {Erasure coding (EC) is a promising fault tolerance scheme that has been applied to many well-known distributed storage systems. The capability of Coherent EC Calculation and Networking on modern SmartNICs has demonstrated that EC will be an essential feature of in-network computing. In this paper, we propose a set of coherent in-network EC primitives, named INEC. Our analyses based on the proposed α-β performance model demonstrate that INEC primitives can enable different kinds of EC schemes to fully leverage the EC offload capability on modern SmartNICs. We implement INEC on commodity RDMA NICs and integrate it into five state-of-the-art EC schemes. Our experiments show that INEC primitives significantly reduce 50th, 95th, and 99th percentile latencies, and accelerate the end-to-end throughput, write, and degraded read performance of the key-value store co-designed with INEC by up to 99.57%, 47.30%, and 49.55%, respectively.}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, articleno = {66}, numpages = {17}, keywords = {erasure coding, fault tolerance, in-network computing, next generation networking}, location = {Atlanta, Georgia}, series = {SC '20} }

[Bench '19] Early Experience in Benchmarking Edge AI Processors with Object Detection Workloads (Best Paper Award Nomination)

Yujie Hui, Jeffrey Lien, Xiaoyi Lu

Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2019

Conference Proceedings
[NVMW '19] Accelerating NVRAM-aware In-Memory Datastore with Remote Direct Memory Persistence
Dipti Shankar, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of Annual Non-Volatile Memories Workshop, 2019 (Poster Paper)
- Keywords
- Citation
Keywords: Non-Volatile Memory, In-Memory Datastore, Remote Direct Memory Persistence
Cite "Accelerating NVRAM-aware In-Memory Datastore with Remote Direct Memory Persistence"
- Plain text
- BibTeX
Dipti Shankar, Xiaoyi Lu, and Dhabaleswar K. Panda. Accelerating nvram-aware in-memory datastore with remote direct memory persistence. In Proceedings of Annual Non-Volatile Memories Workshop, NVMW '19. 2019. Poster Paper.
@inproceedings{conf-nvmw-shankar19, author = {Shankar, Dipti and Lu, Xiaoyi and Panda, Dhabaleswar K.}, Date-Added = {2019-02-22 19:42:43 +0000}, booktitle = {Proceedings of Annual Non-Volatile Memories Workshop}, crossref = {conf/nvmw/2019}, keywords = {Non-Volatile Memory, In-Memory Datastore, Remote Direct Memory Persistence}, title = {Accelerating NVRAM-aware In-Memory Datastore with Remote Direct Memory Persistence}, series = {NVMW '19}, note = {Poster Paper}, year = {2019} }
Conference Proceedings
[SC '19] Three-Dimensional Characterization on Edge AI Processors with Object Detection Workloads
Yujie Hui, Jeffrey Lien, Xiaoyi Lu

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019 (Poster Paper)
- Keywords
- Citation
Keywords: Edge Computing, AI, Object Detection
Cite "Three-Dimensional Characterization on Edge AI Processors with Object Detection Workloads"
- Plain text
- BibTeX
Yujie Hui, Jeffrey Lien, and Xiaoyi Lu. Three-dimensional characterization on edge ai processors with object detection workloads. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '19. New York, NY, USA, 2019. Association for Computing Machinery. Poster Paper.
@inproceedings{conf-sc-hui19, author = {Hui, Yujie and Lien, Jeffrey and Lu, Xiaoyi}, title = {Three-Dimensional Characterization on Edge AI Processors with Object Detection Workloads}, year = {2019}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, keywords = {Edge Computing, AI, Object Detection}, location = {Denver, Colorado}, note = {Poster Paper}, series = {SC '19} }
Journal Article
[TPDS '19] Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast
Ching-Hsiang Chu, Xiaoyi Lu, Ammar Awan, Hari Subramoni, Bracy Elton, Dhabaleswar K. Panda

IEEE Transactions on Parallel and Distributed Systems, 2019
Keywords: graphics processing units;learning (artificial intelligence);message passing;neural nets;parallel processing;performance evaluation;model-oriented analysis;GPU clusters;streaming-based broadcast schemes;InfiniBand hardware multicast;IB-MCAST;NVIDIA GPUDirect technology;streaming learning applications;message transmission;deep learning;high-performance computing;graphics processing unit-based applications;message passing interface;remote direct memory access technology;GPUDirect RDMA technology;HPC clusters;performance evaluation;Graphics processing units;Hardware;Analytical models;Machine learning;Clustering algorithms;Scalability;Bandwidth;Broadcast;deep learning;hardware multicast;GPU;GPUDirect RDMA;heterogeneous broadcast;streaming

Abstract: Broadcast is a widely used operation in many streaming and deep learning applications to disseminate large amounts of data on emerging heterogeneous High-Performance Computing (HPC) systems. However, traditional broadcast schemes do not fully utilize hardware features for Graphics Processing Unit (GPU)-based applications. In this paper, a model-oriented analysis is presented to identify performance bottlenecks of existing broadcast schemes on GPU clusters. Next, streaming-based broadcast schemes are proposed to exploit InfiniBand hardware multicast (IB-MCAST) and NVIDIA GPUDirect technology for efficient message transmission. The proposed designs are evaluated in the context of using Message Passing Interface (MPI) based benchmarks and applications. The experimental results indicate improved scalability and up to 82 percent reduction of latency compared to the state-of-the-art solutions in the benchmark-level evaluation. Furthermore, compared to the state-of-the-art, the proposed design yields stable higher throughput for a synthetic streaming workload, and 1.3x faster training time for a deep learning framework.
Cite "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast"
- Plain text
- BibTeX
Ching-Hsiang Chu, Xiaoyi Lu, Ammar Awan, Hari Subramoni, Bracy Elton, and Dhabaleswar K. Panda. Exploiting hardware multicast and gpudirect rdma for efficient broadcast. IEEE Transactions on Parallel and Distributed Systems, 30(3):575-588, March 2019. doi:10.1109/TPDS.2018.2867222.
@ARTICLE{journals-debu-LuSP17, author={Chu, Ching-Hsiang and Lu, Xiaoyi and Awan, Ammar and Subramoni, Hari and Elton, Bracy and Panda, Dhabaleswar K.}, journal={IEEE Transactions on Parallel and Distributed Systems}, title={Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast}, year={2019}, volume={30}, number={3}, pages={575-588}, abstract={Broadcast is a widely used operation in many streaming and deep learning applications to disseminate large amounts of data on emerging heterogeneous High-Performance Computing (HPC) systems. However, traditional broadcast schemes do not fully utilize hardware features for Graphics Processing Unit (GPU)-based applications. In this paper, a model-oriented analysis is presented to identify performance bottlenecks of existing broadcast schemes on GPU clusters. Next, streaming-based broadcast schemes are proposed to exploit InfiniBand hardware multicast (IB-MCAST) and NVIDIA GPUDirect technology for efficient message transmission. The proposed designs are evaluated in the context of using Message Passing Interface (MPI) based benchmarks and applications. The experimental results indicate improved scalability and up to 82 percent reduction of latency compared to the state-of-the-art solutions in the benchmark-level evaluation. 
Furthermore, compared to the state-of-the-art, the proposed design yields stable higher throughput for a synthetic streaming workload, and 1.3x faster training time for a deep learning framework.}, keywords={graphics processing units;learning (artificial intelligence);message passing;neural nets;parallel processing;performance evaluation;model-oriented analysis;GPU clusters;streaming-based broadcast schemes;InfiniBand hardware multicast;IB-MCAST;NVIDIA GPUDirect technology;streaming learning applications;message transmission;deep learning;high-performance computing;graphics processing unit-based applications;message passing interface;remote direct memory access technology;GPUDirect RDMA technology;HPC clusters;performance evaluation;Graphics processing units;Hardware;Analytical models;Machine learning;Clustering algorithms;Scalability;Bandwidth;Broadcast;deep learning;hardware multicast;GPU;GPUDirect RDMA;heterogeneous broadcast;streaming}, doi={10.1109/TPDS.2018.2867222}, ISSN={1558-2183}, series={TPDS '19}, month={March},}

[THPC '19] Performance analysis of deep learning workloads using roofline trajectories

M. Haseeb Javed, Khaled Z. Ibrahim, Xiaoyi Lu

CCF Transactions on High Performance Computing, 2019 (Invited Paper)

Abstract: Over the last decade, technologies derived from convolutional neural networks (CNNs) called Deep Learning applications, have revolutionized fields as diverse as cancer detection, self-driving cars, virtual assistants, etc. However, many users of such applications are not experts in Machine Learning itself. Consequently, there is limited knowledge among the community to run such applications in an optimized manner. The performance question for Deep Learning applications has typically been addressed by employing bespoke hardware (e.g., GPUs) better suited for such compute-intensive operations. However, such a degree of performance is only accessible at increasingly high financial costs leaving only big corporations and governments with resources sufficient enough to employ them at a large scale. As a result, an average user is only left with access to commodity clusters with, in many cases, only CPUs as the sole processing element. For such users to make effective use of resources at their disposal, concerted efforts are necessary to figure out optimal hardware and software configurations. This study is one such step in this direction as we use the Roofline model to perform a systematic analysis of representative CNN models and identify opportunities for black box and application-aware optimizations. Using the findings from our study, we are able to obtain up to 3.5× speedup compared to vanilla TensorFlow with default configurations.

[IISWC '19] SimdHT-Bench: Characterizing SIMD-Aware Hash Table Designs on Emerging CPU Architectures (Best Paper Award Nomination)

Dipti Shankar, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE International Symposium on Workload Characterization, 2019

[SC '19] TriEC: Tripartite Graph Based Erasure Coding NIC Offload (Best Student Paper Finalist)

Haiyang Shi, Xiaoyi Lu

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019

Keywords: NIC offload, tripartite, bipartite, erasure coding

Abstract: Erasure Coding (EC) NIC offload is a promising technology for designing next-generation distributed storage systems. However, this paper has identified three major limitations of current-generation EC NIC offload schemes on modern SmartNICs. Thus, this paper proposes a new EC NIC offload paradigm based on the tripartite graph model, namely TriEC. TriEC supports both encode-and-send and receive-and-decode operations efficiently. Through theorem-based proofs, co-designs with memcached (i.e., TriEC-Cache), and extensive experiments, we show that TriEC is correct and can deliver better performance than the state-of-the-art EC NIC offload schemes (i.e., BiEC). Benchmark evaluations demonstrate that TriEC outperforms BiEC by up to 1.82x and 2.33x for encoding and recovering, respectively. With extended YCSB workloads, TriEC reduces the average write latency by up to 23.2

[HPDC '19] UMR-EC: A Unified and Multi-Rail Erasure Coding Library for High-Performance Distributed Storage Systems

Haiyang Shi, Xiaoyi Lu, Dipti Shankar, Dhabaleswar K. Panda

Proceedings of International ACM Symposium on High Performance and Distributed Computing, 2019

[HiPC '19] SCOR-KV: SIMD-Aware Client-Centric and Optimistic RDMA-Based Key-Value Store for Emerging CPU Architectures

Dipti Shankar, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on High Performance Computing, 2019

Conference Proceedings
[SC '19] Designing High-Performance Erasure Coding Schemes for Next-Generation Storage Systems
Haiyang Shi, Xiaoyi Lu

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019 (ACM Student Research Competition Poster)
- Keywords
- Citation
Keywords: Erasure Coding, Storage Systems, SmartNIC, GPGPU
Cite "Designing High-Performance Erasure Coding Schemes for Next-Generation Storage Systems"
- Plain text
- BibTeX
Haiyang Shi and Xiaoyi Lu. Designing high-performance erasure coding schemes for next-generation storage systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '19. New York, NY, USA, 2019. Association for Computing Machinery. ACM Student Research Competition Poster.
@inproceedings{conf-sc-shi19-poster, author = {Shi, Haiyang and Lu, Xiaoyi}, title = {Designing High-Performance Erasure Coding Schemes for Next-Generation Storage Systems}, year = {2019}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, keywords = {Erasure Coding, Storage Systems, SmartNIC, GPGPU}, location = {Denver, Colorado}, note = {ACM Student Research Competition Poster}, series = {SC '19} }

[IPDPS '19] C-GDR: High-Performance Container-Aware GPUDirect MPI Communication Schemes on RDMA Networks

Jie Zhang, Xiaoyi Lu, Ching-Hsiang Chu, Dhabaleswar K. Panda

Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2019

[HiPC '18] OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training

Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on High Performance Computing, 2018

[FITEE '18] Networking and communication challenges for post-exascale systems

Dhabaleswar Panda, Xiaoyi Lu, Hari Subramoni

Frontiers of Information Technology & Electronic Engineering, 2018 (Invited Vision Paper)

Abstract: With the significant advancement in emerging processor, memory, and networking technologies, exascale systems will become available in the next few years (2020–2022). As the exascale systems begin to be deployed and used, there will be a continuous demand to run next-generation applications with finer granularity, finer time-steps, and increased data sizes. Based on historical trends, next-generation applications will require post-exascale systems during 2025–2035. In this study, we focus on the networking and communication challenges for post-exascale systems. Firstly, we present an envisioned architecture for post-exascale systems. Secondly, the challenges are summarized from different perspectives: heterogeneous networking technologies, high-performance communication and synchronization protocols, integrated support with accelerators and field-programmable gate arrays, fault-tolerance and quality-of-service support, energy-aware communication schemes and protocols, software-defined networking, and scalable communication protocols with heterogeneous memory and storage. Thirdly, we present the challenges in designing efficient programming model support for high-performance computing, big data, and deep learning on these systems. Finally, we emphasize the critical need for co-designing runtime with upper layers on these systems to achieve the maximum performance and scalability.

Journal Article
[TMSCS '18] DLoBD: A Comprehensive Study of Deep Learning over Big Data Stacks on HPC Clusters
Xiaoyi Lu, Haiyang Shi, Rajarshi Biswas, M. Haseeb Javed, Dhabaleswar K. Panda

IEEE Transactions on Multi-Scale Computing Systems, 2018
Keywords: machine learning;big data;graphics processing units;training data;parallel processing;deep learning
Cite "DLoBD: A Comprehensive Study of Deep Learning over Big Data Stacks on HPC Clusters"
- Plain text
- BibTeX
Xiaoyi Lu, Haiyang Shi, Rajarshi Biswas, M. Haseeb Javed, and Dhabaleswar K. Panda. Dlobd: a comprehensive study of deep learning over big data stacks on hpc clusters. IEEE Transactions on Multi-Scale Computing Systems, 4(04):635-648, oct 2018. doi:10.1109/TMSCS.2018.2845886.
@ARTICLE {journals-tmscs-LuSBJP18, author = {Lu, Xiaoyi and Shi, Haiyang and Biswas, Rajarshi and Javed, M. Haseeb and Panda, Dhabaleswar K.}, journal = {IEEE Transactions on Multi-Scale Computing Systems}, title = {DLoBD: A Comprehensive Study of Deep Learning over Big Data Stacks on HPC Clusters}, year = {2018}, volume = {4}, number = {04}, issn = {2332-7766}, pages = {635-648}, keywords = {machine learning;big data;graphics processing units;training data;parallel processing;deep learning}, doi = {10.1109/TMSCS.2018.2845886}, publisher = {IEEE Computer Society}, address = {Los Alamitos, CA, USA}, series = {TMSCS '18}, month = {oct} }

[Bench '18] HPC AI500: A Benchmark Suite for HPC AI Systems.

Zihan Jiang, Wanling Gao, Lei Wang, Xingwang Xiong, Yuchen Zhang, Xu Wen, Chunjie Luo, Hainan Ye, Xiaoyi Lu, Yunquan Zhang, Shengzhong Feng, Kenli Li, Weijia Xu, Jianfeng Zhan

Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2018

Keywords: dblp

[BigData '18] Spark-uDAPL: Cost-Saving Big Data Analytics on Microsoft Azure Cloud with RDMA Networks*.

Xiaoyi Lu, Dipti Shankar, Haiyang Shi, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Big Data, 2018 (Short Paper)

Keywords: dblp

Conference Proceedings
[BPOE '18] Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences
Rajarshi Biswas, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2018
Cite "Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences"
- Plain text
- BibTeX
Rajarshi Biswas, Xiaoyi Lu, and Dhabaleswar K. Panda. Designing a micro-benchmark suite to evaluate grpc for tensorflow: early experiences. In Proceedings of International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, BPOE '18. 2018. URL: https://arxiv.org/abs/1804.01138.
@inproceedings{conf-bpoe-rajarshi, author = {Biswas, Rajarshi and Lu, Xiaoyi and Panda, Dhabaleswar K.}, booktitle = {Proceedings of International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware}, crossref = {conf/bpoe/2018}, title = {Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences}, url = {https://arxiv.org/abs/1804.01138}, pdf = {https://arxiv.org/pdf/1804.01138.pdf}, series = {BPOE '18}, year = 2018 }

[EuroMPI '18] Multi-Threading and Lock-Free MPI RMA Based Graph Processing on KNL and POWER Architectures.

Mingzhe Li, Xiaoyi Lu, Hari Subramoni, Dhabaleswar K. Panda

Proceedings of International Conference on EuroMPI, 2018

Keywords: dblp

Conference Proceedings
[ISC '18] Can Unified-Memory Support on Pascal and Volta GPUs enable Out-of-Core DNN Training? (Best Poster Paper Award)
Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of the International Supercomputing Conference, 2018 (Poster Paper)
Keywords: Unified Memory, Pascal, Volta, Out-of-Core DNN Training
Cite "Can Unified-Memory Support on Pascal and Volta GPUs enable Out-of-Core DNN Training?"
- Plain text
- BibTeX
Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Xiaoyi Lu, and Dhabaleswar K. Panda. Can unified-memory support on pascal and volta gpus enable out-of-core dnn training? In Proceedings of the International Supercomputing Conference, ISC '18. Springer, 2018. Poster Paper.
@inproceedings{conf-isc-awan18, added-at = {2018-07-23T00:00:00.000+0200}, author = {Awan, Ammar Ahmad and Chu, Ching-Hsiang and Subramoni, Hari and Lu, Xiaoyi and Panda, Dhabaleswar K.}, booktitle = {Proceedings of the International Supercomputing Conference}, crossref = {conf/supercomputer/2018}, keywords = {Unified Memory, Pascal, Volta, Out-of-Core DNN Training}, publisher = {Springer}, title = {Can Unified-Memory Support on Pascal and Volta GPUs enable Out-of-Core DNN Training?}, pdf = {https://www.isc-hpc.com/agenda2018/conferences/isc_hpc/assets/2018/posters/post127.pdf}, highlight = {Best Poster Paper Award}, series = {ISC '18}, note = {Poster Paper}, year = 2018 }

[Bench '18] A Survey on Deep Learning Benchmarks: Do We Still Need New Ones?

Qin Zhang, Li Zha, Jian Lin, Dandan Tu, Mingzhe Li, Fan Liang, Ren Wu, Xiaoyi Lu

Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2018

Keywords: dblp

Journal Article
[JPDC '18] MR-Advisor: A comprehensive tuning, profiling, and prediction tool for MapReduce execution frameworks on HPC clusters
Md. Wasi-ur-Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Dipti Shankar, Dhabaleswar K. (DK) Panda

Journal of Parallel and Distributed Computing, 2018
Keywords: MapReduce, HPC, Tuning, Prediction

Abstract: MapReduce is the most popular parallel computing framework for big data processing which allows massive scalability across distributed computing environment. Advanced RDMA-based design of Hadoop MapReduce has been proposed that alleviates the performance bottlenecks in default Hadoop MapReduce by leveraging the benefits from RDMA. On the other hand, data processing engine, Spark, provides fast execution of MapReduce applications through in-memory processing. Performance optimization for these contemporary big data processing frameworks on modern High-Performance Computing (HPC) systems is a formidable task because of the numerous configuration possibilities in each of them. In this paper, we propose MR-Advisor, a comprehensive tuning, profiling, and prediction tool for MapReduce. MR-Advisor is generalized to provide performance optimizations for Hadoop, Spark, and RDMA-enhanced Hadoop MapReduce designs over different file systems such as HDFS, Lustre, and Tachyon. Performance evaluations reveal that, with MR-Advisor’s suggested values, the job execution performance can be enhanced by a maximum of 58% over the current best-practice values for user-level configuration parameters. To the best of our knowledge, this is the first tool that supports tuning and prediction for both Apache Hadoop and Spark, as well as the RDMA and Lustre-based advanced designs.
Cite "MR-Advisor: A comprehensive tuning, profiling, and prediction tool for MapReduce execution frameworks on HPC clusters"
- Plain text
- BibTeX
Md. Wasi-ur-Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Dipti Shankar, and Dhabaleswar K. (DK) Panda. Mr-advisor: a comprehensive tuning, profiling, and prediction tool for mapreduce execution frameworks on hpc clusters. Journal of Parallel and Distributed Computing, 120:237 - 250, 2018. URL: http://www.sciencedirect.com/science/article/pii/S0743731517303027, doi:https://doi.org/10.1016/j.jpdc.2017.11.004.
@article{journals-jpdc-Wasi-ur-RahmanI18, title = "MR-Advisor: A comprehensive tuning, profiling, and prediction tool for MapReduce execution frameworks on HPC clusters", journal = "Journal of Parallel and Distributed Computing", volume = "120", pages = "237 - 250", year = "2018", issn = "0743-7315", doi = "https://doi.org/10.1016/j.jpdc.2017.11.004", url = "http://www.sciencedirect.com/science/article/pii/S0743731517303027", author = "Md. Wasi-ur-Rahman and Nusrat Sharmin Islam and Xiaoyi Lu and Dipti Shankar and Dhabaleswar K. (DK) Panda", keywords = "MapReduce, HPC, Tuning, Prediction", series = "JPDC '18", abstract = "MapReduce is the most popular parallel computing framework for big data processing which allows massive scalability across distributed computing environment. Advanced RDMA-based design of Hadoop MapReduce has been proposed that alleviates the performance bottlenecks in default Hadoop MapReduce by leveraging the benefits from RDMA. On the other hand, data processing engine, Spark, provides fast execution of MapReduce applications through in-memory processing. Performance optimization for these contemporary big data processing frameworks on modern High-Performance Computing (HPC) systems is a formidable task because of the numerous configuration possibilities in each of them. In this paper, we propose MR-Advisor, a comprehensive tuning, profiling, and prediction tool for MapReduce. MR-Advisor is generalized to provide performance optimizations for Hadoop, Spark, and RDMA-enhanced Hadoop MapReduce designs over different file systems such as HDFS, Lustre, and Tachyon. Performance evaluations reveal that, with MR-Advisor’s suggested values, the job execution performance can be enhanced by a maximum of 58\% over the current best-practice values for user-level configuration parameters.
To the best of our knowledge, this is the first tool that supports tuning and prediction for both Apache Hadoop and Spark, as well as the RDMA and Lustre-based advanced designs." }

[UCC '18] Analyzing, Modeling, and Provisioning QoS for NVMe SSDs.

Shashank Gugnani, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE/ACM International Conference on Utility and Cloud Computing, 2018

Keywords: dblp

Conference Proceedings
[NVMW '18] Accelerating MapReduce and DAG Execution Frameworks with Non-Volatile Memory and RDMA
Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dhabaleswar K. Panda

Proceedings of Annual Non-Volatile Memories Workshop, 2018 (Poster Paper)
Keywords: Non-Volatile Memory, RDMA, MapReduce, DAG
Cite "Accelerating MapReduce and DAG Execution Frameworks with Non-Volatile Memory and RDMA"
- Plain text
- BibTeX
Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, and Dhabaleswar K. Panda. Accelerating mapreduce and dag execution frameworks with non-volatile memory and rdma. In Proceedings of Annual Non-Volatile Memories Workshop, NVMW '18. 2018. Poster Paper.
@inproceedings{conf-nvmw-lu18, author = {Lu, Xiaoyi and Wasi-ur-Rahman, Md. and Islam, Nusrat and Panda, Dhabaleswar K.}, Date-Added = {2018-02-22 19:42:43 +0000}, booktitle = {Proceedings of Annual Non-Volatile Memories Workshop}, crossref = {conf/nvmw/2018}, keywords = {Non-Volatile Memory, RDMA, MapReduce, DAG}, title = {Accelerating MapReduce and DAG Execution Frameworks with Non-Volatile Memory and RDMA}, series = {NVMW '18}, note = {Poster Paper}, year = {2018} }

[Bench '18] EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures. (Best Paper Award)

Haiyang Shi, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of International Symposium on Benchmarking, Measuring, and Optimizing, 2018

Keywords: dblp

[HiPC '18] Accelerating TensorFlow with Adaptive RDMA-Based gRPC.

Rajarshi Biswas, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on High Performance Computing, 2018

Keywords: dblp

[SoCC '18] High-Performance Multi-Rail Erasure Coding Library over Modern Data Center Architectures: Early Experiences.

Haiyang Shi, Xiaoyi Lu, Dipti Shankar, Dhabaleswar K. Panda

Proceedings of ACM Symposium on Cloud Computing, 2018 (Poster Paper)

Keywords: dblp

[CLUSTER '18] Cutting the Tail: Designing High Performance Message Brokers to Reduce Tail Latencies in Stream Processing.

M. Haseeb Javed, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Cluster Computing, 2018

Keywords: dblp

[HotI '17] Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-Capable Networks.

Xiaoyi Lu, Haiyang Shi, M. Haseeb Javed, Rajarshi Biswas, Dhabaleswar K. Panda

Proceedings of Annual Symposium on High-Performance Interconnects, 2017

Keywords: dblp

[ICDCS '17] High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads.

Dipti Shankar, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Distributed Computing Systems, 2017

Keywords: dblp

Journal Article
[TPDS '17] A Comprehensive Study of MapReduce Over Lustre for Intermediate Data Placement and Shuffle Strategies on HPC Clusters
Md. Wasi-ur-Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Dhabaleswar K. Panda

IEEE Transactions on Parallel and Distributed Systems, 2017
Keywords: Big Data;distributed databases;network operating systems;parallel architectures;resource allocation;storage management;storage media;workstation clusters;intermediate data placement;HPC clusters;high performance interconnects;parallel file systems;high performance computing clusters;data analytics;Big Data;HPC technologies;local storage media;Lustre-based global storage;MapReduce over Lustre deployments;high-performance YARN MapReduce design;storage provider;intermediate data storage;priority directory selection;RDMA-enhanced MapReduce;shuffle-intensive workloads;leadership-class HPC systems;job execution;Computer architecture;Servers;High performance computing;Data analysis;Big data;Memory;Big data;high performance computing;RDMA;MapReduce;lustre

Abstract: With high performance interconnects and parallel file systems, running MapReduce over modern High Performance Computing (HPC) clusters has attracted much attention due to its uniqueness of solving data analytics problems with a combination of Big Data and HPC technologies. Since the MapReduce architecture relies heavily on the availability of local storage media, the Lustre-based global storage in HPC clusters poses many new opportunities and challenges. In this paper, we perform a comprehensive study on different MapReduce over Lustre deployments and propose a novel high-performance design of YARN MapReduce on HPC clusters by utilizing Lustre as the additional storage provider for intermediate data. With a deployment architecture where both local disks and Lustre are utilized for intermediate data storage, we propose a novel priority directory selection scheme through which RDMA-enhanced MapReduce can choose the best intermediate storage during runtime by on-line profiling. Our results indicate that we can achieve 44 percent performance benefit for shuffle-intensive workloads in leadership-class HPC systems. Our priority directory selection scheme can improve the job execution time by 63 percent over default MapReduce while executing multiple concurrent jobs. To the best of our knowledge, this is the first such comprehensive study for YARN MapReduce with Lustre and RDMA.
Cite "A Comprehensive Study of MapReduce Over Lustre for Intermediate Data Placement and Shuffle Strategies on HPC Clusters"
- Plain text
- BibTeX
Md. Wasi-ur-Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, and Dhabaleswar K. Panda. A comprehensive study of mapreduce over lustre for intermediate data placement and shuffle strategies on hpc clusters. IEEE Transactions on Parallel and Distributed Systems, 28(3):633-646, March 2017. doi:10.1109/TPDS.2016.2591947.
@ARTICLE{journals-tpds-Wasi-ur-RahmanI17, author={Wasi-ur-Rahman, Md. and Islam, Nusrat Sharmin and Lu, Xiaoyi and Panda, Dhabaleswar K.}, journal={IEEE Transactions on Parallel and Distributed Systems}, title={A Comprehensive Study of MapReduce Over Lustre for Intermediate Data Placement and Shuffle Strategies on HPC Clusters}, year={2017}, volume={28}, number={3}, pages={633-646}, abstract={With high performance interconnects and parallel file systems, running MapReduce over modern High Performance Computing (HPC) clusters has attracted much attention due to its uniqueness of solving data analytics problems with a combination of Big Data and HPC technologies. Since the MapReduce architecture relies heavily on the availability of local storage media, the Lustre-based global storage in HPC clusters poses many new opportunities and challenges. In this paper, we perform a comprehensive study on different MapReduce over Lustre deployments and propose a novel high-performance design of YARN MapReduce on HPC clusters by utilizing Lustre as the additional storage provider for intermediate data. With a deployment architecture where both local disks and Lustre are utilized for intermediate data storage, we propose a novel priority directory selection scheme through which RDMA-enhanced MapReduce can choose the best intermediate storage during runtime by on-line profiling. Our results indicate that we can achieve 44 percent performance benefit for shuffle-intensive workloads in leadership-class HPC systems. Our priority directory selection scheme can improve the job execution time by 63 percent over default MapReduce while executing multiple concurrent jobs.
To the best of our knowledge, this is the first such comprehensive study for YARN MapReduce with Lustre and RDMA.}, keywords={Big Data;distributed databases;network operating systems;parallel architectures;resource allocation;storage management;storage media;workstation clusters;intermediate data placement;HPC clusters;high performance interconnects;parallel file systems;high performance computing clusters;data analytics;Big Data;HPC technologies;local storage media;Lustre-based global storage;MapReduce over Lustre deployments;high-performance YARN MapReduce design;storage provider;intermediate data storage;priority directory selection;RDMA-enhanced MapReduce;shuffle-intensive workloads;leadership-class HPC systems;job execution;Computer architecture;Servers;High performance computing;Data analysis;Big data;Memory;Big data;high performance computing;RDMA;MapReduce;lustre}, doi={10.1109/TPDS.2016.2591947}, ISSN={1558-2183}, series = {TPDS '17}, month={March},}

[BigData '17] Performance characterization and acceleration of big data workloads on OpenPOWER system.

Xiaoyi Lu, Haiyang Shi, Dipti Shankar, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Big Data, 2017

Keywords: dblp

[HiPC '17] MPI-LiFE: Designing High-Performance Linear Fascicle Evaluation of Brain Connectome with MPI.

Shashank Gugnani, Xiaoyi Lu, Franco Pestilli, Cesar F. Caiafa, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on High Performance Computing, 2017

Keywords: dblp

[BDCAT '17] Characterization of Big Data Stream Processing Pipeline: A Case Study using Flink and Kafka.

M. Haseeb Javed, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE/ACM International Conference on Big Data Computing, Applications, and Technologies, 2017

Keywords: dblp

[SC '17] Scalable reduction collectives with data partitioning-based multi-leader design.

Mohammadreza Bayatpour, Sourav Chakraborty, Hari Subramoni, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis, 2017

Keywords: dblp

[VEE '17] Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand.

Jie Zhang, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 2017

Keywords: dblp

[CLUSTER '17] A Scalable Network-Based Performance Analysis Tool for MPI on Large-Scale HPC Systems.

Hari Subramoni, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Cluster Computing, 2017 (Short Paper)

Keywords: dblp

[BigData '17] NVMD: Non-volatile memory assisted design for accelerating MapReduce and DAG execution frameworks on HPC systems.

Md. Wasi ur Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Big Data, 2017 (Short Paper)

Keywords: dblp

Conference Proceedings
[NVMW '17] NRCIO: NVM-aware RDMA-based Communication and I/O Schemes for Big Data Analytics
Xiaoyi Lu, Nusrat Islam, Md. Wasi-ur-Rahman, Dhabaleswar K. Panda

Proceedings of Annual Non-Volatile Memories Workshop, 2017 (Position Paper)
Keywords: Non-Volatile Memory, RDMA, Big Data
Cite "NRCIO: NVM-aware RDMA-based Communication and I/O Schemes for Big Data Analytics"
- Plain text
- BibTeX
Xiaoyi Lu, Nusrat Islam, Md. Wasi-ur-Rahman, and Dhabaleswar K. Panda. Nrcio: nvm-aware rdma-based communication and i/o schemes for big data analytics. In Proceedings of Annual Non-Volatile Memories Workshop, NVMW '17. 2017. Position Paper.
@inproceedings{conf-nvmw-lu17, author = {Lu, Xiaoyi and Islam, Nusrat and Wasi-ur-Rahman, Md. and Panda, Dhabaleswar K.}, Date-Added = {2017-02-22 19:42:43 +0000}, booktitle = {Proceedings of Annual Non-Volatile Memories Workshop}, crossref = {conf/nvmw/2017}, keywords = {Non-Volatile Memory, RDMA, Big Data}, title = {NRCIO: NVM-aware RDMA-based Communication and I/O Schemes for Big Data Analytics}, series = {NVMW '17}, note = {Position Paper}, year = {2017} }

[Chapter '17] Building Efficient HPC Cloud with SR-IOV-Enabled InfiniBand: The MVAPICH2 Approach.

Xiaoyi Lu, Jie Zhang, Dhabaleswar K. Panda

Research Advances in Cloud Computing, 2017

Keywords: dblp

[UCC '17] Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds? (Best Student Paper Award)

Jie Zhang, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE/ACM International Conference on Utility and Cloud Computing, 2017

Keywords: dblp

[BigData '17] Characterizing and accelerating indexing techniques on distributed ordered tables.

Shashank Gugnani, Xiaoyi Lu, Houliang Qi, Li Zha, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Big Data, 2017

Keywords: dblp

[ICPP '17] Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning.

Ching-Hsiang Chu, Xiaoyi Lu, Ammar Ahmad Awan, Hari Subramoni, Jahanzeb Maqbool Hashmi, Bracy Elton, Dhabaleswar K. Panda

Proceedings of International Conference on Parallel Processing, 2017

Keywords: dblp

[TCDE '17] Scalable and Distributed Key-Value Store-based Data Management Using RDMA-Memcached.

Xiaoyi Lu, Dipti Shankar, Dhabaleswar K. Panda

IEEE Data Eng. Bull., Bulletin of the Technical Committee on Data Engineering, 2017 (Invited Paper)

Keywords: dblp

[CCGrid '17] Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud.

Shashank Gugnani, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2017

Keywords: dblp

[UCC '17] HPC Meets Cloud: Building Efficient Clouds for HPC, Big Data, and Deep Learning Middleware and Applications.

Dhabaleswar K. Panda, Xiaoyi Lu

Proceedings of IEEE/ACM International Conference on Utility and Cloud Computing, 2017

Keywords: dblp

[HiPC '17] Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand.

Mingzhe Li, Xiaoyi Lu, Hari Subramoni, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on High Performance Computing, 2017

Keywords: dblp

[IPDPS '17] High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV Enabled InfiniBand Clusters.

Jie Zhang, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2017

Keywords: dblp

[BDCAT '16] Performance characterization of Hadoop workloads on SR-IOV-enabled virtualized InfiniBand clusters.

Shashank Gugnani, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE/ACM International Conference on Big Data Computing, Applications, and Technologies, 2016

Keywords: dblp

[SBAC-PAD '16] MR-Advisor: A Comprehensive Tuning Tool for Advising HPC Users to Accelerate MapReduce Applications on Supercomputers.

Md. Wasi ur Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Dipti Shankar, Dhabaleswar K. Panda

Proceedings of International Symposium on Computer Architecture and High Performance Computing, 2016

Keywords: dblp

[BigData '16] High-performance design of Apache Spark with RDMA and its benefits on various workloads.

Xiaoyi Lu, Dipti Shankar, Shashank Gugnani, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Big Data, 2016

Keywords: dblp

[IPDPS '16] High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits.

Dipti Shankar, Xiaoyi Lu, Nusrat S. Islam, Md. Wasi ur Rahman, Dhabaleswar K. Panda

Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2016

Keywords: dblp

[CloudCom '16] Designing Virtualization-Aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-Enabled Clouds.

Shashank Gugnani, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Cloud Computing Technology and Science, 2016

Keywords: dblp

[Euro-Par '16] Slurm-V: Extending Slurm for Building Efficient HPC Cloud with SR-IOV and IVShmem.

Jie Zhang, Xiaoyi Lu, Sourav Chakraborty, Dhabaleswar K. Panda

Proceedings of International European Conference on Parallel Processing, 2016

Keywords: dblp

[ICPP '16] High Performance MPI Library for Container-Based HPC Cloud on InfiniBand Clusters.

Jie Zhang, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of International Conference on Parallel Processing, 2016

Keywords: dblp

[SC '16] Designing MPI library with on-demand paging (ODP) of InfiniBand: challenges and benefits.

Mingzhe Li, Khaled Hamidouche, Xiaoyi Lu, Hari Subramoni, Jie Zhang, Dhabaleswar K. Panda

Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis, 2016

Keywords: dblp

[ICS '16] High Performance Design for HDFS with Byte-Addressability of NVM and RDMA.

Nusrat Sharmin Islam, Md. Wasi ur Rahman, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of International Conference on Supercomputing, 2016

Keywords: dblp

[JSC '16] Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters

Dipti Shankar, Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dhabaleswar K. Panda

The Journal of Supercomputing, 2016

Abstract: With the emergence of high-performance data analytics, the Hadoop platform is being increasingly used to process data stored on high-performance computing clusters. While there is immense scope for improving the performance of Hadoop MapReduce (including the network-intensive shuffle phase) over these modern clusters, that are equipped with high-speed interconnects such as InfiniBand and 10/40 GigE, and storage systems such as SSDs and Lustre, it is essential to study the MapReduce component in an isolated manner. In this paper, we study popular MapReduce workloads, obtained from well-accepted, comprehensive benchmark suites, to identify common shuffle data distribution patterns. We determine different environmental and workload-specific factors that affect the performance of the MapReduce job. Based on these characterization studies, we propose a micro-benchmark suite that can be used to evaluate the performance of stand-alone Hadoop MapReduce, and demonstrate its ease-of-use with different networks/protocols, Hadoop distributions, and storage architectures. Performance evaluations with our proposed micro-benchmarks show that stand-alone Hadoop MapReduce over IPoIB performs better than 10 GigE by about 13–15%

Conference Proceedings
[BPOE '16] Characterizing Cloudera Impala Workloads with BigDataBench on InfiniBand Clusters
Kunal Kulkarni, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of The 7th International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE-7), 2016
Cite "Characterizing Cloudera Impala Workloads with BigDataBench on InfiniBand Clusters"
- Plain text
- BibTeX
Kunal Kulkarni, Xiaoyi Lu, and Dhabaleswar K. Panda. Characterizing cloudera impala workloads with bigdatabench on infiniband clusters. In Proceedings of The 7th International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE-7), BPOE '16. April 2016.
@inproceedings{conf-bpoe-kunal, author={Kulkarni, Kunal and Lu, Xiaoyi and Panda, Dhabaleswar K.}, title={Characterizing Cloudera Impala Workloads with BigDataBench on InfiniBand Clusters}, booktitle={Proceedings of The 7th International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE-7)}, year={2016}, month={April}, location={Atlanta, Georgia}, series = {BPOE '16} }

[BigData '16] Boldio: A hybrid and resilient burst-buffer over Lustre for accelerating big data I/O.

Dipti Shankar, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Big Data, 2016 (Short Paper)

Keywords: dblp

[CloudCom '16] Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase.

Xiaoyi Lu, Dipti Shankar, Shashank Gugnani, Hari Subramoni, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Cloud Computing Technology and Science, 2016

Keywords: dblp

[XSEDE '16] Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet.

Mahidhar Tatineni, Xiaoyi Lu, Dong Ju Choi, Amitava Majumdar, Dhabaleswar K. Panda

Proceedings of Annual Conference on Extreme Science and Engineering Discovery Environment, 2016

Keywords: dblp

[BigData '16] Efficient data access strategies for Hadoop and Spark on HPC cluster with heterogeneous storage.

Nusrat Sharmin Islam, Md. Wasi ur Rahman, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Big Data, 2016

Keywords: dblp

Book Chapter
[Chapter '16] Accelerating Big Data Processing on Modern HPC Clusters
Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dipti Shankar, Dhabaleswar K. (DK) Panda

Conquering Big Data with High Performance Computing, 2016
Abstract: Modern HPC systems and the associated middleware (such as MPI and parallel file systems) have been exploiting the advances in HPC technologies (multi-/many-core architecture, RDMA-enabled networking, and SSD) for many years. However, Big Data processing and management middleware have not fully taken advantage of such technologies. These disparities are taking HPC and Big Data processing into divergent trajectories. This chapter provides an overview of popular Big Data processing middleware, high-performance interconnects and storage architectures, and discusses the challenges in accelerating Big Data processing middleware by leveraging emerging technologies on modern HPC clusters. This chapter presents case studies of advanced designs based on RDMA and heterogeneous storage architecture, that were proposed to address these challenges for multiple components of Hadoop (HDFS and MapReduce) and Spark. The advanced designs presented in the case studies are publicly available as a part of the High-Performance Big Data (HiBD) project. An overview of the HiBD project is also provided in this chapter. All these works aim to bring HPC and Big Data processing into a convergent trajectory.
Cite "Accelerating Big Data Processing on Modern HPC Clusters"
- Plain text
- BibTeX
Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dipti Shankar, and Dhabaleswar K. (DK) Panda. Accelerating big data processing on modern hpc clusters. In Ritu Arora, editor, Conquering Big Data with High Performance Computing, Chapter '16, pages 81–107. Springer International Publishing, Cham, 2016. URL: https://doi.org/10.1007/978-3-319-33742-5_5, doi:10.1007/978-3-319-33742-5_5.
@incollection{books-sp-16-Lu0P16, author="Lu, Xiaoyi and Wasi-ur-Rahman, Md. and Islam, Nusrat and Shankar, Dipti and Panda, Dhabaleswar K. (DK)", editor="Arora, Ritu", title="Accelerating Big Data Processing on Modern HPC Clusters", bookTitle="Conquering Big Data with High Performance Computing", year="2016", publisher="Springer International Publishing", address="Cham", pages="81--107", abstract="Modern HPC systems and the associated middleware (such as MPI and parallel file systems) have been exploiting the advances in HPC technologies (multi-/many-core architecture, RDMA-enabled networking, and SSD) for many years. However, Big Data processing and management middleware have not fully taken advantage of such technologies. These disparities are taking HPC and Big Data processing into divergent trajectories. This chapter provides an overview of popular Big Data processing middleware, high-performance interconnects and storage architectures, and discusses the challenges in accelerating Big Data processing middleware by leveraging emerging technologies on modern HPC clusters. This chapter presents case studies of advanced designs based on RDMA and heterogeneous storage architecture, that were proposed to address these challenges for multiple components of Hadoop (HDFS and MapReduce) and Spark. The advanced designs presented in the case studies are publicly available as a part of the High-Performance Big Data (HiBD) project. An overview of the HiBD project is also provided in this chapter. All these works aim to bring HPC and Big Data processing into a convergent trajectory.", isbn="978-3-319-33742-5", doi="10.1007/978-3-319-33742-5_5", series="Chapter '16", url="https://doi.org/10.1007/978-3-319-33742-5_5" }

[IPDPS Workshops '16] Performance Characterization of Hypervisor- and Container-Based Virtualization for HPC on SR-IOV Enabled InfiniBand Clusters.

Jie Zhang, Xiaoyi Lu, Dhabaleswar K. Panda

Workshop Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2016

Keywords: dblp

[ISC '16] INAM2: InfiniBand Network Analysis and Monitoring with MPI.

Hari Subramoni, Albert Mathews Augustine, Mark Daniel Arnold, Jonathan L. Perkins, Xiaoyi Lu, Khaled Hamidouche, Dhabaleswar K. Panda

Proceedings of the International Supercomputing Conference, 2016

Keywords: dblp

[HiPC '16] Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA.

Mingzhe Li, Xiaoyi Lu, Khaled Hamidouche, Jie Zhang, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on High Performance Computing, 2016

Keywords: dblp

[PDSW-DISCS '16] Can Non-volatile Memory Benefit MapReduce Applications on HPC Clusters?

Md. Wasi ur Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, 2016

Keywords: dblp

[CCGRID '15] MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds.

Jie Zhang, Xiaoyi Lu, Mark Daniel Arnold, Dhabaleswar K. Panda

Proceedings of IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2015

Keywords: dblp

[CLUSTER '15] High Performance MPI Datatype Support with User-Mode Memory Registration: Challenges, Designs, and Benefits.

Mingzhe Li, Hari Subramoni, Khaled Hamidouche, Xiaoyi Lu, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Cluster Computing, 2015

Keywords: dblp

[CCGRID '15] Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture.

Nusrat Sharmin Islam, Xiaoyi Lu, Md. Wasi ur Rahman, Dipti Shankar, Dhabaleswar K. Panda

Proceedings of IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2015

Keywords: dblp

[BigData '15] Benchmarking key-value stores on high-performance storage and interconnects for web-scale workloads.

Dipti Shankar, Xiaoyi Lu, Md. Wasi ur Rahman, Nusrat S. Islam, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Big Data, 2015 (Short Paper)

Keywords: dblp

[ICDCS '15] Accelerating Apache Hive with MPI for Data Warehouse Systems.

Lu Chao, Chundian Li, Fan Liang, Xiaoyi Lu, Zhiwei Xu

Proceedings of IEEE International Conference on Distributed Computing Systems, 2015

Keywords: dblp

[BigDataService '15] Modeling and Designing Fault-Tolerance Mechanisms for MPI-Based MapReduce Data Computing Framework.

Jian Lin, Fan Liang, Xiaoyi Lu, Li Zha, Zhiwei Xu

Proceedings of IEEE International Conference on Big Data Computing Service and Applications, 2015 (Short Paper)

Keywords: dblp

[JCST '15] Accelerating Iterative Big Data Computing Through MPI

Fan Liang, Xiaoyi Lu

Journal of Computer Science and Technology, 2015

Abstract: Current popular systems, Hadoop and Spark, cannot achieve satisfactory performance because of the inefficient overlapping of computation and communication when running iterative big data applications. The pipeline of computing, data movement, and data management plays a key role for current distributed data computing systems. In this paper, we first analyze the overhead of shuffle operation in Hadoop and Spark when running PageRank workload, and then propose an event-driven pipeline and in-memory shuffle design with better overlapping of computation and communication as DataMPI-Iteration, an MPI-based library, for iterative big data computing. Our performance evaluation shows DataMPI-Iteration can achieve 9X∼21X speedup over Apache Hadoop, and 2X∼3X speedup over Apache Spark for PageRank and K-means.

[IPDPS Workshops '15] High-Performance Coarray Fortran Support with MVAPICH2-X: Initial Experience and Evaluation.

Jian Lin, Khaled Hamidouche, Xiaoyi Lu, Mingzhe Li, Dhabaleswar K. Panda

Workshop Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2015

Keywords: dblp

[OpenSHMEM '15] Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM.

Jian Lin, Khaled Hamidouche, Jie Zhang, Xiaoyi Lu, Abhinav Vishnu, Dhabaleswar K. Panda

Proceedings of Workshop on OpenSHMEM and Related Technologies, 2015

Keywords: dblp

[Euro-Par '15] High-Performance and Scalable Design of MPI-3 RMA on Xeon Phi Clusters.

Mingzhe Li, Khaled Hamidouche, Xiaoyi Lu, Jian Lin, Dhabaleswar K. Panda

Proceedings of International European Conference on Parallel Processing, 2015

Keywords: dblp

[HiPC '15] High Performance OpenSHMEM Strided Communication Support with InfiniBand UMR.

Mingzhe Li, Khaled Hamidouche, Xiaoyi Lu, Jie Zhang, Jian Lin, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on High Performance Computing, 2015

Keywords: dblp

[BPOE '15] A Plugin-Based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS.

Adithya Bhat, Nusrat Sharmin Islam, Xiaoyi Lu, Md. Wasi ur Rahman, Dipti Shankar, Dhabaleswar K. Panda

Proceedings of International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2015

Keywords: dblp

[IPDPS '15] High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA.

Md. Wasi ur Rahman, Xiaoyi Lu, Nusrat Sharmin Islam, Raghunath Rajachandrasekar, Dhabaleswar K. Panda

Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2015

Keywords: dblp

[ICPP '15] Accelerating I/O Performance of Big Data Analytics on HPC Clusters through RDMA-Based Key-Value Store.

Nusrat Sharmin Islam, Dipti Shankar, Xiaoyi Lu, Md. Wasi ur Rahman, Dhabaleswar K. Panda

Proceedings of International Conference on Parallel Processing, 2015

Keywords: dblp

[ISPASS '15] Can RDMA benefit online data processing workloads on memcached and MySQL?

Dipti Shankar, Xiaoyi Lu, Jithin Jose, Md. Wasi ur Rahman, Nusrat S. Islam, Dhabaleswar K. Panda

Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software, 2015 (Poster Paper)

Keywords: dblp

[BigData '15] Performance characterization and acceleration of in-memory file systems for Hadoop and Spark applications on HPC clusters.

Nusrat Sharmin Islam, Md. Wasi ur Rahman, Xiaoyi Lu, Dipti Shankar, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Big Data, 2015

Keywords: dblp

[BigData '14] In-memory I/O and replication for HDFS with Memcached: Early experiences.

Nusrat Sharmin Islam, Xiaoyi Lu, Md. Wasi ur Rahman, Raghunath Rajachandrasekar, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Big Data, 2014 (Short Paper)

Keywords: dblp

[PGAS '14] Scalable MiniMD Design with Hybrid MPI and OpenSHMEM.

Mingzhe Li, Jian Lin, Xiaoyi Lu, Khaled Hamidouche, Karen Tomko, Dhabaleswar K. Panda

Proceedings of International Conference on Partitioned Global Address Space Programming Models, 2014

Keywords: dblp

[PPoPP '14] Initial study of multi-endpoint runtime for MPI+OpenMP hybrid programming model on multi-core systems.

Miao Luo, Xiaoyi Lu, Khaled Hamidouche, Krishna Chaitanya Kandalla, Dhabaleswar K. Panda

Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014 (Poster Paper)

Keywords: dblp

[Euro-Par '14] Can Inter-VM Shmem Benefit MPI Applications on SR-IOV Based Virtualized InfiniBand Clusters?

Jie Zhang, Xiaoyi Lu, Jithin Jose, Rong Shi, Dhabaleswar K. Panda

Proceedings of International European Conference on Parallel Processing, 2014

Keywords: dblp

[HiPC '14] High performance MPI library over SR-IOV enabled InfiniBand clusters.

Jie Zhang, Xiaoyi Lu, Jithin Jose, Mingzhe Li, Rong Shi, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on High Performance Computing, 2014

Keywords: dblp

[NAS '14] Performance Characterization of Hadoop and DataMPI Based on Amdahl's Second Law.

Fan Liang, Chen Feng, Xiaoyi Lu, Zhiwei Xu

Proceedings of IEEE International Conference on Networking, Architecture, and Storage, 2014

Keywords: dblp

[HotI '14] Accelerating Spark with RDMA for Big Data Processing: Early Experiences.

Xiaoyi Lu, Md. Wasi ur Rahman, Nusrat S. Islam, Dipti Shankar, Dhabaleswar K. Panda

Proceedings of Annual Symposium on High-Performance Interconnects, 2014

Keywords: dblp

[CLUSTER '14] High performance OpenSHMEM for Xeon Phi clusters: Extensions, runtime designs and application co-design. (Best Paper Award Nomination)

Jithin Jose, Khaled Hamidouche, Xiaoyi Lu, Sreeram Potluri, Jie Zhang, Karen Tomko, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Cluster Computing, 2014

Keywords: dblp

[HPDC '14] SOR-HDFS: a SEDA-based approach to maximize overlapping in RDMA-enhanced HDFS.

Nusrat S. Islam, Xiaoyi Lu, Md. Wasi ur Rahman, Dhabaleswar K. Panda

Proceedings of International ACM Symposium on High Performance and Distributed Computing, 2014 (Short Paper)

Keywords: dblp

[ICPP '14] HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters.

Rong Shi, Xiaoyi Lu, Sreeram Potluri, Khaled Hamidouche, Jie Zhang, Dhabaleswar K. Panda

Proceedings of International Conference on Parallel Processing, 2014

Keywords: dblp

Conference Proceedings
[BPOE '14] On Big Data Benchmarking
Rui Han, Xiaoyi Lu, Jiangtao Xu

Proceedings of Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2014
- Abstract
- Citation
Abstract: Big data systems address the challenges of capturing, storing, managing, analyzing, and visualizing big data. Within this context, developing benchmarks to evaluate and compare big data systems has become an active topic for both research and industry communities. To date, most of the state-of-the-art big data benchmarks are designed for specific types of systems. Based on our experience, however, we argue that considering the complexity, diversity, and rapid evolution of big data systems, for the sake of fairness, big data benchmarks must include diversity of data and workloads. Given this motivation, in this paper, we first propose the key requirements and challenges in developing big data benchmarks from the perspectives of generating data with 4 V properties (i.e. volume, velocity, variety and veracity) of big data, as well as generating tests with comprehensive workloads for big data systems. We then present the methodology on big data benchmarking designed to address these challenges. Next, the state-of-the-art are summarized and compared, followed by our vision for future research directions.
Cite "On Big Data Benchmarking"
- Plain text
- BibTeX
Rui Han, Xiaoyi Lu, and Jiangtao Xu. On big data benchmarking. In Proceedings of Big Data Benchmarks, Performance Optimization, and Emerging Hardware, BPOE '14, 3–18. Cham, 2014. Springer International Publishing.
@InProceedings{conf-bpoe-han14, author="Han, Rui and Lu, Xiaoyi and Xu, Jiangtao", title="On Big Data Benchmarking", booktitle="Proceedings of Big Data Benchmarks, Performance Optimization, and Emerging Hardware", year="2014", publisher="Springer International Publishing", address="Cham", pages="3--18", abstract="Big data systems address the challenges of capturing, storing, managing, analyzing, and visualizing big data. Within this context, developing benchmarks to evaluate and compare big data systems has become an active topic for both research and industry communities. To date, most of the state-of-the-art big data benchmarks are designed for specific types of systems. Based on our experience, however, we argue that considering the complexity, diversity, and rapid evolution of big data systems, for the sake of fairness, big data benchmarks must include diversity of data and workloads. Given this motivation, in this paper, we first propose the key requirements and challenges in developing big data benchmarks from the perspectives of generating data with 4 V properties (i.e. volume, velocity, variety and veracity) of big data, as well as generating tests with comprehensive workloads for big data systems. We then present the methodology on big data benchmarking designed to address these challenges. Next, the state-of-the-art are summarized and compared, followed by our vision for future research directions.", isbn="978-3-319-13021-7", series = "BPOE '14" }

[BPOE '14] Performance Benefits of DataMPI: A Case Study with BigDataBench.

Fan Liang, Chen Feng, Xiaoyi Lu, Zhiwei Xu

Proceedings of International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2014

Keywords: dblp

[ICPP '14] Performance Modeling for RDMA-Enhanced Hadoop MapReduce.

Md. Wasi ur Rahman, Xiaoyi Lu, Nusrat Sharmin Islam, Dhabaleswar K. Panda

Proceedings of International Conference on Parallel Processing, 2014

Keywords: dblp

[CLUSTER '14] Scalable Graph500 design with MPI-3 RMA.

Mingzhe Li, Xiaoyi Lu, Sreeram Potluri, Khaled Hamidouche, Jithin Jose, Karen Tomko, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Cluster Computing, 2014

Keywords: dblp

[Euro-Par '14] MapReduce over Lustre: Can RDMA-Based Approach Benefit?

Md. Wasi ur Rahman, Xiaoyi Lu, Nusrat Sharmin Islam, Raghunath Rajachandrasekar, Dhabaleswar K. Panda

Proceedings of International European Conference on Parallel Processing, 2014

Keywords: dblp

[IPDPS '14] DataMPI: Extending MPI to Hadoop-Like Big Data Computing.

Xiaoyi Lu, Fan Liang, Bing Wang, Li Zha, Zhiwei Xu

Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2014

Keywords: dblp

[ICS '14] HOMR: a hybrid approach to exploit maximum overlapping in MapReduce over high performance interconnects.

Md. Wasi ur Rahman, Xiaoyi Lu, Nusrat Sharmin Islam, Dhabaleswar K. Panda

Proceedings of International Conference on Supercomputing, 2014

Keywords: dblp

[BPOE '14] A Micro-benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks.

Dipti Shankar, Xiaoyi Lu, Md. Wasi ur Rahman, Nusrat S. Islam, Dhabaleswar K. Panda

Proceedings of International Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2014

Keywords: dblp

[PGAS '14] Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models.

Jithin Jose, Sreeram Potluri, Hari Subramoni, Xiaoyi Lu, Khaled Hamidouche, Karl W. Schulz, Hari Sundar, Dhabaleswar K. Panda

Proceedings of International Conference on Partitioned Global Address Space Programming Models, 2014

Keywords: dblp

[WBDB '13] A Micro-benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks.

Xiaoyi Lu, Md. Wasi ur Rahman, Nusrat Sharmin Islam, Dhabaleswar K. Panda

Proceedings of the 3rd Workshop on Big Data Benchmarking, 2013

Keywords: dblp

[IPDPS Workshops '13] High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand.

Md. Wasi ur Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Jithin Jose, Hari Subramoni, Hao Wang, Dhabaleswar K. Panda

Workshop Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2013

Keywords: dblp

[SoCC '13] Does RDMA-based enhanced Hadoop MapReduce need a new performance model?

Md. Wasi ur Rahman, Xiaoyi Lu, Nusrat S. Islam, Dhabaleswar K. Panda

Proceedings of ACM Symposium on Cloud Computing, 2013 (Poster Paper)

Keywords: dblp

[HotI '13] Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?

Nusrat S. Islam, Xiaoyi Lu, Md. Wasi ur Rahman, Dhabaleswar K. Panda

Proceedings of Annual Symposium on High-Performance Interconnects, 2013 (Short Paper)

Keywords: dblp

[ICPP '13] High-Performance Design of Hadoop RPC with RDMA over InfiniBand.

Xiaoyi Lu, Nusrat S. Islam, Md. Wasi ur Rahman, Jithin Jose, Hari Subramoni, Hao Wang, Dhabaleswar K. Panda

Proceedings of International Conference on Parallel Processing, 2013

Keywords: dblp

[CCGRID '13] SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience. (Best Presentation Award Nomination)

Jithin Jose, Mingzhe Li, Xiaoyi Lu, Krishna Chaitanya Kandalla, Mark Daniel Arnold, Dhabaleswar K. Panda

Proceedings of IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2013

Keywords: dblp

[CLUSTER '13] A scalable and portable approach to accelerate hybrid HPL on heterogeneous CPU-GPU clusters. (Best Student Paper Award)

Rong Shi, Sreeram Potluri, Khaled Hamidouche, Xiaoyi Lu, Karen Tomko, Dhabaleswar K. Panda

Proceedings of IEEE International Conference on Cluster Computing, 2013

Keywords: dblp

[WBDB '12] A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters.

Nusrat Sharmin Islam, Xiaoyi Lu, Md. Wasi ur Rahman, Jithin Jose, Dhabaleswar K. Panda

Proceedings of the 3rd Workshop on Big Data Benchmarking, 2012

Keywords: dblp

[ISPA '11] Vega LingCloud: A Resource Single Leasing Point System to Support Heterogeneous Application Modes on Shared Infrastructure. (Best Paper Award)

Xiaoyi Lu, Jian Lin, Li Zha, Zhiwei Xu

Proceedings of IEEE International Symposium on Parallel and Distributed Processing with Applications, 2011

Keywords: dblp

[ICPP Workshops '11] Can MPI Benefit Hadoop and MapReduce Applications?

Xiaoyi Lu, Bing Wang, Li Zha, Zhiwei Xu

Workshop Proceedings of International Conference on Parallel Processing, 2011

Keywords: dblp

[NAS '10] VegaWarden: A Uniform User Management System for Cloud Applications.

Jian Lin, Xiaoyi Lu, Lin Yu, Yongqiang Zou, Li Zha

Proceedings of IEEE International Conference on Networking, Architecture, and Storage, 2010

Keywords: dblp

[SERVICES '10] Investigating, Modeling, and Ranking Interface Complexity of Web Services on the World Wide Web.

Xiaoyi Lu, Jian Lin, Yongqiang Zou, Juan Peng, Xingwu Liu, Li Zha

Proceedings of World Congress on Services, 2010

Keywords: dblp

[NPC '10] JAMILA: A Usable Batch Job Management System to Coordinate Heterogeneous Clusters and Diverse Applications over Grid or Cloud Infrastructure.

Juan Peng, Xiaoyi Lu, Boqun Cheng, Li Zha

Proceedings of IFIP International Conference on Network and Parallel Computing, 2010

Keywords: dblp

[SERVICES '09] A Model of Message-Based Debugging Facilities for Web or Grid Services.

Qiang Yue, Xiaoyi Lu, Zhiguang Shan, Zhiwei Xu, Haiyan Yu, Li Zha

Proceedings of World Congress on Services, 2009

Keywords: dblp

[PDCAT '09] ICOMC: Invocation Complexity Of Multi-Language Clients for Classified Web Services and its Impact on Large Scale SOA Applications.

Xiaoyi Lu, Yongqiang Zou, Fei Xiong, Jian Lin, Li Zha

Proceedings of International Conference on Parallel and Distributed Computing, Applications, and Technologies, 2009

Keywords: dblp

[PDCAT '08] An Experimental Analysis for Memory Usage of GOS Core.

Xiaoyi Lu, Qiang Yue, Yongqiang Zou, Xiaoning Wang

Proceedings of International Conference on Parallel and Distributed Computing, Applications, and Technologies, 2008 (Short Paper)

Keywords: dblp