Short Bio
Dr. Xiaoyi Lu is an Assistant Professor in the Department of Computer Science
and Engineering at the University of California, Merced, USA. He is the founder
and director of the Parallel and Distributed Systems Laboratory (PADSYS).
Previously (2018-2020), he was a Research Assistant Professor at The Ohio State
University (OSU). His current research interests include parallel and
distributed computing, high-performance interconnects, advanced I/O
technologies, big data analytics, virtualization, cloud computing, and deep
learning system software. He has published more than 100 papers in major
international conferences, workshops, and journals, with multiple Best
(Student) Paper Awards or Nominations. He has delivered more than 100 invited
talks, tutorials, and presentations worldwide, and has been actively involved
in professional activities for academic journals and conferences. Many of
Dr. Lu's research outcomes (e.g., PMIdioBench; RDMA for Hadoop, Spark,
TensorFlow, Memcached, and Kafka; MVAPICH2-Virt; DataMPI; LingCloud; NeuroHPC)
are publicly available to the community and are currently used by hundreds of
organizations all over the world. Please see his main personal page for more
information.
Research Interests
My research interests include:
- Parallel and Distributed Computing
- Systems for HPC, Big Data, AI, Cloud Computing, Edge Computing, and others
- High-Performance Communication and I/O Technologies (e.g., RDMA/PMEM/NVMe)
- Container- and Hypervisor-based Virtualization
- Performance, Scalability, Fault Tolerance, QoS, and others
Publications
- [IPDPS '24] An Optimized Error-controlled MPI Collective Framework Integrated with Lossy Compression
- [IEEE Micro '24] High-Speed Data Communication with Advanced Networks in Large Language Model Training
- [IEEE Micro '24] Compression Analysis for BlueField-2/3 Data Processing Units: Lossy and Lossless Perspectives
- [IPDPS '24] Accelerating Lossy and Lossless Compression on Emerging BlueField DPU Architectures
- [ICS '24] gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
- [IPDPS '24] DRUTO: Upper-Bounding Silent Data Corruption Vulnerability in GPU Applications
- [IPDPS '24] NVMe-oPF: Designing Efficient Priority Schemes for NVMe-over-Fabrics with Multi-Tenancy Support
- [HotI '23] Performance Characterization of Large Language Models on High-Speed Interconnects
- [ModSim '23] LogGOPSGauger: A Work-In-Progress Tool for Gauging LogGOPS Model with GPU-Aware Communication
- [MLSys Workshop '23] Learning Distributed Protocols with Zero Knowledge
- [IPDPS '23] SBGT: Scaling Bayesian-based Group Testing for Disease Surveillance
- [JCST '23] xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning
- [SC '23] Early Experience in Characterizing Training Large Language Models on Modern HPC Clusters
- [SC '23] Characterizing One-/Two-sided Designs in OpenSHMEM Collectives
- [ModSim '23] Early Experiences in Modeling Performance Implications of DPU-Offloaded Computation
- [SC '23] An Early Case Study with Multi-Tenancy Support in SPDK's NVMe-over-Fabric Designs
- [HotI '23] Characterizing Lossy and Lossless Compression on Emerging BlueField DPU Architectures
- [HotInfra '23] On the Discontinuation of Persistent Memory: Looking Back to Look Forward
- [MIT Press '22] High-Performance Big Data Computing
- [VLDB '22] A Study of Database Performance Sensitivity to Experiment Settings
- [arXiv '22] Arcadia: A Fast and Reliable Persistent Memory Replicated Log
- [HPDC '22] NVMe-oAF: Towards Adaptive NVMe-oF for IO-Intensive Workloads on HPC Cloud
- [HiPC '22] HiBGT: High-Performance Bayesian Group Testing for COVID-19
- [Biostatistics '22] Bayesian Group Testing with Dilution Effects
- [TBench '22] Understanding Hot Interconnects with an Extensive Benchmark Survey
- [Bench '22] Benchmarking Object Detection Models with Mummy Nuts Dataset
- [IPDPS '21] NVMe-CR: A Scalable Ephemeral Storage Runtime for Checkpoint/Restart with NVMe-over-Fabrics
- [WORDS '21] Towards Offloadable and Migratable Microservices on Disaggregated Architectures: Vision, Challenges, and Research Roadmap
- [HPDC '21] DStore: A Fast, Tailless, and Quiescent-Free Object Store for PMEM
- [VLDB '21] Understanding the Idiosyncrasies of Real Persistent Memory
- [SC '21] HatRPC: Hint-Accelerated Thrift RPC over RDMA
- [Bench '20] Impact of Commodity Networks on Storage Disaggregation with NVMe-oF
- [SC '20] RDMP-KV: Designing Remote Direct Memory Persistence Based Key-Value Stores with PMEM
- [JCST '20] CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance
- [SC '20] INEC: Fast and Coherent In-Network Erasure Coding
- [Bench '19] Early Experience in Benchmarking Edge AI Processors with Object Detection Workloads
- [NVMW '19] Accelerating NVRAM-aware In-Memory Datastore with Remote Direct Memory Persistence
- [SC '19] Three-Dimensional Characterization on Edge AI Processors with Object Detection Workloads
- [TPDS '19] Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast
- [THPC '19] Performance Analysis of Deep Learning Workloads Using Roofline Trajectories
- [IISWC '19] SimdHT-Bench: Characterizing SIMD-Aware Hash Table Designs on Emerging CPU Architectures
- [SC '19] TriEC: Tripartite Graph Based Erasure Coding NIC Offload
- [HPDC '19] UMR-EC: A Unified and Multi-Rail Erasure Coding Library for High-Performance Distributed Storage Systems
- [HiPC '19] SCOR-KV: SIMD-Aware Client-Centric and Optimistic RDMA-Based Key-Value Store for Emerging CPU Architectures
- [SC '19] Designing High-Performance Erasure Coding Schemes for Next-Generation Storage Systems
- [IPDPS '19] C-GDR: High-Performance Container-Aware GPUDirect MPI Communication Schemes on RDMA Networks
- [HiPC '18] OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training
- [FITEE '18] Networking and Communication Challenges for Post-Exascale Systems
- [TMSCS '18] DLoBD: A Comprehensive Study of Deep Learning over Big Data Stacks on HPC Clusters
- [Bench '18] HPC AI500: A Benchmark Suite for HPC AI Systems
- [BigData '18] Spark-uDAPL: Cost-Saving Big Data Analytics on Microsoft Azure Cloud with RDMA Networks
- [BPOE '18] Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences
- [EuroMPI '18] Multi-Threading and Lock-Free MPI RMA Based Graph Processing on KNL and POWER Architectures
- [ISC '18] Can Unified-Memory Support on Pascal and Volta GPUs Enable Out-of-Core DNN Training?
- [Bench '18] A Survey on Deep Learning Benchmarks: Do We Still Need New Ones?
- [JPDC '18] MR-Advisor: A Comprehensive Tuning, Profiling, and Prediction Tool for MapReduce Execution Frameworks on HPC Clusters
- [UCC '18] Analyzing, Modeling, and Provisioning QoS for NVMe SSDs
- [NVMW '18] Accelerating MapReduce and DAG Execution Frameworks with Non-Volatile Memory and RDMA
- [Bench '18] EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures
- [HiPC '18] Accelerating TensorFlow with Adaptive RDMA-Based gRPC
- [SoCC '18] High-Performance Multi-Rail Erasure Coding Library over Modern Data Center Architectures: Early Experiences
- [CLUSTER '18] Cutting the Tail: Designing High Performance Message Brokers to Reduce Tail Latencies in Stream Processing
- [HotI '17] Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-Capable Networks
- [ICDCS '17] High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads
- [TPDS '17] A Comprehensive Study of MapReduce over Lustre for Intermediate Data Placement and Shuffle Strategies on HPC Clusters
- [BigData '17] Performance Characterization and Acceleration of Big Data Workloads on OpenPOWER System
- [HiPC '17] MPI-LiFE: Designing High-Performance Linear Fascicle Evaluation of Brain Connectome with MPI
- [BDCAT '17] Characterization of Big Data Stream Processing Pipeline: A Case Study Using Flink and Kafka
- [SC '17] Scalable Reduction Collectives with Data Partitioning-Based Multi-Leader Design
- [VEE '17] Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization Based HPC Cloud with SR-IOV Enabled InfiniBand
- [CLUSTER '17] A Scalable Network-Based Performance Analysis Tool for MPI on Large-Scale HPC Systems
- [BigData '17] NVMD: Non-Volatile Memory Assisted Design for Accelerating MapReduce and DAG Execution Frameworks on HPC Systems
- [NVMW '17] NRCIO: NVM-aware RDMA-based Communication and I/O Schemes for Big Data Analytics
- [Chapter '17] Building Efficient HPC Cloud with SR-IOV-Enabled InfiniBand: The MVAPICH2 Approach
- [UCC '17] Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?
- [BigData '17] Characterizing and Accelerating Indexing Techniques on Distributed Ordered Tables
- [ICPP '17] Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning
- [TCDE '17] Scalable and Distributed Key-Value Store-based Data Management Using RDMA-Memcached
- [CCGrid '17] Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud
- [UCC '17] HPC Meets Cloud: Building Efficient Clouds for HPC, Big Data, and Deep Learning Middleware and Applications
- [HiPC '17] Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand
- [IPDPS '17] High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV Enabled InfiniBand Clusters
- [BDCAT '16] Performance Characterization of Hadoop Workloads on SR-IOV-Enabled Virtualized InfiniBand Clusters
- [SBAC-PAD '16] MR-Advisor: A Comprehensive Tuning Tool for Advising HPC Users to Accelerate MapReduce Applications on Supercomputers
- [BigData '16] High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads
- [IPDPS '16] High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits
- [CloudCom '16] Designing Virtualization-Aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-Enabled Clouds
- [Euro-Par '16] Slurm-V: Extending Slurm for Building Efficient HPC Cloud with SR-IOV and IVShmem
- [ICPP '16] High Performance MPI Library for Container-Based HPC Cloud on InfiniBand Clusters
- [SC '16] Designing MPI Library with On-Demand Paging (ODP) of InfiniBand: Challenges and Benefits
- [ICS '16] High Performance Design for HDFS with Byte-Addressability of NVM and RDMA
- [JSC '16] Characterizing and Benchmarking Stand-Alone Hadoop MapReduce on Modern HPC Clusters
- [BigData '16] Boldio: A Hybrid and Resilient Burst-Buffer over Lustre for Accelerating Big Data I/O
- [CloudCom '16] Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase
- [XSEDE '16] Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet
- [BigData '16] Efficient Data Access Strategies for Hadoop and Spark on HPC Cluster with Heterogeneous Storage
- [Chapter '16] Accelerating Big Data Processing on Modern HPC Clusters
- [IPDPS Workshops '16] Performance Characterization of Hypervisor- and Container-Based Virtualization for HPC on SR-IOV Enabled InfiniBand Clusters
- [ISC '16] INAM2: InfiniBand Network Analysis and Monitoring with MPI
- [HiPC '16] Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA
- [PDSW-DISCS '16] Can Non-volatile Memory Benefit MapReduce Applications on HPC Clusters?
- [CCGRID '15] MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds
- [CLUSTER '15] High Performance MPI Datatype Support with User-Mode Memory Registration: Challenges, Designs, and Benefits
- [CCGRID '15] Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture
- [BigData '15] Benchmarking Key-Value Stores on High-Performance Storage and Interconnects for Web-Scale Workloads
- [ICDCS '15] Accelerating Apache Hive with MPI for Data Warehouse Systems
- [BigDataService '15] Modeling and Designing Fault-Tolerance Mechanisms for MPI-Based MapReduce Data Computing Framework
- [JCST '15] Accelerating Iterative Big Data Computing Through MPI
- [IPDPS Workshops '15] High-Performance Coarray Fortran Support with MVAPICH2-X: Initial Experience and Evaluation
- [OpenSHMEM '15] Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM
- [Euro-Par '15] High-Performance and Scalable Design of MPI-3 RMA on Xeon Phi Clusters
- [HiPC '15] High Performance OpenSHMEM Strided Communication Support with InfiniBand UMR
- [BPOE '15] A Plugin-Based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS
- [IPDPS '15] High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA
- [ICPP '15] Accelerating I/O Performance of Big Data Analytics on HPC Clusters through RDMA-Based Key-Value Store
- [ISPASS '15] Can RDMA Benefit Online Data Processing Workloads on Memcached and MySQL?
- [BigData '15] Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters
- [BigData '14] In-Memory I/O and Replication for HDFS with Memcached: Early Experiences
- [PGAS '14] Scalable MiniMD Design with Hybrid MPI and OpenSHMEM
- [PPoPP '14] Initial Study of Multi-Endpoint Runtime for MPI+OpenMP Hybrid Programming Model on Multi-Core Systems
- [Euro-Par '14] Can Inter-VM Shmem Benefit MPI Applications on SR-IOV Based Virtualized InfiniBand Clusters?
- [HiPC '14] High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters
- [NAS '14] Performance Characterization of Hadoop and DataMPI Based on Amdahl's Second Law
- [HotI '14] Accelerating Spark with RDMA for Big Data Processing: Early Experiences
- [CLUSTER '14] High Performance OpenSHMEM for Xeon Phi Clusters: Extensions, Runtime Designs and Application Co-Design
- [HPDC '14] SOR-HDFS: A SEDA-Based Approach to Maximize Overlapping in RDMA-Enhanced HDFS
- [ICPP '14] HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters
- [BPOE '14] On Big Data Benchmarking
- [BPOE '14] Performance Benefits of DataMPI: A Case Study with BigDataBench
- [ICPP '14] Performance Modeling for RDMA-Enhanced Hadoop MapReduce
- [CLUSTER '14] Scalable Graph500 Design with MPI-3 RMA
- [Euro-Par '14] MapReduce over Lustre: Can RDMA-Based Approach Benefit?
- [IPDPS '14] DataMPI: Extending MPI to Hadoop-Like Big Data Computing
- [ICS '14] HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects
- [BPOE '14] A Micro-benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks
- [PGAS '14] Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models
- [WBDB '13] A Micro-benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks
- [IPDPS Workshops '13] High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand
- [SoCC '13] Does RDMA-Based Enhanced Hadoop MapReduce Need a New Performance Model?
- [HotI '13] Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?
- [ICPP '13] High-Performance Design of Hadoop RPC with RDMA over InfiniBand
- [CCGRID '13] SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience
- [CLUSTER '13] A Scalable and Portable Approach to Accelerate Hybrid HPL on Heterogeneous CPU-GPU Clusters
- [WBDB '12] A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters
- [ISPA '11] Vega LingCloud: A Resource Single Leasing Point System to Support Heterogeneous Application Modes on Shared Infrastructure
- [ICPP Workshops '11] Can MPI Benefit Hadoop and MapReduce Applications?
- [NAS '10] VegaWarden: A Uniform User Management System for Cloud Applications
- [SERVICES '10] Investigating, Modeling, and Ranking Interface Complexity of Web Services on the World Wide Web
- [NPC '10] JAMILA: A Usable Batch Job Management System to Coordinate Heterogeneous Clusters and Diverse Applications over Grid or Cloud Infrastructure
- [SERVICES '09] A Model of Message-Based Debugging Facilities for Web or Grid Services
- [PDCAT '09] ICOMC: Invocation Complexity of Multi-Language Clients for Classified Web Services and Its Impact on Large Scale SOA Applications
- [PDCAT '08] An Experimental Analysis for Memory Usage of GOS Core
News Posts