Accelerating Iterative Big Data Computing Through MPI

Journal of Computer Science and Technology, 2015

Fan Liang, Xiaoyi Lu

Abstract

Current popular systems such as Hadoop and Spark cannot achieve satisfactory performance when running iterative big data applications because they overlap computation and communication inefficiently. The pipeline of computing, data movement, and data management plays a key role in current distributed data computing systems. In this paper, we first analyze the overhead of the shuffle operation in Hadoop and Spark when running the PageRank workload. We then propose DataMPI-Iteration, an MPI-based library for iterative big data computing, built on an event-driven pipeline and in-memory shuffle design that better overlaps computation and communication. Our performance evaluation shows that DataMPI-Iteration achieves a 9X∼21X speedup over Apache Hadoop and a 2X∼3X speedup over Apache Spark for PageRank and K-means.

Journal Article

Da
2015/03/01
Date-added
2021-02-03 03:39:54 +0000
Date-modified
2021-02-03 03:39:54 +0000
Doi
10.1007/s11390-015-1522-5
Id
Liang2015
Isbn
1860-4749
Journal
Journal of Computer Science and Technology
Number
2
Pages
283–294
Ty
JOUR
Volume
30
Series
JCST '15
Bdsk-url-1
https://doi.org/10.1007/s11390-015-1522-5
