Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters

The Journal of Supercomputing, 2016

Dipti Shankar, Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dhabaleswar K. Panda

Abstract

With the emergence of high-performance data analytics, the Hadoop platform is increasingly being used to process data stored on high-performance computing clusters. While there is immense scope for improving the performance of Hadoop MapReduce (including the network-intensive shuffle phase) over these modern clusters, which are equipped with high-speed interconnects such as InfiniBand and 10/40 GigE and storage systems such as SSDs and Lustre, it is essential to study the MapReduce component in isolation. In this paper, we study popular MapReduce workloads, obtained from well-accepted, comprehensive benchmark suites, to identify common shuffle data distribution patterns. We determine the environmental and workload-specific factors that affect the performance of a MapReduce job. Based on these characterization studies, we propose a micro-benchmark suite that can be used to evaluate the performance of stand-alone Hadoop MapReduce, and demonstrate its ease of use with different networks/protocols, Hadoop distributions, and storage architectures. Performance evaluations with our proposed micro-benchmarks show that stand-alone Hadoop MapReduce over IPoIB performs better than 10 GigE by about 13–15

Journal Article

Da
2016/12/01
Date-added
2021-02-03 03:52:03 +0000
Date-modified
2021-02-03 03:52:03 +0000
Doi
10.1007/s11227-016-1760-5
Id
Shankar2016
Isbn
1573-0484
Journal
The Journal of Supercomputing
Number
12
Pages
4573–4600
Ty
JOUR
Volume
72
Series
JSC '16
Bdsk-url-1
https://doi.org/10.1007/s11227-016-1760-5
