Dipti Shankar, Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dhabaleswar K. Panda
With the emergence of high-performance data analytics, the Hadoop platform is being increasingly used to process data stored on high-performance computing clusters. While there is immense scope for improving the performance of Hadoop MapReduce (including the network-intensive shuffle phase) over these modern clusters, that are equipped with high-speed interconnects such as InfiniBand and 10/40 GigE, and storage systems such as SSDs and Lustre, it is essential to study the MapReduce component in an isolated manner. In this paper, we study popular MapReduce workloads, obtained from well-accepted, comprehensive benchmark suites, to identify common shuffle data distribution patterns. We determine different environmental and workload-specific factors that affect the performance of the MapReduce job. Based on these characterization studies, we propose a micro-benchmark suite that can be used to evaluate the performance of stand-alone Hadoop MapReduce, and demonstrate its ease-of-use with different networks/protocols, Hadoop distributions, and storage architectures. Performance evaluations with our proposed micro-benchmarks show that stand-alone Hadoop MapReduce over IPoIB performs better than 10 GigE by about 13–15