Handling data skew in MapReduce

Benjamin Gufler, Nikolaus Augsten, Angelika Reiser, Alfons Kemper

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

99 Scopus citations

Abstract

MapReduce systems have become popular for processing large data sets and are increasingly being used in e-science applications. In contrast to simple application scenarios like word count, e-science applications involve complex computations which pose new challenges to MapReduce systems. In particular, (a) the runtime complexity of the reducer task is typically high, and (b) scientific data is often skewed. This leads to highly varying execution times for the reducers. Varying execution times result in low resource utilisation and high overall execution time since the next MapReduce cycle can only start after all reducers are done. In this paper we address the problem of efficiently processing MapReduce jobs with complex reducer tasks over skewed data. We define a new cost model that takes into account non-linear reducer tasks and we provide an algorithm to estimate the cost in a distributed environment. We propose two load balancing approaches, fine partitioning and dynamic fragmentation, that are based on our cost model and can deal with both skewed data and complex reduce tasks. Fine partitioning produces a fixed number of data partitions, dynamic fragmentation dynamically splits large partitions into smaller portions and replicates data if necessary. Our approaches can be seamlessly integrated into existing MapReduce systems like Hadoop. We empirically evaluate our solution on both synthetic data and real data from an e-science application.

Original languageEnglish
Title of host publicationCLOSER 2011 - Proceedings of the 1st International Conference on Cloud Computing and Services Science
Pages574-583
Number of pages10
StatePublished - 2011
Event1st International Conference on Cloud Computing and Services Science, CLOSER 2011 - Noordwijkerhout, Netherlands
Duration: 7 May 20119 May 2011

Publication series

NameCLOSER 2011 - Proceedings of the 1st International Conference on Cloud Computing and Services Science

Conference

Conference1st International Conference on Cloud Computing and Services Science, CLOSER 2011
Country/TerritoryNetherlands
CityNoordwijkerhout
Period7/05/119/05/11

Keywords

  • Data skew
  • Load balancing
  • MapReduce

Fingerprint

Dive into the research topics of 'Handling data skew in MapReduce'. Together they form a unique fingerprint.

Cite this