This paper was published in the Computing Research Repository (CoRR) in 2013, and it reviews a wide set of research work targeting the MapReduce framework. Part of the reviewed work aims at improving the performance of MapReduce, either by implementing workarounds through the user API to express complex operators, or by modifying Hadoop's source code directly. The paper starts by presenting work on processing join operations and iterative jobs over MapReduce and on overcoming its API limitations. It then discusses systems that target resource sharing and horizontal and vertical data sharing to improve the performance of MapReduce and reduce its processing overhead. The paper also covers data access optimizations over the Hadoop Distributed File System (HDFS), including the use of indices, column stores, and improved data locality through file co-location strategies. Other researchers, in contrast, try to improve the performance of MapReduce by decoupling HDFS from Hadoop, using pipelined streaming and incremental writes to HDFS. Since MapReduce depends on a large number of user-defined parameters, researchers have suggested profiling users' jobs and recommending parameter tunings according to the application's behaviour. The paper also includes systems that try to optimize MapReduce's failure recovery and redundancy by optimizing job placement, improving Hadoop's performance. After reviewing the work on MapReduce optimizations, the paper surveys systems built on top of Hadoop that provide reliable and scalable high-level operations by expressing them as a series of MapReduce jobs. Finally, the paper finishes by reviewing systems that are based on the MapReduce architecture but provide more flexible operations and support a wider range of applications.
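To make the join workarounds mentioned above concrete, the classic API-level technique is a reduce-side (repartition) join: each mapper tags its records with the source table, and the reducer combines matching rows per key. A minimal sketch in plain Python follows; the data, function names, and record layouts are illustrative assumptions of mine, not taken from any specific system in the survey.

```python
from collections import defaultdict
from itertools import product

# Map phase: tag each record with its source table so the reducer
# can tell the two inputs apart after the shuffle.
def map_users(record):
    user_id, name = record  # hypothetical schema: (user_id, name)
    yield user_id, ("users", name)

def map_orders(record):
    order_id, user_id, amount = record  # hypothetical schema
    yield user_id, ("orders", (order_id, amount))

# Shuffle phase: group all tagged values by join key,
# mimicking what the MapReduce framework does between phases.
def shuffle(mapped):
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

# Reduce phase: for each key, pair every users row with every
# orders row -- an inner equi-join on user_id.
def reduce_join(key, values):
    users = [v for tag, v in values if tag == "users"]
    orders = [v for tag, v in values if tag == "orders"]
    for name, (order_id, amount) in product(users, orders):
        yield key, name, order_id, amount

users = [(1, "ada"), (2, "bob")]
orders = [(10, 1, 99.0), (11, 1, 5.0), (12, 3, 7.5)]

mapped = [kv for r in users for kv in map_users(r)]
mapped += [kv for r in orders for kv in map_orders(r)]

result = sorted(
    row
    for key, values in shuffle(mapped).items()
    for row in reduce_join(key, values)
)
print(result)  # [(1, 'ada', 10, 99.0), (1, 'ada', 11, 5.0)]
```

The key limitation this illustrates is that the join logic lives entirely in user code: the framework only provides map, shuffle, and reduce, which is exactly why the surveyed systems either build workarounds like this through the API or modify Hadoop itself.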
In my opinion, the programming paradigm of MapReduce is not new, since its underlying techniques have been well researched for more than two decades. However, researchers have recently been able to produce a large number of papers about MapReduce thanks to Apache's Hadoop MapReduce, which has been freely available since 2007. I wonder whether Google or Apache takes the optimizations suggested by the research community seriously, or considers them overkill for their applications.