Review of “The Family of MapReduce and Large Scale Data Processing Systems”

This paper appeared in the Computing Research Repository (CoRR) in 2013, and it reviews a wide range of research work targeting the MapReduce framework. Part of the reviewed work improves the performance of MapReduce either by implementing complex operators as workarounds on top of the user API, or by modifying Hadoop's source code. The paper starts with the work done on processing join operations and iterative jobs over MapReduce, and how that work overcomes the limitations of the MapReduce API. It then covers systems that target resource sharing, including horizontal and vertical data sharing, to improve the performance of MapReduce and reduce its processing overhead. The paper also discusses data access optimizations over the Hadoop Distributed File System (HDFS), which include indices, column stores, and improved data locality through file co-location strategies. Other researchers try to improve the performance of MapReduce by decoupling HDFS from Hadoop, using pipelining, streaming and incremental writes to HDFS. Since MapReduce depends on a large number of user-defined parameters, researchers have suggested profiling the user's jobs and recommending parameter tunings according to the application's behaviour. The paper also covers systems that try to optimize the failure recovery and redundancy of MapReduce by optimizing job placement, again to improve Hadoop's performance. After reviewing the work on MapReduce optimizations, the paper turns to systems built on top of Hadoop that provide reliable and scalable high-level operations by expressing them as a series of MapReduce jobs. Finally, the paper finishes by reviewing systems that are based on the architecture of MapReduce but provide more flexible operations and support a wider range of applications.
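
To make the join discussion above more concrete, here is a minimal sketch of a reduce-side (repartition) join, one common way of expressing a join within the map/shuffle/reduce API. It is plain Python that only imitates the three phases in memory; the function names, the "L"/"R" tags and the sample records are my own illustration and come from neither the paper nor Hadoop's API.

```python
from collections import defaultdict


def map_phase(records, tag, key_field):
    """Emit (join_key, (tag, record)) pairs, as a mapper would."""
    for record in records:
        yield record[key_field], (tag, record)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, tagged_records):
    """Pair every left record with every right record that shares the key."""
    left = [rec for tag, rec in tagged_records if tag == "L"]
    right = [rec for tag, rec in tagged_records if tag == "R"]
    for l_rec in left:
        for r_rec in right:
            yield {**l_rec, **r_rec}


def reduce_side_join(left, right, key_field):
    """Inner join of two lists of dicts on key_field (illustrative only)."""
    pairs = list(map_phase(left, "L", key_field))
    pairs += list(map_phase(right, "R", key_field))
    joined = []
    for key, values in shuffle(pairs).items():
        joined.extend(reduce_phase(key, values))
    return joined
```

Calling `reduce_side_join(users, orders, "uid")` on two small lists of dictionaries returns one merged record per matching pair. The sketch also hints at why joins are awkward in MapReduce: both inputs have to be tagged and pushed through the shuffle, which is the kind of overhead the surveyed join techniques try to reduce.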
In my opinion, the MapReduce programming paradigm is not new, since its techniques have been well researched for more than two decades. However, researchers have recently been able to produce many papers about MapReduce thanks to Apache's Hadoop MapReduce, which has been freely available since 2007. I wonder whether Google or Apache takes the optimizations suggested by the research community seriously, or considers them overkill for their applications.


One response to “Review of ‘The Family of MapReduce and Large Scale Data Processing Systems’”

  1. I want to ask a question: if we have a finite presentation of a group, but the orders of the generators are unknown, can we still draw a Cayley graph of that group? If yes, by what means?
