Hi from all of us here in Prague! This is day 2 of EuroSys, and we’ll be running the live blog as usual.
Your friendly bloggers are Natacha Crooks (nscc), Ionel Gog (icg), Valentin Dalibard (vd) and Malte Schwarzkopf (ms).
Session 1: Large scale distributed computation II
Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
Zuhair Khayyat, Karim Awara, and Amani Alonazi (King Abdullah University of Science and Technology), Hani Jamjoom and Dan Williams (IBM T. J. Watson Research Center, Yorktown Heights), and Panos Kalnis (King Abdullah University of Science and Technology)
Natacha Crooks – Mizan: a system for dynamic load balancing in large-scale graph processing
Liveblog from EuroSys 2013 — Day 2 « syslog.
Amazon EC2 t1.micro Instance Performance Benchmarks | ServeTheHome – Server and Workstation Reviews.
An often-asked question is whether one should use an Amazon EC2 t1.micro instance or a VMware ESXi 5.0 server at home. For this test I decided to use the Linux 64-bit version of Geekbench by Primate Labs, a fairly popular benchmark that does a decent job of quickly profiling the performance of environments. I had planned to do this piece using an older version of Geekbench a few months ago, but I saw some anomalies, with performance well below what I expected. I took another look four months later, and I think the results are more in line with what I was expecting. Here is the big question for many users: should I build my own server using ESXi 5.0 or use the Amazon cloud for development and testing work? I think that there are a few really good reasons to go with the Amazon EC2 cloud over a build-your-own approach: moving things into production is much easier, and Amazon is building a fairly robust infrastructure behind its cloud offerings that does allow massive scale-out. A lot of people think that anything in the cloud must be faster, even if they are paying $0.02/hour or $0.48/day. For one or two development instances, Amazon’s EC2 offering is very compelling. For those wondering, this site has been running in the Amazon EC2 cloud for about a year now, and I do maintain one development instance for testing.
Read the rest of the article at: http://www.servethehome.com/amazon-ec2-t1micro-instance-performance-benchmarks/
This paper was published in the Computing Research Repository (CoRR) in 2013, and it reviews a wide range of research work targeting the MapReduce framework. Some of the reviewed papers improve the performance of MapReduce either by implementing workarounds through the user API to express complex operators, or by modifying Hadoop’s source code. The paper starts with work on processing join operations and iterative jobs over MapReduce and on overcoming its API limitations. It then covers systems that target resource sharing, including horizontal and vertical data sharing, to improve the performance of MapReduce and reduce its processing overhead. The paper also discusses data access optimizations over the Hadoop file system (HDFS), including indices, column stores, and improved data locality through file co-location strategies. Other researchers try to improve the performance of MapReduce by decoupling HDFS from Hadoop using pipelining, streaming and incremental writes to HDFS. Since MapReduce depends on a large number of user-defined parameters, researchers have suggested profiling users’ jobs and recommending parameter tunings according to the application behaviour. The paper also covers systems that try to optimize the failure recovery and redundancy of MapReduce by optimizing job placement to improve Hadoop’s performance. After reviewing the work on MapReduce optimizations, the paper covers systems built over Hadoop that provide reliable and scalable high-level operations by expressing them as a series of MapReduce jobs. Finally, the paper finishes by reviewing systems that are based on the architecture of MapReduce but provide more flexible operations and support a wider range of applications.
In my opinion, the programming paradigm of MapReduce is not new, since its techniques have been well researched for more than two decades. However, researchers have recently been able to produce lots of papers about MapReduce thanks to Apache’s Hadoop MapReduce, which has been available for free since 2007. I wonder whether either Google or Apache takes the optimizations suggested by the research community seriously, or thinks they are just overkill for their applications.
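To make the "join over MapReduce" workaround mentioned in the summary above more concrete, here is a minimal sketch of a reduce-side join written in plain Python rather than against Hadoop's API. It is not taken from the survey. The function names, the `"U"`/`"O"` tags, and the sample `users`/`orders` tables are all assumptions made for illustration, and the in-memory `shuffle` only stands in for the framework's shuffle step.

```python
from collections import defaultdict
from itertools import product


def map_phase(users, orders):
    """Emit (join_key, tagged_value) pairs; the tag records the source table."""
    for user_id, name in users:
        yield user_id, ("U", name)    # "U" tag: hypothetical users table
    for user_id, item in orders:
        yield user_id, ("O", item)    # "O" tag: hypothetical orders table


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Split the values by tag and emit the cross product for this key."""
    names = [v for tag, v in values if tag == "U"]
    items = [v for tag, v in values if tag == "O"]
    for name, item in product(names, items):
        yield key, name, item


def reduce_side_join(users, orders):
    """Run map, shuffle and reduce in-process and return the joined rows."""
    groups = shuffle(map_phase(users, orders))
    result = []
    for key in sorted(groups):        # sorted only to make output deterministic
        result.extend(reduce_phase(key, groups[key]))
    return result


if __name__ == "__main__":
    users = [(1, "ada"), (2, "bob"), (3, "cy")]
    orders = [(1, "disk"), (2, "ram"), (1, "cpu")]
    for row in reduce_side_join(users, orders):
        print(row)
```

The point of the sketch is that the join logic lives entirely in user code: the mapper tags records, and the reducer separates them again. This is the kind of API workaround the summary refers to, as opposed to a change inside the framework itself.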
Where’s the Data in the Big Data Wave? | Gerhard Weikum
There are various definitions of Big Data; most center around a number of V’s like volume, velocity, variety, veracity – in short: interesting data (interesting in at least one aspect). However, when you look into research papers on Big Data, in SIGMOD, VLDB, or ICDE, the data that you see here in experimental studies is utterly boring. Performance and scalability experiments are often based on the TPC-H benchmark: completely synthetic data with a synthetic workload that has been beaten to death for the last twenty years. Data quality, data cleaning, and data integration studies are often based on bibliographic data from DBLP, usually old versions with less than a million publications, prolific authors, and curated records. I doubt that this is a real challenge for tasks like entity linkage or data cleaning. So where’s the – interesting – data in Big Data research?
Check the rest of the post at: http://wp.sigmod.org/?p=786
‘Big data’ is dead. What’s next? | VentureBeat.
This is a guest post by technology executive John De Goes
“Big data” is dead. Vendors killed it. Well, industry leaders helped, and the media got the ball rolling, but vendors hold the most responsibility for the painful, lingering death of one of the most overhyped and poorly understood terms since the phrase “cloud computing.”