Soft IT or SDDC (Software-Defined Data Center)

Jnan Dash's Weblog

I like this major new trend known as the software-defined data center; it will be a highly disruptive force in enterprise computing. Gone are the days of the expensive physical data center owned by a large corporation. Now we are seeing the rise of “soft” infrastructure: virtual machines, virtual networks, and virtual storage can be provisioned and reconfigured rapidly and in a highly automated way, rather than being limited by the constraints of hardware infrastructure built for a much less dynamic environment. Most of all, it makes great economic sense, since resource utilization becomes far more efficient.

The “software-defined data center,” as it is commonly known, has business repercussions that go well beyond transforming data center technology. It has shaken long-standing alliances between technology giants, and vendors are scrambling to reposition themselves to best exploit this new era of soft IT. VMware, which specialized in the server virtualization business, is expanding to…

View original post 238 more words

Liveblog from EuroSys 2013

Hi from all of us here in Prague — this is day 2 of Eurosys and we’ll be running the live blog as usual!

Your friendly bloggers are Natacha Crooks (nscc), Ionel Gog (icg), Valentin Dalibard (vd) and Malte Schwarzkopf (ms).

Session 1: Large scale distributed computation II

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Zuhair Khayyat, Karim Awara, and Amani Alonazi (King Abdullah University of Science and Technology), Hani Jamjoom and Dan Williams (IBM T. J. Watson Research Center, Yorktown Heights), and Panos Kalnis (King Abdullah University of Science and Technology)

Natacha Crooks – Mizan: a system for dynamic load balancing in large-scale graph processing

Liveblog from EuroSys 2013 — Day 2 « syslog.


Huan Liu's Blog

How do you compare the cost of two cloud or IaaS offerings? Is Amazon EC2’s small instance (1 ECU, 1.7GB RAM, 160GB storage) cheaper, or is Rackspace cloud’s 256MB server (4 cores, 256MB RAM, 10GB storage) cheaper? Unfortunately, answering this question is very difficult. One reason is that cloud vendors offer virtual machines with different configurations, i.e., different combinations of CPU power, memory, and storage, making it difficult to perform an apples-to-apples comparison.

Towards the goal of a better apples-to-apples comparison, I will break down the cost of CPU, memory, and storage individually for Amazon EC2 in this post. For those not interested in the methodology, the high-level conclusions are as follows. In Amazon’s N. Virginia data center, the unit costs are:

  • 1 ECU costs $0.01369/hour
  • 1 GB of RAM costs $0.0201/hour
  • 1 GB of local storage costs $0.000159/hour
  • A 10GB network interface costs $0.41/hour
  • A…

View original post 817 more words
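As a quick sketch of how the unit costs quoted above compose, the snippet below prices an instance from its resource mix, using the small-instance specs mentioned in the excerpt (1 ECU, 1.7GB RAM, 160GB storage). The function and its structure are my own illustration, not from the original post, and they ignore network and EBS charges.

```python
# Unit costs from the excerpt (Amazon N. Virginia region, per hour).
ECU_COST = 0.01369      # per ECU
RAM_COST = 0.0201       # per GB of RAM
DISK_COST = 0.000159    # per GB of local storage

def hourly_cost(ecus, ram_gb, disk_gb):
    """Estimate an instance's hourly cost from its resource mix."""
    return ecus * ECU_COST + ram_gb * RAM_COST + disk_gb * DISK_COST

# EC2 small instance: 1 ECU, 1.7 GB RAM, 160 GB storage.
small = hourly_cost(1, 1.7, 160)   # ≈ $0.0733/hour
```

Summing the components this way makes it easy to see, for example, that RAM dominates the small instance’s cost under these unit prices.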

Very interesting article about the EC2 environment

Huan Liu's Blog

Ever wonder what hardware is running behind Amazon’s EC2? Why would you even care? Well, there are at least a couple of reasons.

  1. Side-by-side comparisons. Amazon expresses its machine power in terms of EC2 compute units (ECU), while other cloud providers express theirs simply as a number of cores. In either case, the figure is vague, and you cannot make an economic comparison between different cloud offerings, or against an own-your-own-hardware approach. Knowing what an EC2 compute unit amounts to in raw hardware power lets you perform an apples-to-apples comparison.
  2. Physical isolation. In many enterprise clients’ minds, security is the number one concern. Even though hypervisor isolation is robust, they feel more comfortable with physical separation, i.e., they do not want their VM to sit on the same physical hardware right next to a hacker’s VM. Knowing the hardware computing power and the largest VM’s computing…

View original post 1,125 more words


Amazon EC2 t1.micro Instance Performance Benchmarks | ServeTheHome – Server and Workstation Reviews.

An often-asked question is whether one should use an Amazon EC2 t1.micro instance or a VMware ESXi 5.0 server at home. For this test I decided to use the Linux 64-bit version of Geekbench by Primate Labs, a fairly popular benchmark that does a decent job of quickly profiling the performance of environments. I had planned to do this piece using an older version of Geekbench a few months ago, but I saw some anomalies with performance well below what I expected. Looking again four months later, I think the results are more in line with what I was expecting.

Here is the big question for many users: should I build my own server using ESXi 5.0, or use the Amazon cloud for development and testing work? I think there are a few really good reasons to go with the Amazon EC2 cloud over a build-your-own approach: moving things into production is much easier, and Amazon is building a fairly robust infrastructure behind its cloud offerings that allows massive scale-out. A lot of people think that anything in the cloud must be faster, even if they are paying $0.02/hour or $0.48/day. For one or two development instances, Amazon’s EC2 offering is very compelling. For those wondering, this site has been running in the Amazon EC2 cloud for about a year now, and I do maintain one development instance for testing.

Read the rest of the article at:
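For readers weighing the $0.02/hour figure against owning hardware, a back-of-the-envelope comparison can be sketched as follows. The $600 server price and 3-year amortization are hypothetical assumptions of mine, not figures from the article, and the sketch ignores electricity and bandwidth.

```python
MICRO_HOURLY = 0.02        # t1.micro on-demand rate quoted in the article

# Hypothetical home-server assumptions (NOT from the article):
SERVER_PRICE = 600.0       # one-off hardware cost, USD
AMORTIZE_MONTHS = 36       # write the hardware off over 3 years

ec2_daily = MICRO_HOURLY * 24        # $0.48/day, matching the article's figure
ec2_monthly = MICRO_HOURLY * 730     # ~730 hours in an average month
server_monthly = SERVER_PRICE / AMORTIZE_MONTHS
```

Under these assumptions a single always-on micro instance runs about $14.60/month versus roughly $16.67/month for the amortized home server, which is part of why one or two development instances are so compelling; the balance shifts quickly once you need several VMs on one physical box.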


Huan Liu's Blog

Amazon today announced a new instance type called “Micro instances” (t1.micro). The official announcement states that it comes with 613MB of RAM and up to 2 ECU of compute power. It also supports both 32-bit and 64-bit operation. Starting at $0.02/hour, Micro instances are the least expensive instances offered by AWS.

To understand Micro instances, let us first look at the underlying physical hardware powering them. In a previous post, we analyzed AWS’s physical hardware and the ECU. Using the same methodology, we see that Micro instances use the same physical hardware as the standard instances, i.e., systems based on the single-socket Intel E5430 processor. In fact, they probably run on the same clusters as the standard instances.

To understand the actual computing power they deliver, we run a CPU-intensive application that tries to grab as much CPU as we are allowed. We then use the UNIX command top to…

View original post 395 more words
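The measurement approach the excerpt describes — grab as much CPU as the hypervisor allows, then observe the share actually granted — can be sketched roughly like this. This is my own minimal illustration, not Huan Liu’s actual tool.

```python
import time

def busy_work(duration=1.0):
    """Spin on CPU-bound work for ~duration seconds of wall-clock time.

    The iteration count is a crude proxy for how much CPU the hypervisor
    actually granted us: on a throttled instance (e.g. a t1.micro after
    its burst allowance runs out) the count drops sharply even though
    the wall-clock duration is the same.
    """
    deadline = time.monotonic() + duration
    count = 0
    while time.monotonic() < deadline:
        count += 1
    return count

baseline = busy_work(0.5)   # run repeatedly and compare counts over time
```

Watching the process in top while this runs (the %CPU column) shows the granted share directly, which is the observation the post builds on.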

Review of “The Family of MapReduce and Large Scale Data Processing Systems”

This paper was published in the Computing Research Repository (CoRR) in 2013, and it reviews a wide range of research targeting the MapReduce framework. Some of the reviewed papers focus on improving MapReduce performance, either by implementing workarounds through the user API to support complex operators, or by modifying Hadoop’s source code directly. The paper starts with the work on processing join operations and iterative jobs over MapReduce and on overcoming the API’s limitations. It then discusses systems that target resource sharing and horizontal and vertical data sharing to improve MapReduce performance and reduce its processing overhead. The paper also covers data-access optimizations over the Hadoop file system (HDFS), including indices, column stores, and improved data locality via file co-location strategies. Other researchers try to improve performance by decoupling HDFS from Hadoop, using pipelined streaming and incremental writes to HDFS. Since MapReduce depends on a large number of user-defined parameters, researchers have suggested profiling users’ jobs and recommending parameter tunings according to the application’s behaviour. The paper also includes systems that try to optimize MapReduce’s failure recovery and redundancy through better job placement. After reviewing the work on MapReduce optimizations, the paper surveys systems built on top of Hadoop that provide reliable and scalable high-level operations by expressing them as a series of MapReduce jobs. Finally, it closes with a review of systems that are built on the MapReduce architecture but provide more flexible operations and support a wider range of applications.
In my opinion, the programming paradigm of MapReduce is not new, since its techniques have been well researched for more than two decades. However, researchers have recently been able to produce many papers about MapReduce thanks to Apache’s Hadoop MapReduce, which has been freely available since 2007. I wonder whether Google or Apache take the optimizations suggested by the research community seriously, or regard them as overkill for their applications.
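For readers new to the paradigm the survey covers, the map/shuffle/reduce pipeline can be sketched in a few lines. This toy word count is my own illustration of the programming model, not code from the paper or from Hadoop; a real framework would distribute each phase across machines.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    # Apply the user-defined map function to every input record.
    return [kv for rec in records for kv in mapper(rec)]

def shuffle(pairs):
    # Group intermediate (key, value) pairs by key, as the framework would.
    pairs.sort(key=itemgetter(0))
    return [(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=itemgetter(0))]

def reduce_phase(groups, reducer):
    return [reducer(k, vs) for k, vs in groups]

# Classic word count expressed as user-defined map and reduce functions.
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return (word, sum(counts))

lines = ["big data big deal", "data is the new currency"]
counts = reduce_phase(shuffle(map_phase(lines, wc_map)), wc_reduce)
```

The API limitations the survey discusses show up immediately in a model like this: anything beyond a per-record map and a per-key reduce (joins, iteration) has to be forced into chains of such jobs, which is exactly what many of the reviewed systems work around.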

Big Data should be Interesting Data! | Gerhard Weikum

Where’s the Data in the Big Data Wave? | Gerhard Weikum

There are various definitions of Big Data; most center around a number of V’s like volume, velocity, variety, veracity – in short: interesting data (interesting in at least one aspect). However, when you look into research papers on Big Data, in SIGMOD, VLDB, or ICDE, the data that you see here in experimental studies is utterly boring. Performance and scalability experiments are often based on the TPC-H benchmark: completely synthetic data with a synthetic workload that has been beaten to death for the last twenty years. Data quality, data cleaning, and data integration studies are often based on bibliographic data from DBLP, usually old versions with less than a million publications, prolific authors, and curated records. I doubt that this is a real challenge for tasks like entity linkage or data cleaning. So where’s the – interesting – data in Big Data research?

Check the rest of the post at:

‘Big data’ is dead. What’s next? | VentureBeat

‘Big data’ is dead. What’s next? | VentureBeat.

This is a guest post by technology executive John De Goes

“Big data” is dead. Vendors killed it. Well, industry leaders helped, and the media got the ball rolling, but vendors hold the most responsibility for the painful, lingering death of one of the most overhyped and poorly understood terms since the phrase “cloud computing.”

Five Questions around Big Data

Jnan Dash's Weblog

Data is the new currency of business, and we are in the era of data-intensive computing. Much has been written about Big Data throughout 2012, and customers around the world are struggling to figure out its significance to their businesses. Someone said there are three I’s to Big Data:

  • Immediate (I must do something right away)
  • Intimidating (what will happen if I don’t take advantage of Big Data)
  • Ill-defined (the term is so broad that I’m not clear what it means).

In this blog post, I would like to pose five key questions that customers must find answers to with regards to Big Data. So here goes.

1. Do I understand my data and do I have a data strategy?

There are varieties of data – customer transaction data, operational data, documents/emails and other unstructured data, clickstream data, sensor data, audio streams, video streams, etc. Do I have a clear…

View original post 553 more words