Configuring Mizan on Ubuntu

Introduction

This tutorial shows the basic steps to install and configure Mizan’s dependencies from the command line on an Ubuntu 12.04 machine. We assume the operating system user name is “ubuntu”, but you can replace it with your own user name. Make sure that you have installed all of the following software before compiling Mizan:

  1. make and a C++ compiler (build-essential)
  2. Boost C++ library
  3. C++ threadpool library V 0.2.5
  4. MPICH2 3.0.2
  5. Java Development Kit (JDK) SE 1.6
  6. Hadoop MapReduce 1.0.4
  7. Metis/ParMetis (optional)

After you install and configure all of the above software, you have to configure Mizan’s environment variables to be able to compile Mizan’s code successfully. Note that you need to install the same software on every worker in your cluster.

Configuring Ubuntu

  • Configuring your hostname

We assume a cluster of machines in which each worker has a unique hostname (cloud1, cloud2…, cloudX). Both Hadoop and Mizan require that each worker knows the IPs of the other workers. To achieve this, you have to add the workers’ IPs to “/etc/hosts”. If you don’t know the IP of your worker, run the following command to get it:

ifconfig

Make sure that you use your own machine’s hostname in the rest of the tutorial; we use “cloud1” as our worker’s hostname. To get your machine’s hostname, run the following command:

hostname

To change your hostname, open file “/etc/hostname”:

sudo nano /etc/hostname

Then write your new hostname, save the file and reboot the machine. Now let’s put all of your cluster’s hostnames in the “hosts” file. Open “/etc/hosts” with the following command:

sudo nano /etc/hosts

Then add the IP of each worker in your cluster at the end of the file, using the following format:

10.70.42.1     cloud1
10.70.42.2     cloud2
10.70.42.3     cloud3
.
.
.
{IP}           {HOSTNAME}

Comment out all entries with IPs of the form “127.0.x.x” using a “#”, to avoid any possible confusion between IPs, and save the file. You need to add the same entries on the other workers:

#127.0.0.1       localhost
#127.0.1.1       cloud1
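
To verify that the hostnames resolve correctly, you can ping each worker by name (a quick check, assuming the example hostnames above):

ping -c 1 cloud2     # should resolve to 10.70.42.2 and get a reply
ping -c 1 cloud3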

  • Configuring your SSH keys for password-less access:

Make sure that your machine has “openssh” installed. Run the following command to install “openssh” server and client:

sudo apt-get install openssh-server openssh-client

If you don’t have a public key for your workers, generate a new key pair on your cluster’s main machine:

ssh-keygen -t rsa -P ""

Append your new public key to “authorized_keys”:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

You can use the same key pair for all of your other workers by copying the “.ssh” folder:

scp -r ~/.ssh cloud2:
scp -r ~/.ssh cloud3:
.
.
.
scp -r ~/.ssh cloudX:
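
Note that copying the whole “.ssh” folder also distributes your private key to every worker. Alternatively (a sketch, assuming the “ssh-copy-id” helper is available on your system), you can append only the public key to each worker’s “authorized_keys”:

ssh-copy-id ubuntu@cloud2
ssh-copy-id ubuntu@cloud3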

You can now test if you can access other workers without requiring a password:

ssh cloud2

  • Installing make and C++ compiler:

sudo apt-get install build-essential

  • Installing the Boost C++ library

sudo apt-get install libboost-all-dev
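
You can confirm which Boost version was installed by inspecting its version header (the path below is where the Ubuntu package places it):

grep "BOOST_LIB_VERSION" /usr/include/boost/version.hpp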

  • Installing the threadpool library

Download and unzip the threadpool library from http://threadpool.sourceforge.net/. You need to move the threadpool files into your Boost include folder. We assume that the main Boost include folder is “/usr/include/boost”:

wget http://prdownloads.sourceforge.net/threadpool/threadpool-0_2_5-src.zip
unzip threadpool-0_2_5-src.zip
cd threadpool-0_2_5-src/threadpool/boost/
sudo mv threadpool /usr/include/boost/
sudo mv threadpool.hpp /usr/include/boost/
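
To check that the headers are in place, you can compile a minimal program against the threadpool library (a smoke-test sketch; the pool class name follows the threadpool 0.2.5 documentation, and you may need to link Boost.Thread as shown):

cat > tp_test.cpp <<'EOF'
// Minimal smoke test: create a pool with two worker threads and exit.
#include <boost/threadpool.hpp>
int main() {
    boost::threadpool::pool tp(2);
    return 0;
}
EOF
g++ tp_test.cpp -o tp_test -lboost_thread -lpthread && ./tp_test && echo "threadpool OK"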

  • Installing MPICH2

Download and install MPICH2 from “http://www.mpich.org/?s=downloads”. You can get MPICH2 from Ubuntu’s software repository, but we prefer to build the release from MPICH2’s website. We install MPICH2 under “/home/ubuntu/mpich2”; you should modify the “--prefix” option to match your home directory:

wget http://www.mpich.org/static/tarballs/3.0.2/mpich-3.0.2.tar.gz
tar -xvf mpich-3.0.2.tar.gz
cd mpich-3.0.2/
./configure --disable-fc --disable-f77 --prefix=/home/ubuntu/mpich2
make
make install
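
You can verify the build by running a trivial MPI job with the freshly installed binaries (assuming the “--prefix” used above):

~/mpich2/bin/mpiexec -n 2 hostname     # should print your hostname twice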

You need to install MPICH2 on all the other workers. If the other workers have identical hardware and operating systems, you can simply copy the MPICH2 folder to them with the following commands:

scp -r /home/ubuntu/mpich2 cloud2:
scp -r /home/ubuntu/mpich2 cloud3:
.
.
.
scp -r /home/ubuntu/mpich2 cloudX:

  • Installing Java Development Kit (JDK):

We prefer to use JDK SE 1.6 instead of Ubuntu’s OpenJDK. Download either the latest JDK SE 1.6 from http://www.oracle.com/technetwork/java/javase/downloads/jdk6downloads-1902814.html or “jdk-6u30” from http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html#jdk-6u30-oth-JPR to your home directory and install it with the following shell commands; we assume “jdk-6u30”:

chmod 744 jdk-6u30-linux-x64.bin
./jdk-6u30-linux-x64.bin
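
You can confirm the JDK extracted correctly by printing its version (assuming it unpacked to “jdk1.6.0_30” in your home directory):

~/jdk1.6.0_30/bin/java -version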

You need to install the JDK on all the other workers. As with MPICH2, you can copy the JDK folder to the other workers using the following command:

scp -r /home/ubuntu/jdk1.6.0_30 cloud2:

  • Installing Hadoop

Mizan requires access to MapReduce and HDFS. We prefer to use the Hadoop release available on Apache’s website instead of the software repository. Download and unzip Hadoop 1.0.4 in your home directory using the following commands:

wget http://mirrors.isu.net.sa/pub/apache/hadoop/common/hadoop-1.0.4/hadoop-1.0.4.tar.gz
tar -xvf hadoop-1.0.4.tar.gz

The next set of steps is a simple example of how to configure Hadoop on your Ubuntu machine. Please refer to the tutorials “Running Hadoop on Ubuntu Linux (Single-Node Cluster)” and “Running Hadoop on Ubuntu Linux (Multi-Node Cluster)” for more information about Hadoop’s configuration.

You have to define Java’s environment variable “JAVA_HOME” in Hadoop’s environment file. First, open the file “hadoop-env.sh” with the following command:

nano hadoop-1.0.4/conf/hadoop-env.sh

Then, uncomment “export JAVA_HOME=” and set it to the path of your Java installation. We assume Java “jdk1.6.0_30” is installed in your home directory; you might need to modify the path according to your Java installation:

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.  Required.
export JAVA_HOME=/home/ubuntu/jdk1.6.0_30

Open the “masters” file:

nano hadoop-1.0.4/conf/masters

Replace the content of “masters” with the hostname of your main worker and save the file. We assume that our master worker is called “cloud1”:

cloud1

Open the “slaves” file:

nano hadoop-1.0.4/conf/slaves

Replace the content of “slaves” with the hostnames of all workers and save the file. You can put only the hostname of your main worker if you are installing Hadoop on a single machine:

cloud1
cloud2
cloud3
.
.
.
cloudX

Now we will modify Hadoop’s configuration files. First, let’s create a data folder for Hadoop in the home directory; you can create this folder at any path if you don’t want it in your home directory:

mkdir ~/hadoop_data

Then open the file “core-site.xml” by running the command:

nano hadoop-1.0.4/conf/core-site.xml

Add the following lines to the file and save it:


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/home/ubuntu/hadoop_data</value>
        </property>
        <property>
                <name>fs.default.name</name>
                <value>hdfs://cloud1:54310</value>
        </property>
</configuration>

Now, open the file “mapred-site.xml” with the command:

nano hadoop-1.0.4/conf/mapred-site.xml

Add the following lines to the file and save it:


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
                <name>mapred.job.tracker</name>
                <value>cloud1:54311</value>
        </property>
</configuration>

Then, open the file “hdfs-site.xml” with the command:

nano hadoop-1.0.4/conf/hdfs-site.xml

Add the following lines to the file and save it. We set “dfs.replication” to 1 so that HDFS stores a single copy of each block; on a multi-node cluster you would typically use a higher value (the Hadoop default is 3):


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>
</configuration>

If you have more than one worker, you need to copy Hadoop’s folders to the other workers using the following commands:

scp -r hadoop_data cloud2:
scp -r hadoop-1.0.4 cloud2:

scp -r hadoop_data cloud3:
scp -r hadoop-1.0.4 cloud3:
.
.
.
scp -r hadoop_data cloudX:
scp -r hadoop-1.0.4 cloudX:

Now, let’s format Hadoop’s namenode:

cd hadoop-1.0.4/bin
./hadoop namenode -format

We are done configuring Hadoop. If you want to start or stop Hadoop’s services, you can use the following commands:

cd hadoop-1.0.4/bin
./start-all.sh
./stop-all.sh
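
After “./start-all.sh”, you can check that all Hadoop daemons came up and query HDFS’s status (a quick sanity check; we call jps through the full JDK path since PATH is only configured in the next section):

~/jdk1.6.0_30/bin/jps        # expect NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker
./hadoop dfsadmin -report    # prints HDFS capacity and the list of live datanodes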

Note that you need to have configured password-less SSH access to the other workers, as shown at the beginning of this tutorial.

  • Configuring environment variables

Mizan’s Makefile depends on environment variables for MPI, Java, Hadoop and Boost. Assuming the user name is “ubuntu”, open the file “/home/ubuntu/.bashrc” using the following command:

nano /home/ubuntu/.bashrc

Then add the following lines to the end of the file. Don’t forget to use the correct paths for Java and Hadoop if you used versions different from the ones in this tutorial:


export MPI_HOME=/home/ubuntu/mpich2
export JAVA_HOME=/home/ubuntu/jdk1.6.0_30
export HADOOP_HOME=/home/ubuntu/hadoop-1.0.4
export BOOST_ROOT=/usr/include/boost

export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/amd64/server:$HADOOP_HOME/c++/Linux-amd64-64/lib/:$BOOST_ROOT/lib
export CLASSPATH=$HADOOP_HOME/lib/commons-configuration-1.6.jar:$HADOOP_HOME/lib/commons-lang-2.4.jar:$HADOOP_HOME/lib/commons-logging-api-1.0.4.jar:$HADOOP_HOME/hadoop-core-1.0.4.jar:$HADOOP_HOME/conf
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$MPI_HOME/bin:$PATH

You also need to modify the “.bashrc” file on the other workers. You can copy the already modified “.bashrc” file to the other workers using the following commands:

scp .bashrc cloud2:
scp .bashrc cloud3:
.
.
.
scp .bashrc cloudX:

After saving the file, reload your environment by running:

source /home/ubuntu/.bashrc
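
You can verify that the environment took effect by checking that the variables are set and the tools resolve from your PATH:

echo $MPI_HOME $JAVA_HOME $HADOOP_HOME
which mpirun java hadoop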

Downloading and Compiling Mizan

wget http://mizan-graph-bsp.googlecode.com/files/Mizan-0.1bu1.tar.gz
tar -xvf Mizan-0.1bu1.tar.gz
cd Mizan-0.1b/Release/
make clean; make all

If you configured all of the required environment variables and applications, you won’t see any error messages when you compile Mizan. Now you are ready to run Mizan. Before running it, we need to put a graph into HDFS and partition it; we provide a set of shell scripts that partition the input graph and place it in HDFS for Mizan. Make sure Hadoop is running by executing “start-all.sh”:

start-all.sh

We provide sample graphs in “Mizan-0.1b/preMizan/exampleGraphs”. Now let’s partition the graph “web-Google.txt” into two partitions:

cd ../preMizan
./preMizan.sh ./exampleGraphs/web-Google.txt 2

You will see the following prompt in the terminal. Type “1” to select hash-based graph partitioning:

Select your partitioning type:
   1) Hash Graph Partitioning
   2) Range Graph Partitioning
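
Once partitioning finishes, you can confirm that the partitions landed in HDFS (the output path below assumes the user name “ubuntu”; it matches the path Mizan reports in its log later in this tutorial):

hadoop dfs -ls /user/ubuntu/m_output/mizan_web-Google.txt_mhash_2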

Now, let’s run PageRank on the “web-Google.txt” graph with two workers, given that web-Google.txt has been pre-partitioned using hash-based partitioning and that the Linux user name is “ubuntu”:

cd ../Release
mpirun -np 2 ./Mizan-0.1b -u ubuntu -g web-Google.txt -w 2

To get more details on the implementation of PageRank on Mizan, please visit https://thegraphsblog.wordpress.com/the-graph-blog/mizan/#Example. You will see on your screen something similar to the following output snippet:

Rank Rank 10 started Successfully...
 started Successfully...
!!!Hello World -- I am Mizan, nice meeting you XD !!!
	-Using HDFS as input source with file: /user/ubuntu/m_output/mizan_web-Google.txt_mhash_2/part-r-00000, HDFS status = 0
	-Using HDFS as input source with file: /user/ubuntu/m_output/mizan_web-Google.txt_mhash_2/part-r-00001, HDFS status = 0
 tmpStorage.size() = 0 for file /user/ubuntu/m_output/mizan_web-Google.txt_mhash_2/part-r-00001
-----TIME: Data Loading for 0 = 15.14
 tmpStorage.size() = 0 for file /user/ubuntu/m_output/mizan_web-Google.txt_mhash_2/part-r-00000
-----TIME: Data Loading for 0 = 20.01
Starting user init..
PE ---------- Starting Superstep 1 ----------
0 -----TIME: userInit() Running Time = 0
PE0 Processing SS time = 8.15
vertices = 437904 inedges = 0 outedges = 2551991
PE0 mc size = 530590 out from = 2551991 time = 4.2
PE0 comm time = 4.29
PE0 -----TIME: superStep() Running Time = 8 Vertex Running Time = 0
PE1 -----Messages: Actual Finish = 8 Global in comm = 0/530825 Global out Comm = 0 Memory Rem = 33
PE0 -----Messages: Actual Finish = 8 Global in comm = 0/531269 Global out Comm = 0 Memory Rem = 33
PE0 -----Messages: Actual Finish = 8 -----
PE0 -----Messages: Min Mem = 33 -----
---------- Starting Superstep 2 ----------
PE0 Processing SS time = 5.65
. . .
. . .
. . .
---------- Starting Superstep 21 ----------
PE0 Processing SS time = 0.5
vertices = 437904 inedges = 531269 outedges = 0
PE0 mc size = 0 out from = 0 time = 0
PE0 comm time = 0
PE0 -----TIME: superStep() Running Time = 0 Vertex Running Time = 0
-----TIME: Total Running Time without IO = 126
-----TIME: Total Running Time = 138
!!!bye bye -- terminating Mizan, see you later!!!

If you want to run Mizan in a distributed environment, you have to copy Mizan’s binary to all the other workers and specify a “machines” file that contains your hostnames. Let’s create a new file called “machines”:

nano machines

Write each worker’s hostname in your cluster on its own line, and save the file:

cloud1
cloud2
cloud3
.
.
.
cloudX
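
You can sanity-check the “machines” file before running Mizan by asking MPI to run a trivial command across several hosts (this assumes password-less SSH and that MPICH2 is installed at the same path on every worker):

mpiexec -f machines -n 3 hostname     # should print one hostname per process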

Copy Mizan’s binary to the same path on the other workers; we assume that you decompressed Mizan in your home directory. First, create the same Mizan directories on the other workers:

ssh cloud2 mkdir -p /home/ubuntu/Mizan/Mizan-0.1b/Release
ssh cloud3 mkdir -p /home/ubuntu/Mizan/Mizan-0.1b/Release
.
.
.
ssh cloudX mkdir -p /home/ubuntu/Mizan/Mizan-0.1b/Release

Then copy Mizan’s binary to the other workers; you need to do this after each recompilation of Mizan:

scp /home/ubuntu/Mizan/Mizan-0.1b/Release/Mizan-0.1b cloud2:/home/ubuntu/Mizan/Mizan-0.1b/Release
scp /home/ubuntu/Mizan/Mizan-0.1b/Release/Mizan-0.1b cloud3:/home/ubuntu/Mizan/Mizan-0.1b/Release
.
.
.
scp /home/ubuntu/Mizan/Mizan-0.1b/Release/Mizan-0.1b cloudX:/home/ubuntu/Mizan/Mizan-0.1b/Release

Now execute the following command to run Mizan in a distributed environment; note the use of the “machines” file in the MPI parameters:

cd ../Release
mpirun -f machines -np 2 ./Mizan-0.1b -u ubuntu -g web-Google.txt -w 2
