Hadoop cluster

We have set up a Hadoop cluster on the new blade servers at ILP to support the Big_Data project. There are 48 server nodes and one master node. Each blade has 8 cores, 8 GB of RAM, and 300 GB of disk.

The main webpage for the Hadoop project is available here. We're currently running Hadoop version 0.18.0.

To access the cluster, first log in to hadoop-client. Hadoop is installed in /usr/local/hadoop/hadoop.

 mymachine:~> ssh hadoop-client
 hadoop-client:~> cd /usr/local/hadoop/hadoop

If you are not connected to the bigdata network, you can access it via ssh through bigdata.intel-research.net.

 mymachine:~> ssh <username>@bigdata.intel-research.net
 bigdata:~> ssh hadoop-client
 hadoop-client:~> cd /usr/local/hadoop/hadoop

Using Hadoop from the command line

Almost everything you'll want to do with Hadoop (e.g., getting data to and from the dfs, starting and stopping jobs) is accomplished via the command line by running the "bin/hadoop" program. Running "bin/hadoop dfs" accesses the filesystem functions, "bin/hadoop job" lets you monitor and kill jobs, and "bin/hadoop jar" runs code from a jar file. Running any of these without further command line options will print out usage information. For example:

 hadoop-client:/usr/local/hadoop/hadoop> bin/hadoop dfs
 Usage: java FsShell
          [-fs <local | namenode:port>]
          [-conf <configuration file>]
          [-D <property=value>]
          [-ls <path>]
          [-lsr <path>]
          [-du <path>]
          [-dus <path>]
          [-mv <src> <dst>]
          [-cp <src> <dst>]
          [-rm <path>]
          [-rmr <path>]
          [-put <localsrc> <dst>]
          [-copyFromLocal <localsrc> <dst>]
          [-moveFromLocal <localsrc> <dst>]
          [-get [-crc] <src> <localdst>]
          [-getmerge <src> <localdst> [addnl]]
          [-cat <src>]
          [-copyToLocal [-crc] <src> <localdst>]
          [-moveToLocal [-crc] <src> <localdst>]
          [-mkdir <path>]
          [-setrep [-R] <rep> <path/file>]

You can see the files in your dfs home directory (/user/<username>):

 hadoop-client:/usr/local/hadoop/hadoop> bin/hadoop dfs -ls
 Found 1 items
 /user/swschlos/complete_works_of_shakespeare.txt           5582655

Yes, this command-line interface to the dfs is hokey, but it's the best they've got so far. You can also browse the dfs via the web.

Running the example programs

Hadoop comes with a jar file of example programs. Here is how to run them from the command line:

 hadoop-client:/usr/local/hadoop/hadoop> bin/hadoop jar hadoop-0.12.3-examples.jar
 An example program must be given as the first argument.
 Valid program names are:
   grep: A map/reduce program that counts the matches of a regex in the input.
   pi: A map/reduce program that estimates Pi using monte-carlo method.
   randomwriter: A map/reduce program that writes 10GB of random data per node.
   sort: A map/reduce program that sorts the data written by the random writer.
   wordcount: A map/reduce program that counts the words in the input files.

wordcount is a good example program to try first. It counts the number of times each word appears in a set of input text files, and it is one of the examples described in the Google MapReduce paper.
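Conceptually, wordcount's map step emits one (word, 1) pair per word, the shuffle groups identical words together, and the reduce step sums each group. The same flow can be sketched serially with standard Unix tools (a rough analogy, not how Hadoop actually runs it):

```shell
# map: emit one word per line; shuffle: group identical words (sort);
# reduce: count each group (uniq -c)
printf 'to be or not to be\n' | tr -s '[:space:]' '\n' | sort | uniq -c
# prints:  2 be / 1 not / 1 or / 2 to
```

The difference, of course, is that Hadoop runs the map and reduce steps in parallel across the cluster, with the dfs holding the input and output.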

I've put a copy of the complete works of Shakespeare in my home directory, which you can copy for testing.

 hadoop-client:/usr/local/hadoop/hadoop> bin/hadoop dfs -mkdir shakespeare
 hadoop-client:/usr/local/hadoop/hadoop> bin/hadoop dfs -cp /user/swschlos/complete_works_of_shakespeare.txt shakespeare
 hadoop-client:/usr/local/hadoop/hadoop> bin/hadoop jar hadoop-0.12.3-examples.jar wordcount shakespeare shakespeare-output

This will create many output files, one per reduce task that was run. For me it was 211 reduces:

 hadoop-client:/usr/local/hadoop/hadoop> bin/hadoop dfs -ls shakespeare-output
 Found 211 items
 /user/swschlos/shakespeare-output/part-00000          2280
 /user/swschlos/shakespeare-output/part-00001          2196
 ...
 /user/swschlos/shakespeare-output/part-00210          2202

There are several ways to get data out of the dfs to your local filesystem, but probably the easiest in this case is to just cat all of the files together:

 hadoop-client:/usr/local/hadoop/hadoop> bin/hadoop dfs -cat shakespeare-output/* > ~/shakespeare-output.txt
 hadoop-client:/usr/local/hadoop/hadoop> grep -i prithee ~/shakespeare-output.txt
 prithee,        60
 prithee.        9
 tale.-Prithee   1
 Prithee 37
 prithee 61
 prithee;        4
 prithee?        1
 Prithee,        43
 prithee.'       1
 and-prithee     1

Congrats - you've learned that "prithee", counting all its case and punctuation variants above, appears 218 times in Shakespeare.
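Rather than adding up the variants by eye, you can sum the count column with awk. A small sketch on an inline three-line sample (run the same grep | awk pipeline on ~/shakespeare-output.txt to get the real total):

```shell
# Sum the second column (the counts) of every line matching "prithee",
# case-insensitively. The here-document stands in for the real output file.
grep -i prithee <<'EOF' | awk '{ sum += $2 } END { print sum }'
prithee,        60
Prithee 37
prithee 61
EOF
# prints 158
```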

Accessing the status webpages

If you're not on the bigdata network, you can access at least some of these pages by tunneling over ssh:

 mymachine:~> ssh -f -N -L 50030:hadoop-client:50030 -L 50070:hadoop-client:50070 <username>@bigdata.intel-research.net

This will create two tunnels, one for the JobTracker (port 50030) and one for the NameNode (port 50070).

Once the tunnels are set up, point your web browser at http://localhost:50030 for the JobTracker status and http://localhost:50070 for the filesystem status. Unfortunately, you won't be able to reach everything through those tunnels: links that go directly to other machines in the cluster or over other ports (e.g., browsing the dfs or reading the detailed logs) won't work. For that, you'll have to be on the ILP network.

Writing Hadoop MapReduce programs

The Hadoop Wiki describes the basics of MapReduce and provides a few example programs. All of these example programs are in the hadoop-0.12.3-examples.jar jarfile, and their source is in /usr/local/hadoop/hadoop/src/examples/org/apache/hadoop/examples. Alas, Hadoop's documentation is mediocre at best, so starting with code is probably most useful.