We have set up a Hadoop cluster on the new blade servers at ILP to support the Big_Data project. There are 48 server nodes and one master node. Each blade has 8 cores, 8GB of RAM, and 300GB of disk.
The main webpage for the Hadoop project is available here. We're currently running Hadoop version 0.18.0.
To access the cluster, first login to hadoop-client. Hadoop is installed in /usr/local/hadoop/hadoop.
mymachine:~> ssh hadoop-client
hadoop-client:~> cd /usr/local/hadoop/hadoop
If you are not connected to the bigdata network, you can access it via ssh through bigdata.intel-research.net.
mymachine:~> ssh <username>@bigdata.intel-research.net
bigdata:~> ssh hadoop-client
hadoop-client:~> cd /usr/local/hadoop/hadoop
[...]
Using Hadoop from the command line
Almost everything you'll want to do with Hadoop (e.g., getting data to and from the dfs, starting and stopping jobs) is accomplished via the command line by running the "bin/hadoop" program. Running "bin/hadoop dfs" accesses the filesystem functions, "bin/hadoop job" lets you monitor and kill jobs, and "bin/hadoop jar" runs code from a jar file. Running any of these without further command line options will print out usage information. For example:
hadoop-client:/usr/local/hadoop/hadoop> bin/hadoop dfs
Usage: java FsShell
           [-fs <local | namenode:port>]
           [-conf <configuration file>]
           [-D <[property=value>]
           [-ls <path>]
           [-lsr <path>]
           [-du <path>]
           [-dus <path>]
           [-mv <src> <dst>]
           [-cp <src> <dst>]
           [-rm <path>]
           [-rmr <path>]
           [-expunge]
           [-put <localsrc> <dst>]
           [-copyFromLocal <localsrc> <dst>]
           [-moveFromLocal <localsrc> <dst>]
           [-get [-crc] <src> <localdst>]
           [-getmerge <src> <localdst> [addnl]]
           [-cat <src>]
           [-copyToLocal [-crc] <src> <localdst>]
           [-moveToLocal [-crc] <src> <localdst>]
           [-mkdir <path>]
           [-setrep [-R] <rep> <path/file>]
You can see the files in your dfs home directory (/user/<username>) with "bin/hadoop dfs -ls":
hadoop-client:/usr/local/hadoop/hadoop> bin/hadoop dfs -ls
Found 1 items
/user/swschlos/complete_works_of_shakespeare.txt
Yes, this command-line interface to the dfs is hokey, but it's the best they've got so far. You can also browse the dfs via the web.
Running the example programs
Hadoop comes with a jar file of example programs. Here is how to run them from the command line:
hadoop-client:/usr/local/hadoop/hadoop> bin/hadoop jar hadoop-0.12.3-examples.jar
An example program must be given as the first argument.
Valid program names are:
  grep: A map/reduce program that counts the matches of a regex in the input.
  pi: A map/reduce program that estimates Pi using monte-carlo method.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  sort: A map/reduce program that sorts the data written by the random writer.
  wordcount: A map/reduce program that counts the words in the input files.
wordcount is a good example program to try first. It counts the number of times each word appears in a set of input text files. Wordcount is one of the example programs described in the Google MapReduce paper.
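Conceptually, wordcount's map phase emits a (word, 1) pair for every token, the framework groups those pairs by word, and the reduce phase sums the counts. Here is a minimal local sketch of that flow in Python (an illustration of the model only, not Hadoop code; the real example is Java code in the examples jar):

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every whitespace-separated token.
    for word in line.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Sum all counts emitted for this word.
    return (word, sum(counts))

def wordcount(lines):
    # "Shuffle": group the mapped pairs by key, as the framework
    # does between the map and reduce phases.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            grouped[word].append(count)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(wordcount(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The real thing differs mainly in that the map and reduce calls run in parallel across the cluster and the grouping happens over the network.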
I've put a copy of the complete works of Shakespeare in my home directory, which you can copy for testing.
hadoop-client:/usr/local/hadoop/hadoop> bin/hadoop dfs -mkdir shakespeare
hadoop-client:/usr/local/hadoop/hadoop> bin/hadoop dfs -cp /user/swschlos/complete_works_of_shakespeare.txt shakespeare
hadoop-client:/usr/local/hadoop/hadoop> bin/hadoop jar hadoop-0.12.3-examples.jar wordcount shakespeare shakespeare-output
This will create many output files, one from each reduce task that was run. For me it was 211 reduces:
hadoop-client:/usr/local/hadoop/hadoop> bin/hadoop dfs -ls shakespeare-output
Found 211 items
/user/swschlos/shakespeare-out/part-00000    2280
/user/swschlos/shakespeare-out/part-00001    2196
[...]
/user/swschlos/shakespeare-out/part-00210    2202
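Each reduce task writes its own part-NNNNN file, and a given word always lands at the same reduce because intermediate keys are assigned by the default partitioner: hash the key, take it modulo the number of reduces. A rough Python illustration of that assignment (the `partition` helper is hypothetical, and Python's string hash differs from Java's, so the exact file numbers won't match Hadoop's):

```python
def partition(key, num_reduces):
    # Default-partitioner idea: hash the key and take it modulo the
    # number of reduce tasks, so every occurrence of the same word
    # goes to the same reducer (and the same part-NNNNN file).
    return hash(key) % num_reduces

num_reduces = 211
for word in ["prithee", "Prithee", "hamlet"]:
    print(word, "-> part-%05d" % partition(word, num_reduces))
```

This is also why the output files are all roughly the same size: a decent hash spreads the keys evenly across the 211 reduces.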
There are several ways to get data out of the dfs to your local filesystem, but probably the easiest in this case is to just cat all of the files together:
hadoop-client:/usr/local/hadoop/hadoop> bin/hadoop dfs -cat shakespeare-output/* > ~/shakespeare-output.txt
hadoop-client:/usr/local/hadoop/hadoop> grep -i prithee ~/shakespeare-output.txt
prithee,        60
prithee.        9
tale.-Prithee   1
Prithee         37
prithee         61
prithee;        4
prithee?        1
Prithee,        43
prithee.'       1
and-prithee     1
Congrats - you've learned that "prithee" appears, in one form or another, 218 times in Shakespeare.
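Note that wordcount treats each distinct token ("prithee," vs. "prithee") as a separate word, so getting one total means summing across the variants yourself:

```python
# Per-variant counts from the grep output above.
counts = {
    "prithee,": 60, "prithee.": 9, "tale.-Prithee": 1, "Prithee": 37,
    "prithee": 61, "prithee;": 4, "prithee?": 1, "Prithee,": 43,
    "prithee.'": 1, "and-prithee": 1,
}
print(sum(counts.values()))
# 218
```

A smarter mapper would strip punctuation and lowercase each token before emitting it, collapsing all of these variants into one key.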
Accessing the status webpages
If you're not on the bigdata network, you can access at least some of the status pages by tunneling over ssh:
mymachine:~> ssh -f -N -L 50030:hadoop-client:50030 -L 50070:hadoop-client:50070 <username>@bigdata.intel-research.net
This will create two tunnels, one for the JobTracker (port 50030) and one for the NameNode (port 50070).
Once the tunnels are set up, direct your web browser to http://localhost:50030 for the JobTracker status and http://localhost:50070 for the filesystem status. Unfortunately, you won't be able to access everything available on the web via those tunnels - you can't access links that go directly to other machines in the cluster or over other ports (e.g., browsing the dfs and accessing the detailed logs). For that, you'll have to be on the ILP network.
Writing Hadoop MapReduce programs
The Hadoop Wiki describes the basics of MapReduce and provides a few example programs. All of these example programs are in the hadoop-0.12.3-examples.jar jarfile, and their source is in /usr/local/hadoop/hadoop/src/examples/org/apache/hadoop/examples. Alas, Hadoop's documentation is mediocre at best, so starting with code is probably most useful.
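All of the Java examples follow the same contract: a mapper that turns each input record into intermediate key/value pairs, and a reducer that folds the values collected for each key. The following Python sketch shows the shape of that contract (an illustration only; the actual interfaces are Mapper and Reducer in org.apache.hadoop.mapred, and the class and method names below are made up for this sketch):

```python
class GrepMapper:
    # Shape of the Mapper contract: map(key, value, output) emits
    # zero or more intermediate (key, value) pairs into output.
    def __init__(self, pattern):
        self.pattern = pattern

    def map(self, key, value, output):
        # key: byte offset of the line; value: the line's text.
        if self.pattern in value:
            output.append((self.pattern, 1))

class SumReducer:
    # Shape of the Reducer contract: reduce(key, values, output)
    # folds all intermediate values collected for one key.
    def reduce(self, key, values, output):
        output.append((key, sum(values)))

# Tiny local driver wiring the two together, for illustration only.
mapper, reducer = GrepMapper("dfs"), SumReducer()
intermediate = []
for offset, line in enumerate(["the dfs is up", "no match here", "dfs again"]):
    mapper.map(offset, line, intermediate)
out = []
reducer.reduce("dfs", [v for _, v in intermediate], out)
print(out)  # [('dfs', 2)]
```

In real Hadoop code the mapper and reducer are classes you register in a JobConf, the keys and values are Writable types rather than plain Python objects, and the framework, not a driver loop, feeds them records.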