Global Monitoring

Monitoring across multiple Open Cirrus sites is done via Ganglia. The central point of data collection is in the Big Data cluster at ILP. A Ganglia web interface displaying some machine statistics is available here.

Below are instructions for deploying Ganglia at your site, along with a description of the additional metrics we added to our Ganglia installation.

Installing Ganglia

We have installed Ganglia 3.1.2 because the version that is the default on most distros (2.5.x) lacks the ability to spoof metrics. The Ganglia website mentions that it should be possible to combine data from different versions, but we have not tested this yet. A spoofed metric is one that is provided by one host but applied to another. The description of our monitoring of power usage below gives an example where this is useful.

If you choose to install 3.1.x, you'll likely need to get it from somewhere besides your operating system's default repository. You can either compile it from source, or, if you are using 64-bit Ubuntu Hardy (8.04), download packages that I have built:

If you choose to install a version from the 2.5.x series, you can likely just install it using yum, apt-get, or whichever package manager you prefer.

Configuring Ganglia

A quick description of what Ganglia consists of may be useful here. There are two primary components to a Ganglia installation, gmond and gmetad. gmond collects metrics about the machine it runs on, gathers statistics from other machines on the network, and produces XML output containing all of those metrics for consumption by gmetad (or other tools). gmetad collects statistics from a gmond and stores them in RRDs to provide historical information. It is also queried by the web interface when it performs lookups.
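To make the gmond side of this concrete, here is a minimal sketch of reading the XML that gmond writes and grouping metric values by host. It assumes the usual layout of HOST elements containing METRIC elements with NAME and VAL attributes; the function names and the sample document are hypothetical and only for illustration.

```python
import socket
import xml.etree.ElementTree as ET


def read_gmond_xml(host, port=8649, timeout=5.0):
    """Connect to gmond's TCP port and return everything it writes before closing."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after dumping its XML
                break
            chunks.append(data)
    return b"".join(chunks)


def metrics_by_host(xml_bytes):
    """Map each host name to a {metric name: value string} dictionary."""
    root = ET.fromstring(xml_bytes)
    result = {}
    for host in root.iter("HOST"):
        result[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL")
            for metric in host.iter("METRIC")
        }
    return result


# Hypothetical, heavily trimmed sample of gmond output (assumed element layout).
SAMPLE = (
    b'<GANGLIA_XML VERSION="3.1.2" SOURCE="gmond">'
    b'<CLUSTER NAME="example">'
    b'<HOST NAME="node1" IP="10.0.0.1">'
    b'<METRIC NAME="load_one" VAL="0.25" TYPE="float"/>'
    b'</HOST>'
    b'</CLUSTER>'
    b'</GANGLIA_XML>'
)

if __name__ == "__main__":
    print(metrics_by_host(SAMPLE))
```

Against a live node, `metrics_by_host(read_gmond_xml("some-node"))` would do the same thing with real data, where `some-node` stands in for one of your own hosts.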

Typically, it is not necessary to adjust the default configuration provided with Ganglia. If you're going to use the default multicast address and simply expose port 8649 (the default port on which gmond listens and dumps XML), it will likely be enough to install gmond on every node in a cluster and open a hole in the firewall for TCP access to port 8649 on at least one internal node. It must be possible for us to connect to your gmond (or gmetad) instance in order for the gmetad running here to access your site.
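A quick way to confirm that the firewall hole works is to attempt a TCP connection from outside your site. The following is only a sketch; `can_connect` is a hypothetical helper name, and the host you test against is whichever node you chose to expose.

```python
import socket


def can_connect(host, port=8649, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Running `can_connect("your-exposed-node")` from a machine outside the firewall should return True once port 8649 is reachable.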

There are some configuration changes that may need to be made, however. At our site, the default multicast IP was changed because another user of the cluster already had an installation of Ganglia running and we didn't want to mix the data. Additionally, the frequency of metric collection was increased at our site. If your site has multiple multicast domains, it will be necessary to either give access to at least one machine's gmond in each domain or run a gmetad installation yourself and then expose it to our gmetad (gmetad by default uses TCP ports 8651 and 8652).
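If you run your own gmetad over several multicast domains, each domain typically becomes one `data_source` line in gmetad.conf. The sketch below only formats such lines; the domain names and host names are hypothetical, and the 15-second polling interval is an assumed default that you should check against your own gmetad.conf.

```python
def data_source_line(name, hosts, port=8649, interval=15):
    """Format one gmetad.conf data_source line for a single multicast domain."""
    targets = " ".join(f"{host}:{port}" for host in hosts)
    return f'data_source "{name}" {interval} {targets}'


if __name__ == "__main__":
    # Hypothetical site with two multicast domains, one reachable gmond in each.
    domains = {
        "rack-a": ["node-a1"],
        "rack-b": ["node-b1", "node-b2"],
    }
    for name, hosts in domains.items():
        print(data_source_line(name, hosts))
```

Listing more than one host per domain gives gmetad a fallback if the first gmond is unreachable.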

Additional Steps (optional)

We have added additional metrics to Ganglia in order to monitor other properties of machines. One rich source of machine data is IPMI, which can provide thermal, power, and system error data, depending on your hardware configuration. These values can easily be converted into metrics using Ganglia's gmetric command line tool and a simple script. In addition, we collect power usage data via SNMP for some nodes and spoof it as a metric for those machines (this is why we needed something newer than Ganglia 2.5.x). Using sar to gather metrics is also useful if you are interested in disk performance. Virtually any machine metric can be provided to Ganglia via the gmetric command line tool.
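As a rough illustration of the "simple script" idea, the sketch below builds a gmetric command line, optionally spoofing the metric onto another host. The option names follow gmetric 3.1.x as far as we know them (check `gmetric --help` on your install); the helper names, metric name, IP address, and host name are hypothetical.

```python
import subprocess


def gmetric_command(name, value, metric_type="float", units="", spoof=None):
    """Build the argument list for one gmetric call.

    spoof, if given, is an (ip, hostname) pair naming the machine the
    metric should be attributed to instead of the reporting host.
    """
    cmd = [
        "gmetric",
        f"--name={name}",
        f"--value={value}",
        f"--type={metric_type}",
    ]
    if units:
        cmd.append(f"--units={units}")
    if spoof is not None:
        ip, hostname = spoof
        cmd.append(f"--spoof={ip}:{hostname}")
    return cmd


def publish(cmd):
    """Run the gmetric command; raises CalledProcessError if gmetric fails."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical power reading collected elsewhere (for example via SNMP)
    # and attributed to node12. Printed rather than published here.
    print(gmetric_command("power_watts", 212.5, units="watts",
                          spoof=("10.0.0.12", "node12")))
```

Keeping command construction separate from execution makes it easy to log or dry-run the calls before sending anything to Ganglia.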

In addition, many software packages, including Hadoop, Maui/Torque, and Tashi, have user-written add-ons that allow job data to be collected as Ganglia metrics.