This is partially for my own reference, and partially for the benefit of Googlers out there.
I have a load of SuSE 10.1 machines (58 SunFire X4100s, as it happens) running in three clusters. I'd like to be proactive about monitoring them for hardware failures, as it's a pain in the butt to run upstairs and look at them every day, then try to decipher what that particular flashy light is trying to tell me. We have a Nagios installation already (one monitoring the head nodes, and each cluster has or will get its own installation on the head nodes to monitor the compute nodes), so naturally I'd like to make use of that.
I found check_ipmi_sensors.pl (nagiosexchange is a great site, despite its unfortunate nomenclature), so that takes care of the server side. I had some difficulty interpreting its output at first, but eventually sorted it out. The plugin needs ipmitool: my primary Nagios server runs FreeBSD, so "pkg_add -r ipmitool" took care of it there, and on the SuSE boxes it was "rug install ipmitool".
SuSE doesn't load the IPMI sensor modules by default, nor did I find an init script, but for 10.1, the following commands worked:

# modprobe ipmi_msghandler
# modprobe ipmi_devintf
# modprobe ipmi_si
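
Since there's no init script, here's a rough sketch of what one could drop into a boot-time script (I'm assuming /etc/init.d/boot.local is the right spot on SuSE; adjust to taste). The function name load_ipmi_modules and the MODPROBE override for dry runs are my own inventions, not anything SuSE or ipmitool ships.

```shell
#!/bin/sh
# Sketch: load the three IPMI modules in the order used above,
# stopping at the first one that fails.
# MODPROBE can be overridden (e.g. MODPROBE=echo) for a dry run;
# that knob is an assumption of this sketch, not a standard variable.
MODPROBE="${MODPROBE:-modprobe}"

load_ipmi_modules() {
    for mod in ipmi_msghandler ipmi_devintf ipmi_si; do
        # $MODPROBE is deliberately unquoted so "modprobe -v" also works
        $MODPROBE "$mod" || {
            echo "failed to load $mod" >&2
            return 1
        }
    done
    return 0
}
```

Run as root, "load_ipmi_modules" should do the same thing as the three commands above; with MODPROBE=echo it only prints the module names.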

After that, some useful commands to know are:
ipmitool sensor
ipmitool sdr list
and (for instance)
ipmitool sensor get pdb.t_amb
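
With a few dozen sensors per box, it helps to filter the list down to the ones that are complaining. This is only a sketch: it assumes "ipmitool sensor" prints pipe-separated columns with the sensor name in column 1 and the status in column 4, and the function name non_ok_sensors is made up for illustration.

```shell
#!/bin/sh
# Sketch: read `ipmitool sensor` style output on stdin and print
# "name: status" for sensors in a threshold state (nc, cr or nr).
# Assumes pipe-separated columns: name in column 1, status in column 4.
non_ok_sensors() {
    awk -F'|' '
        NF >= 4 {
            name = $1
            status = $4
            # strip the padding around each field
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            # only report threshold states; "ok", "na" and anything
            # else (e.g. discrete sensor values) are skipped
            if (status ~ /^(nc|cr|nr)$/)
                print name ": " status
        }
    '
}
```

Usage would be something like "ipmitool sensor | non_ok_sensors".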
The Nagios Perl plugin could use some work: it returns CRITICAL when the sensor status is "nc" (non-critical). For instance:

# ipmitool sensor get pdb.t_amb
Locating sensor record...
Sensor ID : pdb.t_amb (0x1b)
Entity ID : 19.0
Sensor Type (Analog) : Temperature
Sensor Reading : 32 (+/- 0) degrees C
Status : Upper Non-Critical
Lower Non-Recoverable : 0.000
Lower Critical : 0.000
Lower Non-Critical : 0.000
Upper Non-Critical : 32.000
Upper Critical : 37.000
Upper Non-Recoverable : 42.000
Assertions Enabled : ucr+ unr+
Deassertions Enabled : ucr+ unr+
#

But:

# perl check_ipmi_sensors -H myhostname -u myuserid -p mypassword
IPMI_SENSORS CRITICAL - pdb.t_amb: nc
#

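Methinks that should return WARNING, since "nc" only means the upper non-critical threshold (32 degrees here) has been reached, not the critical one (37). Once I get a better handle on how to fit everything together, I'll either fix the script myself and send patches to the author, or just make the suggestion to him.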
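
To spell out the mapping I have in mind, here's a minimal shell sketch. It is not how check_ipmi_sensors works internally (that's Perl, and I haven't patched it); the function name ipmi_status_to_nagios is hypothetical, and the only facts it leans on are the usual Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Sketch: translate an IPMI threshold status abbreviation into a
# Nagios plugin exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).
ipmi_status_to_nagios() {
    case "$1" in
        ok)    echo 0 ;;   # within thresholds
        nc)    echo 1 ;;   # non-critical: the case that should be WARNING
        cr|nr) echo 2 ;;   # critical or non-recoverable
        *)     echo 3 ;;   # anything unrecognised: UNKNOWN
    esac
}
```

A wrapper could then do something like exit "$(ipmi_status_to_nagios nc)" so that Nagios shows a warning rather than paging someone.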
(Updated to fix typos.)

