Nagios Plugins

The Nagios monitoring system is a great example of Free Software: powerful, flexible and easy to extend. While it comes with a lot of functionality out of the box, you’ll occasionally want to write your own Nagios plugin, which isn’t too hard to do.

The plugins below have all been written by me to help keep my systems under control. They’re functional, released under the GPL and hopefully useful to someone other than me. You can find the source code for each in my Nagios Plugins Github repo.

In addition to these plugins I’ve also written some trending tools to help give you a higher level perspective of what your systems are doing and where your attention should be focused.

Check Ages. The Check Ages check accepts a basedir argument, a glob of files to match and the maximum age, in days, that items are allowed to reach. If any of them break the threshold it generates a critical that lists all the items in violation.

My version differs from the one bundled with Nagios plugins in three ways (two major, one minor): it allows directories to be the target (which is fixed in newer versions of the Nagios check), it works on more than one file at a time and its targets can be controlled via a glob.

I originally wrote this for our log centralisation server. If any of the machines don’t send a message in a certain amount of time then I want to know about it. This check basically fulfils that need without requiring me to add a check for every possible directory (which gets messy once you start adding in service directories and other log groups).
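As a rough illustration of the Check Ages logic described above, here is a minimal Python sketch. It is not the actual plugin: the function names, the message format and the use of modification time as the "age" are all assumptions made for the example. Exit code 2 is the standard Nagios CRITICAL status.

```python
import glob
import os
import time

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def find_stale(basedir, pattern, max_age_days, now=None):
    """Return sorted paths under basedir matching pattern that are too old.

    Age is measured by modification time (an assumption for this sketch).
    Directories match as well as files, since glob returns both.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    matches = glob.glob(os.path.join(basedir, pattern))
    return sorted(p for p in matches if os.path.getmtime(p) < cutoff)


def check_ages(basedir, pattern, max_age_days, now=None):
    """Return (exit_code, message) in the usual Nagios plugin style."""
    stale = find_stale(basedir, pattern, max_age_days, now)
    if stale:
        return CRITICAL, "AGES CRITICAL: " + ", ".join(stale)
    return OK, "AGES OK: nothing older than %d days" % max_age_days
```

A wrapper script would print the message and exit with the returned code, for example after calling check_ages with a hypothetical log directory, a glob of "*" and a threshold of 1 day.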

Check Cert Expiry. While there are many Nagios checks that work on a certificate used across the network (such as in https or secure mail), I couldn't find one I liked for checking the time remaining before expiry on a local cert. So I present the Check Cert Expiry Nagios check.

It’s simple, requires certtool and GNU date, and should have been written in perl, although playing with date was interesting in its own way. Now hopefully I can stop getting myself locked out of VPNs…
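The date arithmetic at the heart of such a check can be sketched in a few lines. The example below is Python rather than certtool plus GNU date, and it assumes the certificate's expiry time has already been extracted. The function name and the 30 and 7 day thresholds are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timezone

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def check_expiry(not_after, warn_days=30, crit_days=7, now=None):
    """Return (exit_code, message) for a certificate expiring at not_after.

    not_after must be a timezone-aware datetime. The default thresholds
    are assumptions for this sketch, not the plugin's own values.
    """
    now = now or datetime.now(timezone.utc)
    days_left = (not_after - now).days  # negative once the cert has expired
    message = "%d days until expiry" % days_left
    if days_left <= crit_days:
        return CRITICAL, "CERT CRITICAL: " + message
    if days_left <= warn_days:
        return WARNING, "CERT WARNING: " + message
    return OK, "CERT OK: " + message
```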

PkgWatcher. When it comes to servers, some packages should be everywhere, some should be banned and there are always the edge cases - be it a build host that requires GCC or a webserver that needs a full complement of packaged perl modules. While a decent system imaging or ad-hoc change system will help keep the discrepancies down, nothing beats a system level check that verifies your assumptions. And PkgWatcher is that check.

PkgWatcher was designed to run under Nagios but works just as well as an ad-hoc command line tool, although without centralised management, keeping the required and prohibited lists up-to-date and in sync could become a hassle. Some notes: it understands the RPM and DPKG packaging systems (and it’s pretty easy to add additional ones), it’s written in pure-perl (so it’s easy to move around) and it’s quite forgiving. If a package isn’t on its required or prohibited lists then it does nothing about it. This is both because I’m pragmatic (a good deployment strategy is a better solution to keeping hundreds of machines in check) and because the environment I’ve written it for has a lot of legacy systems, where being overly strict means you never gain any ground.
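The required and prohibited comparison boils down to two set operations. The following Python sketch shows the idea. It is not PkgWatcher itself (which is perl), the function name and message wording are made up for the example, and treating any violation as CRITICAL is an assumption.

```python
OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_packages(installed, required, prohibited):
    """Compare installed package names against required and prohibited lists.

    Packages that appear on neither list are deliberately ignored.
    """
    installed = set(installed)
    missing = sorted(set(required) - installed)
    banned = sorted(set(prohibited) & installed)
    problems = []
    if missing:
        problems.append("missing: " + ", ".join(missing))
    if banned:
        problems.append("prohibited: " + ", ".join(banned))
    if problems:
        return CRITICAL, "PKG CRITICAL: " + "; ".join(problems)
    return OK, "PKG OK"
```

In practice the installed list would come from the packaging system, RPM or DPKG, and the other two lists from whatever files you keep in sync across hosts.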

For full details read the PkgWatcher announcement post over at my blog.

Check Mounted Disks. Ever been bitten by the "disappearing partition"? It's there, it's accessible, and it should be persistent across boots... But it isn't! The machine reboots and then you discover that the database partition is no longer visible.

The check mounted disks Nagios plugin looks at the mounted partitions and compares them to what’s in /etc/fstab (minus a couple of things like CD drives, floppy disks, swap partitions etc) and warns if there are any discrepancies that’ll bite you on a reboot. It also round trips and makes sure what /etc/fstab thinks is mounted is actually there.
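The two-way comparison can be sketched as below. This is an illustrative Python version, not the plugin's code. It relies on /etc/fstab and /proc/mounts sharing the same leading columns (device, mount point, type, options). The list of ignored filesystem types, the handling of noauto and the function names are assumptions.

```python
# Illustrative list of filesystem types to skip; the real plugin's list may differ.
IGNORED_TYPES = {"swap", "iso9660", "udf", "proc", "sysfs", "tmpfs", "devtmpfs", "devpts"}


def mount_points(table_text):
    """Parse fstab-style text (also the layout of /proc/mounts) into mount points."""
    points = set()
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0].startswith("#"):
            continue  # blank line, comment or malformed entry
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in IGNORED_TYPES or "noauto" in options.split(","):
            continue  # not expected to be mounted after a reboot
        points.add(mount_point)
    return points


def compare_mounts(fstab_text, mounts_text):
    """Return (in fstab but not mounted, mounted but not in fstab)."""
    expected = mount_points(fstab_text)
    actual = mount_points(mounts_text)
    return sorted(expected - actual), sorted(actual - expected)
```

A plugin built on this would exit with the Nagios WARNING status (1) if either list is non-empty. The second list contains the partitions that would vanish on the next reboot.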

Check Disk Checks Checker. You start off with a couple of partitions. You add a MySQL instance and put it on a new logical volume. You break its logging out to a different volume group for performance reasons. You take a snapshot for query tuning and mount that. You add a chunk of disk for a short experiment you were going to try... Thanks to legacy, laziness and easy-to-use LUNs you eventually end up with more mount points than you know what to do with. And at the worst possible moment one of them will fill and you'll discover you forgot to add it to Nagios for monitoring. Or you inherit a bundle of crack-fueled servers that have been "evolved" and never gifted with decent monitoring.

The check_disk_checker.pl script was written to help find mount points that you’re not monitoring. It scans through your local Nagios NRPE config files, looks at your current mount points, and complains about any mounted partitions that are not being checked according to the local NRPE configuration files. Of course there is nothing to say that what you have locally is what the remote Nagios is polling but that’s outside the scope of this post.

check_disk_checker.pl shells out and grabs a list of all mounted partitions. It then pulls a list of check-disk lines out of any config files matching nrpe*.conf or nrpe*.cfg (our local naming scheme) that live in /etc/nagios. It then extracts the partitions each one checks (it grabs the value following a -p argument) and complains if it doesn’t find a check for each mounted partition. The script can be run under Nagios as a plugin or stand-alone for help controlling a legacy system.
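The extraction step described above can be sketched like this. It is a Python illustration rather than the perl script itself, and it assumes the NRPE command lines invoke a plugin whose name contains check_disk and pass each partition after a -p argument. The function names are hypothetical.

```python
import re


def checked_partitions(config_text):
    """Collect every value following a -p argument on check_disk lines."""
    checked = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "check_disk" not in line:
            continue  # skip comments and unrelated commands
        checked.update(re.findall(r"-p\s+(\S+)", line))
    return checked


def unchecked_partitions(mounted, config_text):
    """Return the mounted partitions that no check_disk line covers."""
    return sorted(set(mounted) - checked_partitions(config_text))
```

The real script gathers the config text from the nrpe*.conf and nrpe*.cfg files in /etc/nagios and the mounted list by shelling out. Here both are passed in as plain values to keep the example self-contained.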

Check Open Ports. A machine should listen on a defined set of ports; if any of them are not listening you've got a problem. If any others are open then you've potentially got an even bigger problem. The Check Open Ports Nagios Check accepts a list of IPv4 TCP and UDP ports and reports if any of the expected ones go away or any others are detected as listening.

This also partially scratches one of my own itches: I’ve had a couple of daemons (MySQL in particular) start after a package upgrade without my knowing it. With this script and a little cron it won’t happen again. It’s probably worth mentioning that while this script is built to run within Nagios it will work stand-alone.

Note: this script is more for detecting misconfigurations than for security. Most rootkits mask the ports they’ve opened so they won’t appear through netstat, which this script uses.
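To make the expected-versus-listening comparison concrete, here is a small Python sketch. It is not the plugin's code. It assumes the input is the text printed by something like netstat -lntu (the exact flags are an assumption), considers only IPv4 lines, and treats any mismatch as CRITICAL, which is also an assumption.

```python
OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def listening_ports(netstat_text):
    """Parse netstat-style output into a set of (protocol, port) pairs."""
    ports = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        # IPv4 only: skip headers and tcp6/udp6 lines.
        if len(fields) < 4 or fields[0] not in ("tcp", "udp"):
            continue
        port = fields[3].rsplit(":", 1)[-1]  # local address is the fourth column
        if port.isdigit():
            ports.add((fields[0], int(port)))
    return ports


def check_ports(expected, listening):
    """Report expected ports that are closed and open ports nobody expected."""
    missing = sorted(expected - listening)
    unexpected = sorted(listening - expected)
    if not missing and not unexpected:
        return OK, "PORTS OK: %d expected ports listening" % len(expected)
    parts = []
    if missing:
        parts.append("not listening: " + ", ".join("%s/%d" % p for p in missing))
    if unexpected:
        parts.append("unexpected: " + ", ".join("%s/%d" % p for p in unexpected))
    return CRITICAL, "PORTS CRITICAL: " + "; ".join(parts)
```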

Check Debian Packages. Ever needed to see, via the Nagios web front end, which Debian machines need their packages updating? I did, so I wrote the check_debian_updates.sh Nagios plugin. This is the initial release (which hasn't been hit too hard yet) so be careful about deploying it anywhere but your testing environment for now. I've played with it in my small test environment and it seems to work so feel free to have a look at it. I'll be stressing it, and possibly tidying the code up a little, next week.

In its basic operation, the script just reports how many packages (if any) need updating and returns a CRIT or a WARN to Nagios based upon your thresholds. If you call the script with a -v it will also output the names of all packages that need updating, which may consume a lot of Nagios front end screen real estate. Because it runs apt-get update it needs some root privileges. I’ll be setting up sudo to let the Nagios user run both apt-get update and apt-get upgrade -s (note the ‘-s’ for simulation) as root with no password. And only for those!

Check Local Mail. One of the annoyances of my (working) life is the build up of mail in obscurely named mailboxes on different machines. While the typical aim is to have all hosts sending their local mail to a central point (for mass filtering and deleting^Wlogging) you - firstly - have to actually implement this change (normally on machines with lots of different mailservers - yum!) and then add a check to ensure that it never gets broken in the future.

I wrote a script that helps with both of these tasks: the Check Local Mail Nagios Script, which does what the name suggests. Once deployed and running with Nagios it flags any mailboxes with contents. It can both help pinpoint the noisiest machines and serve as a longer term configuration check: if you get the mail working and it gets broken in the future then the check’ll flag it up again and you can just follow the red spots in Nagios. It’s not a complete solution (you also want to check where the machine thinks it’s mailing to remotely and that mail actually arrives) but it’s a simple first step.
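Flagging mailboxes with contents can be as simple as looking for non-empty files in the spool directory. The Python sketch below shows one way this might look. It is not the actual script: the one-file-per-mailbox layout, the function names and the choice of WARNING as the status are assumptions.

```python
import os

OK, WARNING = 0, 1  # standard Nagios plugin exit codes


def nonempty_mailboxes(spool_dir):
    """Return the names of mailbox files in spool_dir that contain mail."""
    found = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            found.append(name)
    return found


def check_local_mail(spool_dir):
    """Return (exit_code, message) listing any mailboxes with contents."""
    boxes = nonempty_mailboxes(spool_dir)
    if boxes:
        return WARNING, "MAIL WARNING: mail waiting for " + ", ".join(boxes)
    return OK, "MAIL OK: no local mail"
```

On many Linux systems the spool directory would be /var/mail or /var/spool/mail, but that varies, so it is left as a parameter.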

Check Linux Free Memory. The most hacky of the bunch, included here both for completeness and because we use it at work, is the Check Linux Free Memory Nagios plugin. It does exactly what the name implies but it's got less than intuitive option handling and almost no documentation. It will get cleaned up one day soon. Honest... On the plus side it does work.