Nagios Wrapped Puppet Runs

<tl;dr>Log nrpe-runner state changes when puppet runs to see what broke or was fixed.</tl;dr>

While people most often use puppet to configure and repair their infrastructures, they sometimes also inadvertently use it to damage and cripple them. As part of my attempt to reduce the mean time to spot a mistake across my systems, I’ve come up with a handful of small scripts that let me wrap a puppet run in a Nagios NRPE-powered safety net.

One of the lesser-known features introduced in Puppet 0.25.4 (and still valid in 2.6) is the pair of prerun_command and postrun_command hooks. These two config settings let you specify one command to run at the beginning of a puppet run (if it fails, the run does not happen) and another to run at the end. While they were originally devised to make integration with etckeeper simpler, we can also use them to add some additional monitoring to our runs.

We’ve already covered my nrpe-runner, which lets you run Nagios checks locally for immediate feedback, but now let’s expand the idea a little for puppet integration. Our plan is simple: invoke nrpe-runner and gather the output, run puppet, re-run nrpe-runner and see which checks puppet has fixed or broken.

First of all we deploy nrpe-runner, our nrperunner-json-differ and the wrapper script below, which runs once puppet has finished.

$ cat nrpe-wrapper

#!/bin/bash
# Snapshot the check states after the run, then log each change as its own syslog line.
/home/deanw/puppet-wrapper/nrpe-runner -j > /tmp/post_puppetrun
/home/deanw/puppet-wrapper/nrperunner-json-differ /tmp/pre_puppetrun /tmp/post_puppetrun \
  | logger -t "puppet-nrpe"
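
If you just want a feel for the comparison the differ performs, it can be sketched roughly like this. Note the assumptions: the function name diff_states is made up, and instead of the JSON that nrpe-runner -j produces it reads a simplified "check_name status" line format, so treat it as a stand-in rather than the real nrperunner-json-differ.

```shell
#!/bin/bash
# diff_states PRE POST
# Hypothetical stand-in for the differ. ASSUMED input format (not the real
# nrpe-runner JSON): one "check_name status" pair per line in each file.
diff_states() {
    awk '
        # First file: remember the status each check had before the run.
        NR == FNR { before[$1] = $2; next }
        # Second file: report checks whose status is different afterwards.
        ($1 in before) && before[$1] != $2 {
            printf "%s changed from %s to %s\n", $1, before[$1], $2
        }
    ' "$1" "$2"
}
```

Checks that only appear in one of the two files are silently ignored here; a real differ would probably want to report those too.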

We then add the config to puppet.conf’s main section. While it’s possible to insert longer lines for each command and skip the wrapper script, puppet is a little fiddly about these settings and a separate script is easier to use.

$ cat /etc/puppet/puppet.conf
[main]
  ... snip ...
    prerun_command  = /home/deanw/puppet-wrapper/nrpe-runner -j > /tmp/pre_puppetrun
    postrun_command = /home/deanw/puppet-wrapper/nrpe-wrapper
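
If the redirection in the prerun_command line gives you trouble, one option is a second tiny wrapper for the pre-run snapshot. This is only a sketch: the function name snapshot_checks and the script name nrpe-prewrapper are made up for illustration. It deliberately ignores the runner's exit status, because a prerun_command that exits non-zero fails the puppet run, and we don't want an already-broken check to block the run that might fix it.

```shell
#!/bin/bash
# Hypothetical pre-run wrapper (say, saved as nrpe-prewrapper; name is an assumption).
# snapshot_checks RUNNER OUTFILE
#   Runs "RUNNER -j" and stores its output in OUTFILE.
snapshot_checks() {
    local runner="$1" out="$2"
    "$runner" -j > "$out" 2>/dev/null
    # Always succeed: a non-zero exit here would stop the puppet run.
    return 0
}

# Paths as used elsewhere in this post; uncomment when installing as a script.
# snapshot_checks /home/deanw/puppet-wrapper/nrpe-runner /tmp/pre_puppetrun
```

prerun_command would then point at the wrapper script instead of carrying the redirect itself.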

Now that we’ve done all the prep (and, if needed, restarted puppet), let’s break something and see if we get both a fix and confirmation:

# stop something we know puppet will fix.
$ /etc/init.d/mcollective stop

$ puppetd -vt
info: Retrieving plugin
 .. snip ...
notice: //mcollective::server/Service[mcollective]/ensure: ensure changed 'stopped' to 'running'
notice: Finished catalog run in 5.51 seconds

# see if we logged the fix... we did!
$ tail -n 1 /var/log/messages
Mar 21 22:07:21 lb03-dynm puppet-nrpe: mcollective_procs changed from 2 to 0
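
The bare numbers in that log line are easier to read with names attached. Assuming they are the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), which fits a stopped service being fixed, a small helper can translate them. The function name status_name is made up for this sketch.

```shell
#!/bin/bash
# status_name CODE
# Hypothetical helper: map a Nagios plugin exit code to its conventional name.
status_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "INVALID($1)" ;;   # anything else is outside the plugin convention
    esac
}
```

Read that way, "changed from 2 to 0" is mcollective_procs going from CRITICAL back to OK.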

While our simple wrapper just sends the output directly to syslog, hopefully you’ve got an idea of how powerful this integrated, immediate feedback can be. While it’s always been possible for us to dig back through the logs and spot something breaking after a puppet run, by explicitly wrapping the run we can cut down the investigation time while also providing information for later review and discussion.