Wed, 09 Mar 2011
Introducing NRPE Runner
It might be a sign that I spend too much time online, but the quicker a system gives me feedback the more useful I find it. While I love knowing my Nagios safety net has me covered when making changes, sometimes waiting for that CGI to refresh can take too long, especially if I'm taking an iterative, test-driven approach to the changes I'm making. For those use cases I wrote nrpe-runner.
The way I typically use Nagios is to have the Nagios server run the
checks on the remote host via the NRPE plugin. The checks to be run on the host are
normally stored in a config file with each entry looking like this:
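(the plugin path and thresholds below are only illustrative)

```
command[check_swap]=/usr/lib/nagios/plugins/check_swap -w 20% -c 10%
```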
While this allows you to run each check individually to confirm that it's still OK, I wanted the ability to run all the commands in the file at once, which I can now do with nrpe-runner. If everything's fine then it exits silently; to confirm that it's actually run I can summarise, and even filter, the checks to run:
# show everything as it's run, whatever the return status
$ /usr/local/sbin/nrpe-runner -a
check_swap => SWAP OK - 100% free (16041 MB out of 16041 MB) |swap=16041MB;12031;9624;0;16041
... snipped ...
freemem => OK: 12% (1732M) free memory.

# show a summary
$ /usr/local/sbin/nrpe-runner -s
Ran 39 checks - OK 39. WARN 0, CRIT 0, UNKNOWN 0

# run any checks with ntp in the name (the part between the [ and ])
$ /usr/local/sbin/nrpe-runner -s -n ntp
Ran 3 checks - OK 3. WARN 0, CRIT 0, UNKNOWN 0

# run all process checks (checks the command after the '=')
$ /usr/local/sbin/nrpe-runner -s -c proc
Ran 17 checks - OK 17. WARN 0, CRIT 0, UNKNOWN 0

# show all checks named ntp
$ /usr/local/sbin/nrpe-runner -a -n ntp
ntp_skew_primary => NTP OK: Offset -0.003149271011 secs|offset=-0.003149s;5.000000;9.000000;
ntp_process => PROCS OK: 1 process with command name 'ntpd', args '-u ntp:ntp'
ntp_skew_secondary => NTP OK: Offset -0.002887368202 secs|offset=-0.002887s;5.000000;9.000000;
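nrpe-runner itself isn't shown here, but the core idea is small enough to sketch. This is a minimal, hypothetical reimplementation in Python - it assumes the standard command[name]=command line NRPE config format and maps exit codes to Nagios states; the real tool does rather more:

```python
import re
import subprocess

STATES = {0: "OK", 1: "WARN", 2: "CRIT"}  # anything else is UNKNOWN
ENTRY = re.compile(r'^command\[(?P<name>[^\]]+)\]=(?P<cmd>.+)$')

def run_checks(config_path, name_filter=None):
    """Run every NRPE command in config_path, return {name: (state, output)}."""
    results = {}
    with open(config_path) as config:
        for line in config:
            match = ENTRY.match(line.strip())
            if not match:
                continue  # skip comments and other directives
            name, cmd = match.group("name"), match.group("cmd")
            if name_filter and name_filter not in name:
                continue
            proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            state = STATES.get(proc.returncode, "UNKNOWN")
            results[name] = (state, proc.stdout.strip())
    return results

def summary(results):
    """One line in the same spirit as the -s flag's output."""
    counts = {s: 0 for s in ("OK", "WARN", "CRIT", "UNKNOWN")}
    for state, _ in results.values():
        counts[state] += 1
    return "Ran %d checks - OK %d. WARN %d, CRIT %d, UNKNOWN %d" % (
        len(results), counts["OK"], counts["WARN"],
        counts["CRIT"], counts["UNKNOWN"])
```

Pointing run_checks at an nrpe.cfg and printing summary() of the result gives the kind of one-line confirmation shown above.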
nrpe-runner also has the option to dump the results as JSON, which I'll be exploring a little further in my next couple of blog posts. While it's not exactly the same as having the checks run by Nagios (the user and environment are often different), I've found that shortening the interval between running puppet or yum and seeing the Nagios feedback has helped my work-flow quite a lot when making exploratory system changes - and even more when nothing should have changed but does...
Wed, 07 Apr 2010
Pigz - Shortening backup times with parallel gzip
While searching for a completely different piece of software I stumbled on to the pigz application, a parallel implementation of gzip for modern multi-processor, multi-core machines. As some of our backups include a gzip step to conserve space I decided to see if pigz could be useful in speeding them up.
Using remarkably unscientific means (I just wanted to know if it's worth further investigation) I ran a couple of sample compression runs. The machine is a quad core Dell server, the files are three copies of the same 899M SQL dump and the machine is lightly loaded (and mostly in disk IO).
#######################################
# Timings for two normal gzip runs
dwilson@pigztester:~/pgzip/pigz-2.1.6$ time gzip 1 2 3

real 2m43.429s
user 2m39.446s
sys  0m3.988s

real 2m43.403s
user 2m39.582s
sys  0m3.808s

#######################################
# Timings for three pigz runs
dwilson@pigztester:~/pgzip/pigz-2.1.6$ time ./pigz 1 2 3

real 0m46.504s
user 2m56.015s
sys  0m4.116s

real 0m46.976s
user 2m55.983s
sys  0m4.292s

real 0m47.402s
user 2m55.695s
sys  0m4.256s
Quite an impressive speed up considering all I did was run a slightly different command. The post-compression sizes are pretty much the same (258M when compressed by gzip and 257M with pigz) and you can gunzip a pigz'd file and get back a file with the same md5sum.
# before compression
-rw-r--r-- 1 dwilson dwilson 899M 2010-04-06 22:12 1
# post gzip compress
-rw-r--r-- 1 dwilson dwilson 258M 2010-04-06 22:12 1.gz
# post pigz compress
-rw-r--r-- 1 dwilson dwilson 257M 2010-04-06 22:12 1.gz
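That gunzip compatibility is less surprising than it looks: gzip files can contain multiple members, and concatenated gzip streams decompress to the concatenated data. pigz is cleverer than this (it emits a single stream), but the property makes a naive parallel compressor easy to sketch - this is purely illustrative, not how pigz works internally:

```python
import gzip
import io
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 128 * 1024  # compress the input in independently-gzipped chunks

def parallel_gzip(data, workers=4):
    """Gzip each chunk concurrently and concatenate the members.

    Threads are enough here because zlib releases the GIL while
    compressing, so the chunks really do compress in parallel.
    """
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        members = list(pool.map(gzip.compress, chunks))
    return b"".join(members)

def gunzip(data):
    """Decompress all members, just like the gunzip command line tool."""
    return gzip.GzipFile(fileobj=io.BytesIO(data)).read()
```

The trade-off is the one visible in the timings above: more wall-clock speed for slightly more total CPU and a marginally different compressed size.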
I'll need to do some more testing, and compare the system's performance to a normal run while the compression is happening, before I trust it in production but the speed ups look appealing and, as it's Mark Adler's code, it looks like it might be an easy win in some of our scripts.
Wed, 30 Sep 2009
Rake - surprisingly enjoyable
I've never really liked makefiles - I don't think I've ever had to write enough C to really appreciate (or just tolerate) them - so I was a little dismissive of Rake. And I was mostly wrong.
Now we're adding a new member to the systems team I've been doing a lot of thinking about our tool chain - what knowledge assumptions it makes, which parts are still more manual than I'd like and where the tool chain has gaps (this is the most annoying one for me) - and rake seemed like a potential addition to encode some of that process knowledge in to a tool. I've only added little rakefiles here and there but they do make certain tasks nicer.
I've not yet worked out any general rules for when to use a shell script and when to use rake, but if nothing else it's helping me spend some time on my ruby skills. The best rake starting points I found were Martin Fowler's rake article and the rake release notes.
Wed, 01 Jul 2009
dstat - a window to your system
When it comes to Unix diagnostics I was raised the old-fashioned way, with iostat, vmstat and similar tools. However times change and tools evolve. dstat, while not as comprehensive as using all the tools one by one, provides a wide range of system performance details in an easy to use package.
While it's useful enough in its default state there is even more functionality lurking just below the surface. To see which other modules are available (but are not enabled by default) run dstat -M list. To add an extra module to the output use a command like dstat -a -M topmem -M topcpu
As part of my growing use of the tool I've started to write my own little dstat plugins. I was pleasantly surprised at how easy they were to write and deploy, even with my basic python skills. While the memcached plugin was a proof of concept I've not needed much, I've found the process count plugin to be very handy.
dstat is becoming one of the overview tools I use when investigating performance issues and it's worthy of a place in your toolbox too.
Mon, 09 Mar 2009
Puppet Scripts - extract-report-issues
I spent a little while digging through the default puppet log types the other day and, after reading through a batch of activity logs, whipped up extract-report-issues, a script that can be run on the command line (or daily via cron) to display a list of errors and warnings from the specified glob of hosts and log files. By default it covers all hosts for the current day; we've got it running nightly so we can work through the issues each morning. It's worth noting that the same failure sometimes appears more than once in the output - this is because puppet retries certain operations, such as retrieving a resource.
There is actually a lot of useful information in the puppet reports. To start with I've added a todo item for a script that notes persistent errors (the same issues over two or three runs) that I'll hopefully get to this month. Maybe.
If you're running puppet in production you owe it to yourself to turn on reporting and set up some processes around it. While puppet makes it easy to perform action at a distance you still need to close the loop somehow.
Tue, 03 Feb 2009
Simple, Single Document Bookmarks in vim
I like vim, I think it's a great editor worth investing time and effort in to learning, but I also think it's one of the most horrible things to watch an inexperienced user typo their way through while you're urgently waiting for them to finish the damn edit. My favourite one this week (and it's only Tuesday) is searching for hopefully-unique phrases so you can later return to a specific part of a document.
In an attempt to stop my laptop getting any more back-of-the-head shaped dents from when I've failed to restrain myself, I thought I should point out a much simpler way of doing this. Once you're at the part of a document you want to return to, press m<letter> to set a mark. To return to it press '<letter>. That's it.
No more pasting in chunks of a string hoping it only occurs once in the damn document. If you need to mark a couple of locations then fine, just use different letters to set and return to the places you want. And save me sending another laptop back in for warranty.
Wed, 14 Jan 2009
Soon to be With Added Git?
Despite setting up my own gitweb install I'm still not using git regularly enough to be comfortable with it, so today I went through the Peepcode Press Git Internals book/PDF. While the diagrams and details of what happens under the covers are useful, it's the wrong level for me as a basic user. To ease myself in to the move from subversion for some of my personal projects I found Git Magic to be more useful.
I know git requires a mental shift and it's a very complex and powerful tool but for my own needs I'll probably never use more than 10% of its capabilities. Unfortunately most of the projects I use and need to submit patches to have switched - so I'll be a happy sheep and go along for the ride. Even if it turns out to be a roller coaster.
Tue, 06 Jan 2009
Diffing Files Over Multiple Servers - rd-differ
Ad hoc changes are a very bad thing in many ways; one of the worst is how often they're not fully implemented across all the servers, or even pulled back to staging. In an attempt to sanity check the config files when we have to make these little hacks I oddly-proudly present rd-differ - a tool for diffing config files over multiple machines.
The idea is simple: you tell it the file or directory you're interested in, specify a single machine as the baseline and then specify a number of others as the machines to check against it. A sample invocation looks like this - rd-differ /etc/apache2 10.10.100.111 10.10.100.112 10.10.100.113 - and the output is shown as a diff.
The files are rsynced down using ssh so your usual keys will work and while the normal output is that of the raw diff it's very easy to wrap the results and add other checks on top of it. The shell's not written to be very defensive (unusual for me) but the code is short enough that it's worth the compromise.
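Strip away the rsync/ssh transport and what's left is just "diff everyone against the baseline". A hypothetical sketch of that inner loop using Python's difflib (the real rd-differ is shell and shells out to diff):

```python
import difflib

def diff_against_baseline(baseline_text, others):
    """Return a unified diff per host whose copy differs from the baseline.

    others is a {hostname: file_contents} mapping - in the real tool
    these would be the rsynced-down copies of the config file.
    """
    reports = {}
    base_lines = baseline_text.splitlines(keepends=True)
    for host, text in sorted(others.items()):
        diff = list(difflib.unified_diff(
            base_lines, text.splitlines(keepends=True),
            fromfile="baseline", tofile=host))
        if diff:  # stay silent when the copies match
            reports[host] = "".join(diff)
    return reports
```

Because the result is just diff text per host, it's easy to wrap with further checks - exactly the property the shell version has.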
Sat, 08 Nov 2008
Rebooting Via Proc and the magic sysreq key
You know what the best way to start the day is? I'm pretty sure it doesn't include a production web server putting its file systems into read-only mode. When this happens most local commands don't work - init, shutdown, telinit and reboot all stop being useful and you have to resort to desperate measures... and here's the desperate measure of the day.
First, check that your system supports the magic sysreq key -
$ cat /proc/sys/kernel/sysrq
1    # nonzero is good
Now you know you have the power to destroy your system through a single incorrect character, have a look at the Red Hat sysrq command reference (you want the 'sysrq' section). We wanted to sync the disks and reboot - your requirements may vary.
root@web02:~# echo s > /proc/sysrq-trigger
root@web02:~# echo b > /proc/sysrq-trigger
# machine reboots
As techniques go this one's a little obscure but it's very useful in the right circumstances.
Sat, 23 Aug 2008
Nagios Service and Hosts stats - Graphed in Munin
We've been hitting some load issues on one of our monitoring machines recently and while it looks like the munin graph generation is the culprit we also decided to keep an eye on how many services and hosts Nagios was checking.
One of the downsides of having a very automated server deployment system is how easy it is to suddenly find yourself with an extra dozen hosts you no longer really need. While each check is quite small and quick, add up the frequent runs and multiply it by a reasonable number of servers and you can soon hit problems.
So as a first step towards keeping an eye on those numbers we now have a munin Nagios hosts plugin and a munin Nagios services plugin that show the total number of hosts and services monitored and the states those resources are in.
Nagios Checks - Validate HTML and Validate Feed
As part of my ongoing attempt to stop myself from silently making mistakes (I don't so much mind the ones I notice) I've added another couple of Nagios Plugins. This time validate_feed and validate_html.
As both of these checks call out to an external, third party resource, if you use them be sure to tweak your Nagios polling interval down to a respectful level.
Thu, 14 Aug 2008
Filter syslog logs with syslogslicer
While digging through a pile of syslog log files recently I needed something a little more data format aware than pure grep. So I present the first version of syslogslicer - a simple perl script that knows a little bit about the syslog log file format.
# some example command lines
syslogslicer -p cron -f program,message /var/log/syslog
# print the program and message for all lines with program 'cron'

syslogslicer -p cron -m hourly /var/log/syslog
# all fields for all lines with program 'cron' and message 'hourly'

syslogslicer -p cron -m hourly -s 20080810100000 -e 20080810123000 /var/log/syslog
# all fields for all lines with program 'cron' and message 'hourly'
# between 20080810100000 and 20080810123000
syslogslicer allows you to filter the output by matching text in the program or log message, only print certain output fields and do basic time based filtering. If you've ever wanted to see all the logs raised by postfix with the word 'database' in them between 10 and 11 am then this might be the tool for you.
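The original is perl, but the parsing trick at its heart fits in a few lines. A hypothetical Python version of the field-splitting, assuming the classic "Mon DD HH:MM:SS host program[pid]: message" syslog layout:

```python
import re

# classic syslog line: "Aug 10 10:00:01 web01 cron[1234]: (root) CMD (...)"
SYSLOG_LINE = re.compile(
    r'^(?P<timestamp>\w{3}\s+\d+ \d\d:\d\d:\d\d)\s+'
    r'(?P<host>\S+)\s+'
    r'(?P<program>[^:\[\s]+)(?:\[(?P<pid>\d+)\])?:\s'
    r'(?P<message>.*)$')

def slice_line(line, program=None, message=None):
    """Parse one line into fields; return None if it fails the filters.

    program must match exactly; message is a substring match,
    mirroring the -p and -m options described above.
    """
    match = SYSLOG_LINE.match(line)
    if not match:
        return None  # not a syslog-shaped line
    fields = match.groupdict()
    if program and fields["program"] != program:
        return None
    if message and message not in fields["message"]:
        return None
    return fields
```

Once the line is a dict of named fields, picking output columns or adding time-range filtering is just dictionary work.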
Nagios - Check Proxy Check
"This script retrieves a URL via a specified proxy server and alerts (using the standard Nagios conventions) if the request fails."
We're running a couple of services through a proxy server for a number of good, and to be honest a couple of not so good but mandated, reasons. The Check Proxy Check Nagios Plugin ensures that we know if the proxy goes down in a way that stops us pulling pages through it.
Wed, 13 Aug 2008
Nagios Disk Check - Mountpoint or Filesystem?
If you mount filesystems under a specific mount point, and monitor them with Nagios, then be sure you understand what happens if the underlying file system goes away. With:
/usr/lib/nagios/plugins/check_disk -w 15% -c 10% -p /a_mount_point
you'll get the value from the containing file system - in this case /. If you'd rather know that your chosen mount point has actually gone away, and that you're no longer checking what you thought you were, then add the -E option to the command. This will turn on exact path matching and catch that kind of error.
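The same trap applies to home-grown scripts: measuring free space on a path tells you nothing if the filesystem has silently unmounted from underneath it. Python's standard os.path.ismount makes the extra guard a one-liner - a small illustrative check, not the real check_disk:

```python
import os
import shutil

def check_disk(path, warn_percent=15, crit_percent=10):
    """Return a (state, message) tuple in the spirit of a Nagios check.

    Refuse to report on a path that isn't actually a mount point -
    otherwise we'd silently measure the containing filesystem
    (e.g. / instead of the mount that went away).
    """
    if not os.path.ismount(path):
        return ("CRITICAL", "%s is not a mount point" % path)
    usage = shutil.disk_usage(path)
    free_percent = usage.free * 100 / usage.total
    if free_percent < crit_percent:
        return ("CRITICAL", "%.0f%% free on %s" % (free_percent, path))
    if free_percent < warn_percent:
        return ("WARNING", "%.0f%% free on %s" % (free_percent, path))
    return ("OK", "%.0f%% free on %s" % (free_percent, path))
```

The ismount test is doing the same job as check_disk's -E flag: failing loudly when the path is just a directory on the parent filesystem.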
Testing the 'Net isn't there with Nagios
We've had to deliberately disable some machines this week to ensure they can't connect out to the internet - we're building testing versions of some of our more restricted secure environments and this is one of the steps.
It was actually easier to do with IPTables than I thought (mostly because I didn't have to do it - my co-worker did) but once the work was done we needed to ensure it didn't accidentally get broken so that networking was functional again. And yes, that's an odd thing to type. So naturally we turned to Nagios and, for my own memory as much as anything else, here is the check we're using:
# put this in the machine's nrpe config file.
/usr/lib/nagios/plugins/negate -t 30 "/usr/lib/nagios/plugins/check_http -w 5 -c 10 -H www.google.com -u /"
In the Nagios 'Status Information' field you'll get a message that looks like this - CRITICAL - Socket timeout after 10 seconds - but the check returns the correct error code, so it's still usable.
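negate's job is tiny: run the wrapped command and swap the OK and CRITICAL exit codes, so "Google is reachable" becomes the failure case. A hypothetical few-line Python equivalent of that idea:

```python
import subprocess
import sys

# Nagios exit codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN
SWAP = {0: 2, 2: 0}  # OK <-> CRITICAL; WARNING and UNKNOWN pass through

def negate(command, timeout=30):
    """Run command, echo its output, return the inverted exit code."""
    try:
        proc = subprocess.run(command, shell=True, capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        print("negate: command timed out")
        return 3  # UNKNOWN - the wrapped check never finished
    sys.stdout.write(proc.stdout)
    return SWAP.get(proc.returncode, proc.returncode)
```

The status text stays as the wrapped check wrote it - which is why the Nagios field above still says CRITICAL even though the state is fine.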
Tue, 12 Aug 2008
I've never really felt as proficient with apt and dpkg as I did with RPM. There always seems to be another option I've never seen before. Luckily there are also big holes in my knowledge of yum to make me feel well rounded.
After reading yum options you may not know exist and spending a while puzzling out how to get the same results in Debian (apt-file seems to be the closest fit but I never got the invocation right) I decided to write dpkg-provides.
It's not packaged, doesn't have a manpage, requires the network and isn't integrated with the existing tools. At least I know how I'd get the information now - from the web. Who'd have thought it?
Note: it's actually quite simple to work out which package provides a file that you've got installed locally (dpkg -S '*/df') - it's more of a pain to probe packages you don't have installed.
Tue, 08 Jul 2008
Dear Lazyweb - Command Line YSlow!
The title pretty much says it all: I'd like a command line version of YSlow! (what is it with Yahoo and !s) that I can run from cron and import in to a nice spreadsheet for trending and site comparisons.
I don't have XUL on my list of things to play with so I'll give it a couple of months and watch someone else implement it. Hopefully.
Mon, 25 Jun 2007
Navigating Commented Config Files
The current trend with config files is to fill them with comments (let's ignore the fact this isn't a substitute for documentation) and, while this is helpful, watching people arrow through them line by line looking for active options drives me nuts.
If you're using vim (as all good people do ;)) you can jump from uncommented directive to uncommented directive by using /^[^#] as a search. Pressing n will then move you to the next uncommented option. And save me from pulling out those precious few hairs I have left.
Sun, 03 Jun 2007
Nagios - Simple Trender
Continuing the release of my Nagios code - here's my Nagios Simple Trender. It parses Nagios logs and builds a horizontal barchart for host outages, service warnings and criticals. It's nothing fancy (and the results are a little unpretty) but it does make the attention seeking services and hosts very easy to find.
While the tool isn't that technically complex I've found it useful in justifying my time on certain parts of the infrastructure. Being able to show how bad NTP is, for example (we had 216 NTP sync problems last month, this month we had 36; and most of those are one machine with a bad clock), on a very simple chart makes it easier to get buy-in from above. And next month you can show them how much of a positive impact the work had.
The Nagios Tag Cloud
We use the Nagios monitoring system at work (in fact we use four installs of it for physically isolated networks) and while it's damn useful (and service checks are easy to create or extend) it's a little lacking in higher level trending and visualisation tools. Well, at least the very old version we run suffers from this.
Thankfully I work for a company that invests time in its core tools. Over the last couple of hackdays I've written two small scripts for parsing Nagios logfiles and presenting the information in a different, slightly more grouped way. The first of these is the Nagios TagCloud - which has a very descriptive name :)
When invoked (I typically use nagiosclouds.pl /log/files/*.log > /webdir/nagios_tagcloud.html from a cronjob) it'll run through the log files and produce a HTML page containing 3 tag clouds: one for host outages, one for service warnings and one for service criticals. Tag clouds don't suit everyone's work style but I came away from running ours with a couple of action points, so I think they're useful enough to glance at once in a while.
I should note the perl module that generates the tag cloud is Leon Brocard's HTML::TagCloud and the CSS was graciously given to me by Alex Monney after he burned his eyes looking at my first version.
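The counting half of the script is the easy part; a hypothetical Python sketch of turning a pile of alert names into cloud-sized buckets (the real script is perl and leaves the rendering to HTML::TagCloud):

```python
from collections import Counter

def cloud_sizes(names, min_size=10, max_size=40):
    """Map each name to a font size proportional to how often it alerts.

    names is a flat list of host or service names, one entry per
    alert pulled from the Nagios logs.
    """
    counts = Counter(names)
    largest = max(counts.values())
    sizes = {}
    for name, count in counts.items():
        scale = count / largest  # the noisiest entry gets max_size
        sizes[name] = int(min_size + (max_size - min_size) * scale)
    return sizes
```

Feed it the extracted outage names and the attention-seeking hosts and services come out, literally, in the largest type.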