In the T-DOSE Zabbix talk, which I’m happy to say was both well presented and showed some interesting features, I got called out for a quote I made on Twitter (which just goes to show - you never know where what you said is going to show up and haunt you) about the relevance, and I’d say overemphasis, of the GUI to the Zabbix monitoring system - and to monitoring systems in general. Read on →

One antipattern I’m seeing with increasing frequency is that of obese (or fat, or bloated) system provisioning. It seems as common in people who are just getting used to having an automated provisioning system and are enthusiastic about its power as it is in longer-term users who have added layer upon layer of cruft to their host builder. The basic problem is that of adding too much work and intelligence to the actual provisioning stage. Read on →

One of my little side projects is moving an old website, configured in little steps over a long period of time, from Apache 1.3 to a much more sensible Apache 2.2 server. I’ve been thinking about how to get the most out of the testing I need to do for the move, and so today I decided to do some yak shaving: write some simple regression tests, play with cucumber-nagios and rspec matchers, and write a little Ruby. Read on →
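Even before reaching for cucumber-nagios, the core of a migration regression test is just “fetch the same page from both servers and compare”. A minimal sketch in shell - the fetches are stubbed out with fixed strings so the comparison logic is visible, and the commented-out hostnames are placeholders:

```shell
# In real use these would be something like:
#   old=$(curl -s http://old-server.example.com/page)
#   new=$(curl -s http://new-server.example.com/page)
old='Hello, world'
new='Hello, world'

if [ "$old" = "$new" ]; then
  echo "PASS: page identical on both servers"
else
  echo "FAIL: pages differ"
fi
```

One check per migrated URL and you have a crude but honest regression suite.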

I’m a fan of documentation; over the years I’ve ended up supporting more than one business-critical system that has less documentation than you get from a `cat /dev/null`. The only downside - and I’ve been bitten by a couple of things like this over the last week - is the case of the spreadsheet vs the post-it note: if you have a lovely, well-formatted and information-dense spreadsheet that says “A is 1”, and when you get to the server room the switch has a post-it, in bad scrawl, that says “B is 2”, which do you believe?


We’ve recently been searching for a junior sysadmin to join the team (and I’m very happy to say we’ve now succeeded), so as part of my day-to-day tasks I had to come up with a dozen simple questions to weed out the people that have never used anything but webmin (and there is a surprising number of them out there). One of the questions seemed to cause a lot of trouble in the general sense and tripped up the few who even made an attempt - “How would you change all occurrences of one string to another in a text file?” Read on →
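For the record, the canonical answer is sed’s substitute command with the global flag. A hypothetical example - the strings foo and bar stand in for whatever the question actually used:

```shell
printf 'foo one\nfoo two and foo three\n' > /tmp/demo.txt

# s/…/…/g replaces every occurrence on each line, not just the first
sed 's/foo/bar/g' /tmp/demo.txt > /tmp/demo-new.txt

cat /tmp/demo-new.txt
```

GNU sed’s `-i` will edit the file in place, and `perl -pi -e 's/foo/bar/g'` is an equally acceptable answer.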

When it comes to the list of problems ‘our uptimes are too high’ isn’t normally in the top five that sysadmins dread. While having a lengthy uptime used to be a boasting point, it can also hide technical issues - such as kernel upgrades you’ve applied but not enabled (unless you’re running something special like ksplice), confidence gaps in high availability systems (when was the last time you did a failover?) and a general worry that what’s running on a host now may not be when it comes back up. Read on →
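One concrete symptom - a kernel installed but never booted - is easy to check for by comparing the running kernel against the newest one on disk. A sketch with canned, hypothetical version strings (on a live box they would come from `uname -r` and a listing of /lib/modules):

```shell
running='2.6.18-5-686'           # would come from: uname -r
installed='2.6.18-4-686
2.6.18-5-686
2.6.18-6-686'                    # would come from: ls /lib/modules

# plain sort works here because the versions differ in a single digit;
# GNU sort's -V flag is the robust choice for real version strings
newest=$(printf '%s\n' "$installed" | sort | tail -1)

if [ "$running" != "$newest" ]; then
  echo "reboot pending: running $running but $newest is installed"
fi
```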

We recently consolidated a number of websites used by one of our brands back down to a sensible number (sensible being one) - which, while only a single action point on an email, turned out to be a large amount of DNS and Apache vhost wrangling. In order to give myself a safety net, and an easy-to-check todo list, I decided to invest ten minutes in writing a small test script. Read on →
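The script itself doesn’t need to be clever - a tiny expect-style helper plus one line per DNS record or vhost covers it. A sketch; the real checks, shown commented out, would shell out to dig or curl, and the hostname and address in them are made up:

```shell
fail=0
expect() {  # expect <description> <got> <want>
  if [ "$2" = "$3" ]; then
    echo "ok   - $1"
  else
    echo "FAIL - $1 (got '$2', want '$3')"
    fail=1
  fi
}

# Real checks look like:
#   expect "www A record" "$(dig +short www.example.com)" "192.0.2.10"
#   expect "old host redirects" "$(curl -s -o /dev/null -w '%{http_code}' http://old.example.com/)" "301"
expect "sanity check" "one site" "one site"
echo "failures: $fail"
```

Run it before, during and after the change; an all-ok run is your signal that the wrangling is actually finished.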

At work we both build our own packages and use Puppet to manage our servers. While the developers package up their work, in the systems team we’ve moved more towards deploying programs and their dependencies via Puppet. Although it seems easier, and quicker, to do the pushing that way, at least for scripts, you lose the ability to track what’s responsible for putting each file on the system. I’m probably already modelling the more complex parts of what would be in a package (such as services and cronjobs) in the module, and thanks to Puppet I’m probably doing it in quite a portable way. Read on →
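That lost tracking is exactly what `dpkg -S` gives you on a packaged Debian system: ask any file which package put it there. A quick sketch, with a fallback for files (or whole hosts) dpkg knows nothing about:

```shell
f=/bin/ls

# dpkg -S prints 'package: path' for files a package owns;
# anything Puppet dropped on the box gets the fallback message
owner=$(dpkg -S "$f" 2>/dev/null) || owner="$f: not owned by any package"
echo "$owner"
```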

Nagios has a wonderful ‘duration’ column in its web interface that’s always bemused me. At what point does a check being in a warning, or even worse, a critical state stop being a problem worthy of head space and start being normal operating procedure? Checks can stay in an extended broken state for many reasons but they all seem to be symptoms of a larger problem. If it’s a small thing then are you getting enough time to do housekeeping? Read on →

We recently had an odd one where the Nagios check_http check, which was both checking for the presence of a string in the response and that the page loaded in a certain time frame, went from reporting a ‘CRITICAL - string not found’ to a ‘HTTP WARNING: HTTP/1.1 200 OK’. My first thought, as this was a site pending migration, was that the URL had moved to a slower machine with the fixes released to it. Read on →
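For reference, a single check_http invocation can assert both conditions at once; the check in question was roughly this shape (the hostname, expected string and thresholds below are placeholders), wrapped in a Nagios command definition:

```
define command {
  command_name  check_site_string
  # -s: string that must appear in the body
  # -w/-c: warn/critical response-time thresholds in seconds
  command_line  $USER1$/check_http -H www.example.com -u / -s "Expected string" -w 5 -c 10
}
```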

Logs are a wonderful thing. If done correctly they point out the source of all errors, show you what’s running slow and contain useful information on how your system is running. At every place I’ve ever worked they’ve been busy, full of odd one-offs and too often overlooked. I’m going to be doing a fair bit of log processing next week, so expect lots of little toolchain scripts like syslog-splitter.pl to be checked in to git and mentioned here. Read on →

Now that Chef is out and about, people who accepted Puppet as a massive improvement over all the existing host configuration managers will probably be casting a wary eye its way. I’ve got a little too much in Puppet at my current position to look at moving for a while yet, but now that the competition is rising it’s time to get my boot in and point out what, for me, is the worst part of Puppet: how difficult it is to add new types. Read on →

“Penetration testing is tactical. It provides tangible, actionable information” – Ivan Arce. It’s been a while since I’ve been involved in pen testing, but the above quote from Ivan is perfect and its meaning all too often overlooked. When you invest the time in something like pen testing or performance tuning you should always come away with a list of actionable tasks. By doing this you ensure the work wasn’t pointless (or, if it was, avoid repeating the mistake) and have something you can present to stakeholders to get buy-in for the next time. Read on →

It’s been another day of many DNS changes and while the work itself has been amazingly dull, life-draining scut work, at least one positive thing’s come out of it - my appreciation for the Net::DNS Perl module has grown. While it’s possible to do nearly anything DNS query related with the dig command, it’s a lot easier to extract the data and reuse certain fields if you have access to a decent data structure rather than grepping bits of text out. Read on →
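The difference shows up even in something as small as “find the preferred MX”. With only dig you end up doing text surgery like this (the records are made up, and the dig output is canned here so the pipeline is visible):

```shell
# would come from: dig +short MX example.com
records='20 mail2.example.com.
10 mail1.example.com.'

# sort numerically on preference, take the exchange with the lowest value
best=$(printf '%s\n' "$records" | sort -n | head -1 | awk '{print $2}')
echo "preferred MX: $best"
```

With Net::DNS each answer comes back as an object with preference and exchange accessors, so the same question is a sort over real fields rather than a pipeline over text.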


You’ve gathered the requirements, written the code, debugged it, received the new requirements, rewritten the code, got more change requests, reached a ‘compromise’ with QA (and hidden the bodies) and now you want to have the sysadmins do the release. Don’t be like everyone else - when it comes to releases too many people fail at the last mile and make obvious mistakes. In an attempt to save myself some pain (and have something to point co-workers at) here are some of the software release principles that I hold dear. Read on →

My recent bugbear is servers with inaccessible memory. You go and spec a nice new server with, say, 8GB of RAM (a little box), you install Debian, you start adding applications to the machine, and then a couple of months later some anal sysadmin comes along, does a `free -m` and mutters about under-specced virtualization servers when he sees:

                 total       used       free     shared    buffers     cached
    Mem:          3287        225       3062          0         24        149

For those of you not paying attention - the machine isn’t using over half of its memory. Read on →
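A quick way to spot this class of problem from a script is to compare what the kernel reports against what you paid for. The kernel’s figure below is canned to roughly match the output above (on a live box it would come from /proc/meminfo); the fix on a 32-bit Debian of this vintage was a -bigmem (PAE) or amd64 kernel:

```shell
installed_mb=8192
memtotal_kb=3366912   # would come from: awk '/^MemTotal/ {print $2}' /proc/meminfo
visible_mb=$((memtotal_kb / 1024))

# complain if the kernel sees less than 90% of the fitted RAM
if [ "$visible_mb" -lt $((installed_mb * 9 / 10)) ]; then
  echo "kernel sees only ${visible_mb}MB of ${installed_mb}MB - wrong kernel flavour?"
fi
```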


Autonomics refer to the ability of computer systems to be self-managing. – autonomics.ca Here’s one that has been bothering me. Suppose you have a recurring problem that your “autonomic solution” can handle every time it occurs without anyone knowing. At what point does the fact there is a treatable issue propagate up to a real person? While an automatic “fix and tell me later” approach helps change your work from fire fighting to planned tasks, what classifies a temporary problem as being important enough to warrant you investigating it? Read on →

I’m a sysadmin; half my working life seems to be spent handling other people’s requests (which is why I’m trying to move over to infrastructure work - where I can hopefully concentrate on something for three whole minutes). While chatting with a junior admin at a tech talk in the week, the following three tips came up: Use a ticketing system. This one comes up a lot but it’s true; never dropping someone’s request is well worth the time spent setting it up. Read on →

After my little whine I logged in to do my last checks for the evening to discover that one of our webservers had died due to a hard drive going bang, our production environment Nagios box had lost one of its network connections and a chunk of our SAN kit was complaining about power issues. Turns out that most of these were due to a power surge that killed a network switch and three of the racks’ power strips. Read on →

Here is another one for the sysadmins in the audience. How…

… many of your servers have multiple network ports in the back?
… many of them have bonding (teaming for the Windows people) enabled?
… do you know, when one interface goes down, if the machine stays connected?
… long does it take for you to be notified?
… do you know if they start flapping?
… many have their bonded interfaces plugged in to different switches?

Read on →
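Most of those questions can be answered from the kernel’s bonding status file. A sketch that counts downed links, run here against a canned copy of /proc/net/bonding/bond0 (contents abridged and hypothetical):

```shell
cat > /tmp/bond0.sample <<'EOF'
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up
Slave Interface: eth0
MII Status: up
Slave Interface: eth1
MII Status: down
Link Failure Count: 3
EOF

# the first 'MII Status' line is the bond itself; the rest belong to slaves
down=$(grep -c '^MII Status: down' /tmp/bond0.sample)
echo "slave links down: $down"
```

Point the same grep at the real pseudo-file from a Nagios check and the “how long does it take to be notified” question answers itself.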

This came up in conversation with a developer at the Google OpenSource Jam so I thought I’d mention it while it is fresh in my mind (update: at which point I forgot to move it to the published directory. Doh). Breaking up config files isn’t done just to annoy people, it’s done to make automated and mass management easier. A solid practical example is the Debian Apache configs. Historically most distros (and too many current ones) used a single config file for Apache. Read on →
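The machinery behind Debian’s a2ensite/a2dissite pair is nothing more than symlinks between two directories, which is what makes the split layout so scriptable. A sketch of the mechanism in a scratch directory (paths and vhost contents are illustrative):

```shell
mkdir -p /tmp/apache-demo/sites-available /tmp/apache-demo/sites-enabled
printf '<VirtualHost *:80>\n  ServerName example.com\n</VirtualHost>\n' \
  > /tmp/apache-demo/sites-available/example.com

# 'a2ensite example.com' essentially boils down to:
ln -sf ../sites-available/example.com /tmp/apache-demo/sites-enabled/example.com
# ...and 'a2dissite' just removes the link; the master config only
# ever Include-s the sites-enabled/ directory
ls /tmp/apache-demo/sites-enabled/
```

One file per vhost means your config management tool can drop, enable and disable sites without ever parsing a monolithic httpd.conf.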

When you’re first introduced to an environment you’ll have the ever-fun task of working out which machines should get the most time, and that order seldom matches which machines actually need the most attention. To help me prioritise I’ve worked out a simple importance rating system to show where I spend my time. Below is a simplified version. I use it to assign a single importance number to each machine, and then I allocate a certain amount of time each day to work on the issues, requests and improvements I’ve got in my todo list for that level. Read on →

Although it’s a rare Unix machine that doesn’t run at least a couple of custom cronjobs, it’s an even more special snowflake that does them properly. Below are some of the more common problems I’ve seen and my thoughts on them. Always use a script, never a bare command line. A parenthesis-wrapped command line in a crontab sends shivers down my spine. Nothing says “I didn’t really think this through” and “I’ve done the bare minimum to make it work” in quite the same way. Read on →
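What “use a script” buys you is somewhere to put the safety rails. A skeleton of the sort of wrapper I mean - the job name is made up, and the lock uses mkdir because directory creation is atomic:

```shell
#!/bin/sh
set -eu                          # die on errors and on unset variables

LOCK=/tmp/nightly-job.lock
if ! mkdir "$LOCK" 2>/dev/null; then
  echo "nightly-job: previous run still active, skipping" >&2
  exit 0
fi
trap 'rmdir "$LOCK"' EXIT        # always release the lock, even on failure

# ... the actual work goes here ...
status=done
echo "nightly-job: $status"
```

Compare that with a one-liner in the crontab: no overlap protection, no error handling, and any output discipline is down to luck.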

“The Google team found that 36% of the failed drives did not exhibit a single SMART-monitored failure. They concluded that SMART data is almost useless for predicting the failure of a single drive.” – StorageMojo - Google’s Disk Failure Experience There have been two excellent papers on disk drive failures released recently, the Dugg and Dotted Google paper - Failure trends in a large disk drive population (warning: PDF) and the also excellent but less hyped Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?. Read on →

Here’s one for the sysadmins in the crowd; if you were asked to show the following how long would it take you to gather the information? Which of your file systems have the fastest growth rate? Which are the most under-utilised? Which haven’t changed by more than 5% over the last month? If you use Nagios you can cheat and work out the full drive size from the free space and percentage used reported by the disk checks, but that’s… icky. Read on →
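The “cheat” is simple arithmetic: if a check reports F MB free and P% used, the filesystem size is F / (1 - P/100). In shell, with made-up figures standing in for a real check result:

```shell
free_mb=2048      # free space reported by the disk check
pct_used=75       # percentage used reported by the same check

# size = free / (1 - used_fraction); awk handles the floating point
total_mb=$(awk -v f="$free_mb" -v p="$pct_used" \
  'BEGIN { printf "%d", f / (1 - p / 100) }')
echo "filesystem size: ${total_mb}MB"
```

Collect that figure over time and the growth-rate questions become a diff between two samples - which is exactly why it feels icky to derive it from a monitoring check rather than record it properly.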


A number of Unix/Linux people seem to pride themselves on obtaining the highest uptime they can. While this may seem like a little harmless fun, in a production environment (which are mostly fun-free places), it can hide a number of problems that will later become major issues. At some point the machine will have to come down and face a power off or reboot, and then it’s expected to come back up, and this is where the problems can start. Read on →


Although it sounds pretty fast, when you actually start benchmarking it, Gigabit Ethernet isn’t quite as good a solution as you’d think. As more and more commercial deployments move to using SANs and NAS for online storage and backups it’s increasingly easy to saturate existing LANs. One possible solution as people start to look at 10 and 100Gbps networks is FireEngine (PDF), a set of architecture changes and improvements for Solaris 10. Read on →

I hate to jump on any bandwagon that starts at Slashdot, although even a broken clock is right twice a day, but I find myself agreeing with a number of the Slashdot comments made about the new WS-Management spec. Firstly, and most importantly, SNMP is still the most widely used management protocol in production. Secondly, it has survived the invention of a number of replacements; WBEM and CIM spring to mind as standards chosen to replace a lot of the functionality it provides, and oddly enough those specs were also backed by Microsoft and Sun. Read on →


“It’s only running a single service, we’re fully patched and it has a local firewall that denies by default.” “What happens if I do Ctrl-Alt-Delete?” One of the basic premises of computer security is that it’s almost impossible to fully secure any machine to which an attacker has physical access. While we cannot cover all eventualities, we can make some simple changes to catch any use of the more blatant avenues of abuse. Read on →
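The Ctrl-Alt-Delete jibe refers to the default sysvinit behaviour: a reboot that anyone at the console can trigger. On an inittab-based system the stock line looks like the first entry below, and swapping the command for a logger call (paths as on a typical Debian box; adjust to taste) turns the free reboot into an audit trail entry instead:

```
# stock /etc/inittab entry - the three-fingered salute reboots the box:
#ca:12345:ctrlaltdel:/sbin/shutdown -t1 -a -r now
# replacement - log the attempt instead (run 'telinit q' to reload inittab):
ca:12345:ctrlaltdel:/usr/bin/logger -p auth.warning "console ctrl-alt-del pressed"
```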