A few people I follow on Twitter mentioned they’d contributed to 97 Things Every SRE Should Know. It’s a book full of short, 1-3 page chapters, focused on topics dear to an SRE’s heart. So I had no choice but to buy it. In an attempt to be more deliberate with my reading and what I retain from the book I’ve decided to create some reading notes for future me. This post is broken down into a section per chapter. Read on →


Gremlin recently ran a small Twitter hashtag challenge called “#talesfromtheNOC” where people were invited to share their scary sysadmin stories. Reading through the entries I was reminded of one slightly less than welcoming environment that led to a lot of learning, frustration and trepidation. I’ve captured my posts here. Opening Volley: I was hired as the new sysadmin at a financial services company and had no handover as my predecessor had apparently ‘left on short notice’. Read on →

Summer is in the air and it seems like time to replace my entire home lab monitoring system once more. Sensu has been plodding along nicely but I’m in this for the learnin’ so I’m looking for something more interesting than a major version bump and move to Golang. I’m thinking of giving Prometheus a spin to see how it’s changed over the last few years and as a first step I decided it was time to upgrade my local test bed Docker Compose Prometheus and add some bells and whistles. Read on →

For most companies Incident Commander or Incident Manager is not a specific job, it’s a role you may take on when something has gone, often horribly, wrong and you need to quickly unite an ad hoc group into a team to resolve it. The incident commander should be the point of contact, and source of truth, about your incident and to do that successfully they’ll need to be updated and kept informed about what’s happening. Read on →

Closing the loop on a monitoring alert is traditionally something that implicitly happens when the dashboard returns to its idyllic green state, the text message returns a well deserved “Service: OK” or in more extreme cases the incident review is over and actions have been assigned. This however assumes the alert is working well and the operator understands why it woke them up and the value their involvement brings. In more fluid environments alerts can be incorrect, flag issues that do not require immediate attention or, in the worst case, be ghost calls that mysteriously correct themselves just after you’ve woken up enough to find your MFA device. Read on →

I have a tab that normally lives somewhere near the middle of my web browser’s tab bar. Over the course of my day it faces constant pressure on each side: from ad hoc work tabs opened by the pinned email and Slack tabs on its left, and from proactive work based tabs on its right. I’ve learned I can tell how my week is going by where it is on the bar. Read on →

Just before Christmas I had to do some work on new business continuity plans (BCP) and disaster recovery (DR) documents. To help warm up and get myself in the right frame of mind I posted a few easy opening scenarios to Twitter for comment and I’ve decided to collect them back up and post here, in my external memory, for posterity. Each of these ideas should be considered the most generic and low hanging fruit of your plans. Read on →


I’ve had a half written draft of this post sitting in a folder for the last six months and I’ve not been able to shake the root cause so I’m going to publish it and see what the feedback teaches me. But first the heresy - Service Level Objectives make me uncomfortable. I have no issue with the idea that you need some form of measurement and tracking to ensure you’re maintaining an acceptable level of service but when reading posts on SLOs, or watching recorded conference sessions, the concept seems to imply some rigour and background process to determine the numbers to work towards that feels decoupled from any hard details and often comes across as either a guesstimate or just a Current Representation of Actual Percentages. Read on →


When I first started my Prometheus experiments with docker-compose one of the most awkward parts of the process, especially to document, were the manual steps required to click around the Grafana dashboard in order to add the Prometheus datasource. Thanks to the wonderful people behind Grafana there has been a push in the newest major version, 5 at time of writing, to make Grafana easier to automate. And it really does pay off. Read on →

After adding AlertManager to my Prometheus test stack in a previous post I spent some time triggering different failure cases and generating test messages. While it’s slightly satisfying seeing rows change from green to red I soon wanted to actually send real alerts, with all their values, somewhere I could easily view. My criteria were: must be easy to integrate with AlertManager; must not require external network access; must be easy to use from docker-compose; should have as few moving parts as possible. A few short web searches later I stumbled back onto a small server I’ve used for this in the past - MailHog. Read on →

What’s the use of monitoring if you can’t raise alerts? It’s half a solution at best and now I have basic monitoring working, as discussed in Prometheus experiments with docker-compose, it felt like it was time to add AlertManager, Prometheus’ often used partner in crime, so I can investigate raising, handling and resolving alerts. Unfortunately this turned out to be a lot harder than ‘just’ adding a basic exporter. Before we delve into the issues and how I worked around them in my implementation let’s see the result of all the work, adding a redis alert and forcing it to trigger. Read on →

How much of your system does your internal monitoring need to consider down before something is user visible? While there will always be the perfect chain of three or four things that can cripple a chunk of your customer visible infrastructure there are often a lot of low importance checks that will flare up and consume time and attention. But what’s the ratio? As a small thought experiment on one project I’ve recently started to leave a new, very simple four panel, Grafana dashboard open on a Raspberry Pi driven monitor that shows the percentage of the internal monitoring checks that are currently in a successful state next to the number of user visible issues and incidents. Read on →

As 2018 rolls along the time has come to rebuild parts of my homelab again. This time I’m looking at my monitoring and metrics setup, which is based on sensu and graphite, and planning some experiments and evaluations using Prometheus. In this post I’ll show how I’m setting up my tests and provide the [Prometheus experiments with docker-compose](https://github.com/deanwilson/docker-compose-prometheus) source code in case it makes your own experiments a little easier to run. Read on →


As part of operation ‘make my infrastructure look like an adult operates it’ I needed to add some basic uptime/availability checks to a few simple sites. After some investigation I came up with three options: Pingdom, which I’d used before in production and was comfortable with, and two I’d not used in the past, Uptime Robot and StatusCake. By coincidence I was also doing my quarterly check of which AWS resources Terraform supported and I noticed that a StatusCake Provider had recently been added so I decided to experiment with the two of them together. Read on →


In the T-DOSE Zabbix talk, which I’m happy to say was both well presented and showed some interesting features, I got called out for a quote I made on Twitter (which just goes to show - you never know where what you said is going to show up and haunt you) about the relevance, and I’d say overemphasis, of the GUI to the Zabbix monitoring system - and other monitoring systems in general. Read on →

One antipattern I’m seeing with increasing frequency is that of obese (or fat, or bloated) system provisioning. It seems as common in people that are just getting used to having an automated provisioning system and are enthusiastic about its power as it is in longer term users who have added layer on layer of cruft to their host builder. The basic problem is that of adding too much work and intelligence to the actual provisioning stage. Read on →

One of my little side projects is moving an old, configured in little steps over a long period of time, website from apache 1.3 to a much more sensible apache 2.2 server. I’ve been thinking about how to get the most out of the testing I need to do for the move and so today I decided to do some yak shaving and write some simple regression tests, play with Cucumber Nagios, rspec matchers and write a little ruby. Read on →

I’m a fan of documentation; over the years I’ve ended up supporting more than one business critical system that has less documentation than you get from a cat /dev/null. The only downside, and I’ve been bitten by a couple of things like this over the last week, is the case of the spreadsheet vs the post-it note - if you have a lovely, well formatted and information dense spreadsheet that says “A is 1” and when you get to the server room the switch has a post-it, in bad scrawl, that says “B is 2”, which do you believe? Read on →


We’ve recently been searching for a junior sysadmin to join the team (and I’m very happy to say we’ve now succeeded) so as part of my day to day tasks I had to come up with a dozen simple questions to weed out the people that have never used anything but webmin (and there is a surprising number of them out there). One of the questions seemed to cause a lot of trouble in the general sense and tripped up the few who even made an attempt - Read on →

When it comes to the list of problems ‘our uptimes are too high’ isn’t normally in the top five that sysadmins dread. While having a lengthy uptime used to be a boasting point it can also hide technical issues - such as kernel upgrades you’ve applied but not enabled (unless you’re running something special like ksplice), confidence gaps in high availability systems (when was the last time you did a fail over? Read on →

We recently consolidated a number of websites used by one of our brands back down to a sensible number (sensible being one). Which, while only a single action point on an email, turned out to be a large amount of DNS and apache vhost wrangling. In order to give myself a safety net, and an easy to check todo list, I decided to invest ten minutes in writing a small test script. Read on →

At work we both build our own packages and use Puppet to manage our servers. While the developers package up their work, in the systems team we’ve moved more to deploying programs and their dependencies via Puppet. While it seems easier, and quicker, to do the pushing that way, at least for scripts, you lose the ability to track what’s responsible for putting each file on the system. I’m probably already modelling the more complex parts of what would be in a package (such as services and cronjobs) in the module and thanks to Puppet I’m probably doing it in quite a portable way. Read on →

Nagios has a wonderful ‘duration’ column in its web interface that’s always bemused me. At what point does a check being in a warning, or even worse, a critical state stop being a problem worthy of head space and start being normal operating procedure? Checks can stay in an extended broken state for many reasons but they all seem to be symptoms of a larger problem. If it’s a small thing then are you getting enough time to do housekeeping? Read on →

We recently had an odd one where the Nagios check_http check, which was both checking for the presence of a string in the response and that the page loaded in a certain time frame, went from reporting a ‘CRITICAL - string not found’ to a ‘HTTP WARNING: HTTP/1.1 200 OK’. My first thought, as this was a site pending migration, was that the URL had moved to a slower machine with the fixes released to it. Read on →

Logs are a wonderful thing. If done correctly they point out the source of all errors, show you what’s running slow and contain useful information on how your system is running. At every place I’ve ever worked they’ve been busy, full of odd one offs and too often overlooked. I’m going to be doing a fair bit of log processing next week so expect lots of little toolchain scripts like syslog-splitter. Read on →

Now that Chef is out and about, people that accepted the massive improvement over all the existing host configuration managers that is Puppet will probably be casting a wary eye its way. I’ve got a little too much in Puppet at my current position to look at moving for a while yet but now the competition is rising it’s time to get my boot in and point out what, for me, is the worst part of Puppet: how difficult it is to add new types. Read on →

“Penetration testing is tactical. It provides tangible, actionable information” – Ivan Arce It’s been a while since I’ve been involved in pen testing but the above quote from Ivan is perfect and its meaning all too often overlooked. When you invest the time in something like pen testing or performance tuning you should always come away with a list of actionable tasks. By doing this you ensure the work wasn’t pointless (or if it was, avoid repeating the mistake) and have something you can present to stakeholders to get buy in for the next time. Read on →

It’s been another day of many DNS changes and while the work itself has been amazingly dull, life draining, scut work at least one positive thing’s come out of it - my appreciation for the Net::DNS perl module has grown. While it’s possible to do nearly anything DNS query related with the dig command it’s a lot easier to extract the data and reuse certain fields if you have access to a decent data structure rather than grepping bits of text out. Read on →


You’ve gathered the requirements, written the code, debugged it, received the new requirements, rewritten the code, got more change requests, reached a ‘compromise’ with QA (and hidden the bodies) and now you want to have the sysadmins do the release. Don’t be like everyone else - when it comes to releases too many people fail at the last mile and make obvious mistakes. In an attempt to save myself some pain (and have something to point co-workers at) here are some of the software release principles that I hold dear. Read on →

My recent bugbear is - servers with inaccessible memory. You go and spec a nice new server with say 8GB of RAM (a little box), you install Debian, you start adding applications to the machine and then a couple of months later some anal sysadmin comes along, does a `free -m` and mutters about under-specced virtualization servers when he sees `total used free shared buffers cached` / `Mem: 3287 225 3062 0 24 149`. For those of you not paying attention - the machine isn’t using over half of its memory. Read on →
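
A minimal sketch of that check in Python, assuming the older `free -m` column layout quoted in the excerpt and taking 8192 MB as the installed size of the "8GB" box. The function names are illustrative only and are not from the original post.

```python
# Column names assume the older procps `free -m` layout quoted in the post.
FREE_COLUMNS = ["total", "used", "free", "shared", "buffers", "cached"]


def parse_mem_line(line: str) -> dict:
    """Turn a 'Mem:' line from `free -m` into a dict of megabyte values."""
    fields = line.split()
    if not fields or fields[0] != "Mem:":
        raise ValueError("expected a line starting with 'Mem:'")
    values = [int(value) for value in fields[1:1 + len(FREE_COLUMNS)]]
    return dict(zip(FREE_COLUMNS, values))


def visible_fraction(total_mb: int, installed_mb: int) -> float:
    """Fraction of the installed RAM that the OS can actually see."""
    return total_mb / installed_mb


mem = parse_mem_line("Mem: 3287 225 3062 0 24 149")
# 8192 MB is an assumed figure for the machine described in the post.
fraction = visible_fraction(mem["total"], 8192)
print(f"OS sees {mem['total']} MB, {fraction:.0%} of installed RAM")
```

Comparing the `total` column against what was actually ordered is the whole trick; the `used` and `free` columns say nothing about memory the kernel never saw.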


“Autonomics refer to the ability of computer systems to be self-managing.” – autonomics.ca Here’s one that has been bothering me. Suppose you have a recurring problem that your “autonomic solution” can handle every time it occurs without anyone knowing. At what point does the fact there is a treatable issue propagate up to a real person? While an automatic “fix and tell me later” approach helps change your work from fire fighting to planned tasks, what classifies a temporary problem as being important enough to warrant you investigating it? Read on →

I’m a sysadmin and half my working life seems to be spent handling other people’s requests (which is why I’m trying to move over to infrastructure work - where I can hopefully concentrate on something for three whole minutes). While chatting with a junior admin at a tech talk in the week the following three tips came up: Use a ticketing system. This one comes up a lot but it’s true, never dropping someone’s request is well worth the time spent setting it up. Read on →

After my little whine I logged in to do my last checks for the evening to discover that one of our webservers had died due to a hard drive going bang, our production environment Nagios box had lost one of its network connections and a chunk of our SAN kit was complaining about power issues. Turns out that most of these were due to a power surge that killed a network switch and three of the racks’ power strips. Read on →

Here is another one for the sysadmins in the audience: How … … many of your servers have multiple network ports in the back? … many of them have bonding (teaming for the Windows people) enabled? … do you know when one interface goes down if the machine stays connected? … long does it take for you to be notified? … do you know if they start flapping? … many have their bonded interfaces plugged in to different switches? Read on →

This came up in conversation with a developer at the Google OpenSource Jam so I thought I’d mention it while it is fresh in my mind (update: at which point I forgot to move it to the published directory. Doh). Breaking up config files isn’t done just to annoy people, it’s done to make automated and mass management easier. A solid practical example is the Debian Apache configs. Historically most distros (and too many current ones) used a single config file for Apache. Read on →

When you’re first introduced to an environment you’ll have the ever fun task of working out which machines should get the most time; and that order seldom matches which machines actually need the most attention. To help me prioritise I’ve worked out a simple importance rating system to show where I spend my time. Below is a simplified version. I use it to assign a single importance number to each machine, and then I allocate a certain amount of time each day to work on the issues, requests and improvements I’ve got in my todo list for that level. Read on →

Although it’s a rare Unix machine that doesn’t run at least a couple of custom cronjobs it’s an even more special snowflake that does them properly. Below are some of the more common problems I’ve seen and my thoughts on them. Always use a script, never a bare command line. A parenthesis wrapped command-line in a crontab sends shivers down my spine. Nothing says "I didn't really think this through" and " Read on →

“The Google team found that 36% of the failed drives did not exhibit a single SMART-monitored failure. They concluded that SMART data is almost useless for predicting the failure of a single drive.” – StorageMojo - Google’s Disk Failure Experience There have been two excellent papers on disk drive failures released recently, the Dugg and Dotted Google paper - Failure trends in a large disk drive population (warning: PDF) and the also excellent but less hyped Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? Read on →

Here’s one for the sysadmins in the crowd; if you were asked to show the following how long would it take you to gather the information? Which of your file systems have the fastest growth rate? Which are the most under-utilised? Which haven't changed by more than 5% over the last month? If you use Nagios you can cheat and work out the full drive size from the free space and percentage used reported by the disk checks, but that’s… icky. Read on →
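
The "icky" arithmetic is short enough to sketch in Python. This assumes you already have the free space and percentage used figures to hand; the function names and the sample numbers are hypothetical and not taken from any Nagios plugin.

```python
def total_size_mb(free_mb: float, percent_used: float) -> float:
    """Estimate the full filesystem size from free space and percent used."""
    if not 0 <= percent_used < 100:
        raise ValueError("percent_used must be at least 0 and below 100")
    # free = total * (1 - used fraction), so total = free / (1 - used fraction)
    return free_mb / (1 - percent_used / 100)


def growth_mb_per_day(used_then_mb: float, used_now_mb: float, days: float) -> float:
    """Average growth between two samples taken `days` apart."""
    return (used_now_mb - used_then_mb) / days


# Hypothetical readings: 2500 MB free on a filesystem that is 75% used.
size = total_size_mb(2500, 75)
print(f"estimated size: {size:.0f} MB")
print(f"growth: {growth_mb_per_day(7000, 7500, 30):.1f} MB/day")
```

It works, but rounding in the reported percentage makes the estimate fuzzy on large filesystems, which is a good reason to record the real sizes instead.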


A number of Unix/Linux people seem to pride themselves on obtaining the highest uptime they can. While this may seem like a little harmless fun, in a production environment (which are mostly fun-free places), it can hide a number of problems that will later become major issues. At some point the machine will have to come down and face a power off or reboot, and then it’s expected to come back up, and this is where the problems can start. Read on →


Although it actually sounds pretty fast, when you actually start benchmarking it, Gigabit Ethernet isn’t quite as good a solution as you’d think. As more and more commercial deployments move to using SANs and NAS for online storage and backups it’s increasingly easy to saturate existing LANs. One possible solution as people start to look at 10 and 100Gbps networks is FireEngine (PDF), a set of architecture changes and improvements for Solaris 10. Read on →

I hate to jump on any bandwagon that starts at Slashdot, although even a broken clock is right twice a day, but I find myself agreeing with a number of the Slashdot comments made about the new WS-Management spec. Firstly, and most importantly, SNMP is still the most widely used management protocol in production. Secondly it has survived the invention of a number of replacements, WBEM and CIM spring to mind as standards chosen to replace a lot of the functionality it provides; oddly enough those specs were also backed by Microsoft and Sun. Read on →


“It’s only running a single service, we’re fully patched and it has a local firewall that denies by default.” “What happens if I do Ctrl-Alt-Delete?” Introduction: One of the basic premises of computer security is that it's almost impossible to fully secure any machine to which an attacker has physical access. While we cannot cover all eventualities, we can make some simple changes to catch any use of the more blatant avenues of abuse. Read on →