2024

Sun, Feb 18, 2024

sysadmin

Incident Initiation: Pinpointing the Precise Problem Point

A question about incident timelines to help you start your day. On Thursday you raise the rate limit on an API endpoint. On Friday traffic levels start to rise due to a bot. On Saturday the API becomes unavailable due to a DDoS. When did the incident start?

Tue, Feb 13, 2024

sysadmin

Indicators of outage: Status Page Traffic

While sketching out some possible improvements to our monitoring, something that popped into my head was tying traffic spikes on our status page to our alerting system. If we suddenly see a few hundred percent rise in traffic we can be pretty sure something’s not working somewhere on our side. Not quite as much fun as a previous experiment on watching tweets for our name and less than positive sentiment but something that could be interesting to prototype. Read on →
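As a rough illustration of the idea, here is a minimal Python sketch that flags a status page traffic spike. Everything in it is an assumption made for the example: the function name, the use of a median baseline, the 300% threshold and the sample hit counts are all hypothetical, and a real prototype would pull its numbers from your web logs or metrics store.

```python
from statistics import median


def is_traffic_spike(history, current, threshold_pct=300.0):
    """Return True when `current` is at least `threshold_pct` percent above the baseline.

    `history` is a list of recent per-interval hit counts (hypothetical data);
    the baseline is their median so one odd interval does not skew it.
    """
    if not history:
        return False  # nothing to compare against yet
    baseline = median(history)
    if baseline <= 0:
        return current > 0  # any traffic is a rise from nothing
    rise_pct = (current - baseline) / baseline * 100.0
    return rise_pct >= threshold_pct


recent_hits = [40, 38, 45, 41, 39]         # status page hits per minute (made up)
print(is_traffic_spike(recent_hits, 44))   # False
print(is_traffic_spike(recent_hits, 210))  # True
```

A True result here would be the point at which you raise, or add weight to, an alert rather than proof of an outage on its own.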

2021

Mon, Jan 11, 2021

sysadmin

97 things every SRE should know - Part 01

A few people I follow on Twitter mentioned they’d contributed to 97 Things Every SRE Should Know. It’s a book full of short, 1-3 page chapters, focused on topics dear to an SRE’s heart. So I had no choice but to buy it. In an attempt to be more deliberate with my reading and what I’ve retained from the book I’ve decided to create some reading notes for future me. This post is broken down into a section per chapter. Read on →

2020

Mon, Dec 7, 2020

sysadmin

Scary sysadmin Halloween stories

Gremlin recently ran a small Twitter hashtag challenge called “#talesfromtheNOC” where people were invited to share their scary sysadmin stories. Reading through the entries I was reminded of one slightly less than welcoming environment that led to a lot of learning, frustration and trepidation. I’ve captured my posts here. Opening Volley: I was hired as the new sysadmin at a financial services company and had no hand over as my predecessor had apparently ‘left on short notice’. Read on →

Thu, Jul 30, 2020

sysadmin

Upgrading docker-compose Prometheus - July 2020

Summer is in the air and it seems like time to replace my entire home lab monitoring system once more. Sensu has been plodding along nicely but I’m in this for the learnin’ so I’m looking for something more interesting than a major version bump and move to Golang. I’m thinking of giving Prometheus a spin to see how it’s changed over the last few years and as a first step I decided it was time to upgrade my local test bed Docker Compose Prometheus and add some bells and whistles. Read on →

Wed, Jun 24, 2020

sysadmin

Incident updates, interruptions and the 30 minute window

For most companies Incident Commander or Incident Manager is not a specific job, it’s a role you may take on when something has gone, often horribly, wrong and you need to quickly unite an ad hoc group into a team to resolve it. The incident commander should be the point of contact, and source of truth, about your incident and to do that successfully they’ll need to be updated and kept informed about what’s happening. Read on →

Fri, Jun 12, 2020

sysadmin

Monitoring alerts and customer satisfaction surveys

Closing the loop on a monitoring alert is traditionally something that implicitly happens when the dashboard returns to its idyllic green state, the text message returns a well deserved “Service: OK” or in more extreme cases the incident review is over and actions have been assigned. This however assumes the alert is working well and the operator understands why it woke them up and the value their involvement brings. In more fluid environments alerts can be incorrect, can flag issues that do not require immediate attention and in the worst case can be ghost calls that mysteriously correct themselves just after you’ve woken up enough to find your MFA device. Read on →

Tue, May 12, 2020

sysadmin

The little tab in the middle

I have a tab that normally lives somewhere near the middle of my web browser’s tab bar. Over the course of my day it faces constant pressure on each side: from ad hoc work tabs being opened by the pinned email and Slack tabs on its left, and from proactive work based tabs on its right. I’ve learned I can tell how my week is going by where it is on the bar. Read on →

Fri, Feb 7, 2020

sysadmin

Low hanging BCP and DR scenarios

Just before Christmas I had to do some work on new business continuity plans (BCP) and disaster recovery (DR) documents. To help warm up and get myself in the right frame of mind I posted a few easy opening scenarios to Twitter for comment and I’ve decided to collect them back up and post here, in my external memory, for posterity. Each of these ideas should be considered the most generic and low hanging fruit of your plans. Read on →

2019

Wed, Nov 13, 2019

sysadmin

Magic Numbers and second guessing SLOs - why is 96% better than 95%?

I’ve had a half written draft of this post sitting in a folder for the last six months and I’ve not been able to shake the root cause so I’m going to publish it and see what the feedback teaches me. But first the heresy - Service Level Objectives make me uncomfortable. I have no issue with the idea that you need some form of measurement and tracking to ensure you’re maintaining an acceptable level of service but when reading posts on SLOs, or watching recorded conference sessions, the concept seems to imply some rigour and background process to determine the numbers to work towards that feels decoupled from any hard details and often comes across as either a guesstimate or just a Current Representation of Actual Percentages. Read on →
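To put the title’s two numbers side by side, here is a small Python sketch of the arithmetic. The function name and the 30 day window are assumptions chosen for illustration, not something taken from the post, and it treats the target as simple time-based availability.

```python
def allowed_downtime_hours(slo_pct, window_days=30):
    """Hours of complete outage an availability target tolerates over the window."""
    if not 0 <= slo_pct <= 100:
        raise ValueError("slo_pct must be between 0 and 100")
    window_hours = window_days * 24
    return window_hours * (100 - slo_pct) / 100


for target in (95, 96):
    print(f"{target}%: {allowed_downtime_hours(target):.1f} hours per 30 days")
# 95%: 36.0 hours per 30 days
# 96%: 28.8 hours per 30 days
```

The calculation shows what a one point change costs in hours, but it says nothing about whether either number is the right one for your users, which is the question the post is circling.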

2018

Thu, Jun 7, 2018

sysadmin

Automatic datasource configuration with Grafana 5

When I first started my Prometheus experiments with docker-compose one of the most awkward parts of the process, especially to document, was the set of manual steps required to click around the Grafana dashboard in order to add the Prometheus datasource. Thanks to the wonderful people behind Grafana there has been a push in the newest major version, 5 at time of writing, to make Grafana easier to automate. And it really does pay off. Read on →

Tue, Mar 13, 2018

sysadmin

Viewing AlertManager Email Alerts via MailHog

After adding AlertManager to my Prometheus test stack in a previous post I spent some time triggering different failure cases and generating test messages. While it’s slightly satisfying seeing rows change from green to red I soon wanted to actually send real alerts, with all their values, somewhere I could easily view. My criteria were: must be easy to integrate with AlertManager; must not require external network access; must be easy to use from docker-compose; should have as few moving parts as possible. A few short web searches later I stumbled back onto a small server I’ve used for this in the past - MailHog. Read on →

Sat, Mar 10, 2018

sysadmin

Adding AlertManager to docker-compose Prometheus

What’s the use of monitoring if you can’t raise alerts? It’s half a solution at best and now I have basic monitoring working, as discussed in Prometheus experiments with docker-compose, it felt like it was time to add AlertManager, Prometheus’ often used partner in crime, so I can investigate raising, handling and resolving alerts. Unfortunately this turned out to be a lot harder than ‘just’ adding a basic exporter. Before we delve into the issues and how I worked around them in my implementation let’s see the result of all the work, adding a redis alert and forcing it to trigger. Read on →

Sun, Mar 4, 2018

sysadmin

Green system percentage vs user visible issues

How much of your system does your internal monitoring need to consider down before something is user visible? While there will always be the perfect chain of three or four things that can cripple a chunk of your customer visible infrastructure there are often a lot of low importance checks that will flare up and consume time and attention. But what’s the ratio? As a small thought experiment on one project I’ve recently started to leave a new, very simple four panel, Grafana dashboard open on a Raspberry Pi driven monitor that shows the percentage of the internal monitoring checks that are currently in a successful state next to the number of user visible issues and incidents. Read on →
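The two numbers on that dashboard can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the check names, the state strings and the incident count are hypothetical, and in practice they would come from your monitoring system and incident tracker rather than a hard-coded dict.

```python
def green_percentage(check_states):
    """Percentage of checks currently in the 'ok' state."""
    if not check_states:
        raise ValueError("no checks to summarise")
    ok = sum(1 for state in check_states.values() if state == "ok")
    return 100.0 * ok / len(check_states)


# Hypothetical snapshot of internal monitoring checks and their states.
checks = {"disk": "ok", "ntp": "warning", "http": "ok", "backup": "critical"}
user_visible_incidents = 0  # hypothetical count from an incident tracker

print(f"{green_percentage(checks):.0f}% green, "
      f"{user_visible_incidents} user visible issues")
```

Recording both values over time is what lets you start answering the ratio question, for example by noting how green the system was each time an incident was opened.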

Sat, Feb 17, 2018

sysadmin

Prometheus experiments with docker-compose

As 2018 rolls along the time has come to rebuild parts of my homelab again. This time I’m looking at my monitoring and metrics setup, which is based on sensu and graphite, and planning some experiments and evaluations using Prometheus. In this post I’ll show how I’m setting up my tests and provide the [Prometheus experiments with docker-compose](https://github.com/deanwilson/docker-compose-prometheus) source code in case it makes your own experiments a little easier to run. Read on →

2016

Mon, Jan 25, 2016

sysadmin

Website monitoring with statuscake and terraform

As part of operation ‘make my infrastructure look like an adult operates it’ I needed to add some basic uptime/availability checks to a few simple sites. After some investigation I came up with three options: Pingdom, which I’d used before in production and was comfortable with, and two I’d not used in the past, Uptime Robot and StatusCake. By coincidence I was also doing my quarterly check of which AWS resources Terraform supported and I noticed that a StatusCake Provider had recently been added so I decided to experiment with the two of them together. Read on →

2010

Wed, Nov 10, 2010

sysadmin

Zabbix GUIs and Automation

In the T-DOSE Zabbix talk, which I’m happy to say was both well presented and showed some interesting features, I got called out for a quote I made on Twitter (which just goes to show - you never know where what you said is going to show up and haunt you) about the relevance, and I’d say overemphasis, of the GUI to the Zabbix monitoring system - and other monitoring systems in general. Read on →

Thu, Jun 3, 2010

sysadmin

Obese Provisioning - Antipattern

One antipattern I’m seeing with increasing frequency is that of obese (or fat, or bloated) system provisioning. It seems as common in people that are just getting used to having an automated provisioning system and are enthusiastic about its power as it is in longer term users who have added layer on layer of cruft to their host builder. The basic problem is that of adding too much work and intelligence to the actual provisioning stage. Read on →

Wed, Mar 31, 2010

sysadmin

HTTP Server Headers via Cucumber

One of my little side projects is moving an old, configured in little steps over a long period of time, website from apache 1.3 to a much more sensible apache 2.2 server. I’ve been thinking about how to get the most out of the testing I need to do for the move and so today I decided to do some yak shaving and write some simple regression tests, play with Cucumber Nagios, rspec matchers and write a little ruby. Read on →

Tue, Jan 12, 2010

sysadmin

Spreadsheets Vs Post-It Notes

I’m a fan of documentation; over the years I’ve ended up supporting more than one business critical system that has less documentation than you get from a cat /dev/null. The only downside, and I’ve been bitten by a couple of things like this over the last week, is the case of the spreadsheet vs the post-it note - if you have a lovely, well formatted and information dense spreadsheet that says “A is 1” and when you get to the server room the switch has a post-it, in bad scrawl, that says “B is 2”, which do you believe? Read on →

2009

Fri, Aug 28, 2009

sysadmin

Find and replace interview question

We’ve recently been searching for a junior sysadmin to join the team (and I’m very happy to say we’ve now succeeded) so as part of my day to day tasks I had to come up with a dozen simple questions to weed out the people that have never used anything but webmin (and there is a surprising number of them out there). One of the questions seemed to cause a lot of trouble in the general sense and tripped up the few who even made an attempt - Read on →

Thu, Aug 27, 2009

sysadmin

Large uptimes - a wonderful problem to have

When it comes to the list of problems ‘our uptimes are too high’ isn’t normally in the top five that sysadmins dread. While having a lengthy uptime used to be a boasting point it can also hide technical issues - such as kernel upgrades you’ve applied but not enabled (unless you’re running something special like ksplice), confidence gaps in high availability systems (when was the last time you did a fail over? Read on →

Wed, Aug 19, 2009

sysadmin

Testing A Production DNS Re-point

We recently consolidated a number of websites used by one of our brands back down to a sensible number (sensible being one). Which, while only a single action point on an email, turned out to be a large amount of DNS and apache vhost wrangling. In order to give myself a safety net, and an easy to check todo list, I decided to invest ten minutes in writing a small test script. Read on →

Fri, Jul 3, 2009

sysadmin

By Puppet or Package

At work we both build our own packages and use puppet to manage our servers. While the developers package up their work, in the systems team we’ve moved more to deploying programs and their dependencies via Puppet. While it seems easier, and quicker, to do the pushing that way, at least for scripts, you lose the ability to track what’s responsible for putting each file on the system. I’m probably already modelling the more complex parts of what would be in a package (such as services and cronjobs) in the module and thanks to Puppet I’m probably doing it in quite a portable way. Read on →

Wed, Jun 3, 2009

sysadmin

It's been Critical for how long?

Nagios has a wonderful ‘duration’ column in its web interface that’s always bemused me. At what point does a check being in a warning, or even worse, a critical state stop being a problem worthy of head space and start being normal operating procedure? Checks can stay in an extended broken state for many reasons but they all seem to be symptoms of a larger problem. If it’s a small thing then are you getting enough time to do housekeeping? Read on →

Wed, Feb 4, 2009

sysadmin

Nagios check_http flaps

We recently had an odd one where the Nagios check_http check, which was both checking for the presence of a string in the response and that the page loaded in a certain time frame, went from reporting a ‘CRITICAL - string not found’ to a ‘HTTP WARNING: HTTP/1.1 200 OK’. My first thought, as this was a site pending migration, was that the URL had moved to a slower machine with the fixes released to it. Read on →

Tue, Feb 3, 2009

sysadmin

Splitting Syslogs by Facility

Logs are a wonderful thing. If done correctly they point out the source of all errors, show you what’s running slow and contain useful information on how your system is running. At every place I’ve ever worked they’ve been busy, full of odd one offs and too often overlooked. I’m going to be doing a fair bit of log processing next week so expect lots of little toolchain scripts like syslog-splitter. Read on →

Mon, Jan 19, 2009

sysadmin

My Pet Puppet Hate - Adding New Types

Now that chef is out and about, people that accepted the massive improvement over all the existing host configuration managers that is Puppet will probably be casting a wary eye its way. I’ve got a little too much in puppet at my current position to look at moving for a while yet but now the competition is rising it’s time to get my boot in and point out what, for me, is the worst part of puppet: how difficult it is to add new types. Read on →

Mon, Jan 12, 2009

sysadmin

Penetration Testing in a Sentence

“Penetration testing is tactical. It provides tangible, actionable information” – Ivan Arce. It’s been a while since I’ve been involved in pen testing but the above quote from Ivan is perfect and its meaning all too often overlooked. When you invest the time in something like pen testing or performance tuning you should always come away with a list of actionable tasks. By doing this you ensure the work wasn’t pointless (or if it was avoid repeating the mistake) and have something you can present to stakeholders to get buy in for the next time. Read on →

Wed, Jan 7, 2009

sysadmin

Which Zones Have a Specified Subdomain? - DNS Delvings (1)

It’s been another day of many DNS changes and while the work itself has been amazingly dull, life draining, scut work at least one positive thing’s come out of it - my appreciation for the Net::DNS perl module has grown. While it’s possible to do nearly anything DNS query related with the dig command it’s a lot easier to extract the data and reuse certain fields if you have access to a decent data structure rather than grepping bits of text out. Read on →

2008

Tue, Aug 12, 2008

sysadmin

The Rules of Releases

You’ve gathered the requirements, written the code, debugged it, received the new requirements, rewritten the code, got more change requests, reached a ‘compromise’ with QA (and hidden the bodies) and now you want to have the sysadmins do the release. Don’t be like everyone else - when it comes to releases too many people fail at the last mile and make obvious mistakes. In an attempt to save myself some pain (and have something to point co-workers at) here are some of the software release principles that I hold dear. Read on →

Tue, Jul 8, 2008

sysadmin

More Memory Than Sense

My recent bugbear is - servers with inaccessible memory. You go and spec a nice new server with say 8GB of RAM (a little box), you install Debian, you start adding applications to the machine and then a couple of months later some anal sysadmin comes along, does a free -m and mutters about under-specced virtualization servers when he sees

             total       used       free     shared    buffers     cached
    Mem:      3287        225       3062          0         24        149

For those of you not paying attention - the machine isn’t using over half of its memory. Read on →

2007

Sat, Apr 21, 2007

sysadmin

Deferring Defects - Autonomics

“Autonomics refer to the ability of computer systems to be self-managing.” – autonomics.ca. Here’s one that has been bothering me. Suppose you have a recurring problem that your “autonomic solution” can handle every time it occurs without anyone knowing. At what point does the fact there is a treatable issue propagate up to a real person? While an automatic “fix and tell me later” approach helps change your work from fire fighting to planned tasks, what classifies a temporary problem as being important enough to warrant you investigating it? Read on →

Sat, Apr 21, 2007

sysadmin

Handling Requests: Three Simple Rules

I’m a sysadmin and half my working life seems to be spent handling other people’s requests (which is why I’m trying to move over to infrastructure work - where I can hopefully concentrate on something for three whole minutes). While chatting with a junior admin at a tech talk in the week the following three tips came up: Use a ticketing system. This one comes up a lot but it’s true, never dropping someone’s request is well worth the time spent setting it up. Read on →

Tue, Apr 17, 2007

sysadmin

No one likes a whinger - The systems fight back

After my little whine I logged in to do my last checks for the evening to discover that one of our webservers had died due to a hard drive going bang, our production environment Nagios box had lost one of its network connections and a chunk of our SAN kit was complaining about power issues. Turns out that most of these were due to a power surge that killed a network switch and three of the rack’s power strips. Read on →

Wed, Mar 28, 2007

sysadmin

Bonded | Teamed Network Interface Challenge

Here is another one for the sysadmins in the audience: How …
… many of your servers have multiple network ports in the back?
… many of them have bonding (teaming for the Windows people) enabled?
… do you know when one interface goes down if the machine stays connected?
… long does it take for you to be notified?
… do you know if they start flapping?
… many have their bonded interfaces plugged in to different switches?
Read on →

Mon, Mar 26, 2007

sysadmin

Monolithic Config Files Considered Harmful^WAwkward to Manage

This came up in conversation with a developer at the Google OpenSource Jam so I thought I’d mention it while it is fresh in my mind (update: at which point I forgot to move it to the published directory. Doh). Breaking up config files isn’t done just to annoy people, it’s done to make automated and mass management easier. A solid practical example is the Debian Apache configs. Historically most distros (and too many current ones) used a single config file for Apache. Read on →

Sat, Mar 10, 2007

sysadmin

Importance Levels - A Simple Example

When you’re first introduced to an environment you’ll have the ever fun task of working out which machines should get the most time; and that order seldom matches which machines actually need the most attention. To help me prioritise I’ve worked out a simple importance rating system to show where I spend my time. Below is a simplified version. I use it to assign a single importance number to each machine, and then I allocate a certain amount of time each day to work on the issues, requests and improvements I’ve got in my todo list for that level. Read on →

Sat, Mar 10, 2007

sysadmin

The Cron Commandments - part 1

Although it’s a rare Unix machine that doesn’t run at least a couple of custom cronjobs it’s an even more special snowflake that does them properly. Below are some of the more common problems I’ve seen and my thoughts on them. Always use a script, never a bare command line. A parenthesis wrapped command-line in a crontab sends shivers down my spine. Nothing says "I didn't really think this through" and " Read on →

Fri, Mar 9, 2007

sysadmin

Disk Delving - 2 Good Papers and a Blog

“The Google team found that 36% of the failed drives did not exhibit a single SMART-monitored failure. They concluded that SMART data is almost useless for predicting the failure of a single drive.” – StorageMojo - Google’s Disk Failure Experience There have been two excellent papers on disk drive failures released recently, the Dugg and Dotted Google paper - Failure trends in a large disk drive population (warning: PDF) and the also excellent but less hyped Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? Read on →

Fri, Mar 9, 2007

sysadmin

Sysadmin Challenge - Disk Usage

Here’s one for the sysadmins in the crowd; if you were asked to show the following how long would it take you to gather the information? Which of your file systems have the fastest growth rate? Which are the most under-utilised? Which haven't changed by more than 5% over the last month? If you use Nagios you can cheat and work out the full drive size from the free space and percentage used reported by the disk checks, but that’s… icky. Read on →
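The Nagios “cheat” mentioned above is simple arithmetic: if a check reports the free space and the percentage used, the full size is the free space divided by the fraction that is still free. A minimal Python sketch, where the function name and sample values are hypothetical and not real check output:

```python
def total_size_from_check(free_mb, used_pct):
    """Derive a file system's total size from its free space and percentage used.

    free = total * (1 - used_pct / 100), so total = free / (1 - used_pct / 100).
    """
    if not 0 <= used_pct < 100:
        # At 100% used there is no free space left to scale up from.
        raise ValueError("used_pct must be at least 0 and below 100")
    return free_mb / (1 - used_pct / 100.0)


# Hypothetical check result: 25000 MB free on a file system that is 75% used.
print(total_size_from_check(25_000, 75))  # 100000.0
```

Because disk checks usually round the percentage, the result is only an estimate, which is part of why the approach counts as icky compared with recording the real sizes over time.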

2005

Sat, Jan 15, 2005

sysadmin

The Hidden Curse of High Uptime

A number of Unix/Linux people seem to pride themselves on obtaining the highest uptime they can. While this may seem like a little harmless fun, in production environments (which are mostly fun-free places) it can hide a number of problems that will later become major issues. At some point the machine will have to come down and face a power off or reboot, and then it’s expected to come back up, and this is where the problems can start. Read on →

2004

Sun, Dec 5, 2004

sysadmin

Gigabit Ethernet? Bah! I need REAL speed!

Although it sounds pretty fast, when you actually start benchmarking it, Gigabit Ethernet isn’t quite as good a solution as you’d think. As more and more commercial deployments move to using SANs and NAS for online storage and backups it’s increasingly easy to saturate existing LANs. One possible solution as people start to look at 10 and 100Gbps networks is FireEngine (PDF), a set of architecture changes and improvements for Solaris 10. Read on →

Sat, Oct 9, 2004

sysadmin

WS-Management, an SNMP Replacement?

I hate to jump on any bandwagon that starts at Slashdot, although even a broken clock is right twice a day, but I find myself agreeing with a number of the Slashdot comments made about the new WS-Management spec. Firstly, and most importantly, SNMP is still the most widely used management protocol in production. Secondly it has survived the invention of a number of replacements, WBEM and CIM spring to mind as standards chosen to replace a lot of the functionality it provides; oddly enough those specs were also backed by Microsoft and Sun. Read on →

2003

Mon, Jun 23, 2003

sysadmin

Auditing the Three Finger Salute

“It’s only running a single service, we’re fully patched and it has a local firewall that denies by default.” “What happens if I do Ctrl-Alt-Delete?” Introduction: One of the basic premises of computer security is that it’s almost impossible to fully secure any machine to which an attacker has physical access. While we cannot cover all eventualities, we can make some simple changes to catch any use of the more blatant avenues of abuse. Read on →