97 things every SRE should know - Part 01

A few people I follow on Twitter mentioned they’d contributed to 97 Things Every SRE Should Know. It’s a book full of short, 1-3 page chapters focused on topics dear to an SRE’s heart, so I had no choice but to buy it. In an attempt to be more deliberate with my reading, and with what I retain from the book, I’ve decided to create some reading notes for future me. This post is broken down into a section per chapter.

Chapter 42 - Why I Hate Our Playbooks

There are some great quotes:

  • Any playbook that can describe the exact steps to resolve an exact circumstance should be an automated script instead.

  • We escalate to humans for a complex response, not a fast response.

It also provides some good meta guidance on playbooks:

Ideally, a playbook should only contain:

  • Why do I care? Severity and qualification of the user-visible impact.
  • What can I look at? Consoles, logs, and inspection tools.
  • What can I do? Mitigation tooling.

As well as an example playbook workflow:

|-> 1 Identify issue
|   2 Debug
|   3 Add alerts
|   4 Write documentation
|   |-> 5 Automate resolution
|   |-< 6 Update documentation
|<- 7 Have a different problem

My own view on playbooks is that you get out what you put in, and that they are often a last-minute band-aid rather than a full part of the product. They should have a life cycle: a time to be useful and a time to die, or rather to be automated away. The trick is in knowing which stage a given playbook is at.

One way to discover this is by capturing usage and relevance information. A simple thumbs up / thumbs down on each page gets you started, but ideally you’d track a little more. Has the page ever been read? If so, how long ago? Was it actually used, or did it just get opened and immediately closed? When was it last reviewed? A “helped” / “didn’t help” checkbox and a comment box at the bottom of the page are enough to begin with.

It’s also important that giving feedback is as frictionless as possible. Don’t make people change tabs to a doc review system, for example; embed the form in the document itself. Anything that adds friction will stop people responding. I also think it’s better to gather some data than none, so don’t present a 4-page, 50-question survey.
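
As a rough sketch of what tracking these signals could look like, here’s a small Python data structure. Everything in it is my own assumption rather than something from the book: the field names, the 30-second “opened and immediately closed” threshold and the 180-day staleness window are all placeholders to tune.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class View:
    """One visit to a playbook page."""
    opened_at: datetime
    seconds_open: float


@dataclass
class PlaybookStats:
    """Usage and relevance signals for one playbook page (hypothetical schema)."""
    title: str
    last_reviewed: Optional[datetime] = None
    views: List[View] = field(default_factory=list)
    helped: int = 0
    did_not_help: int = 0
    comments: List[str] = field(default_factory=list)

    def record_view(self, opened_at: datetime, seconds_open: float) -> None:
        self.views.append(View(opened_at, seconds_open))

    def record_feedback(self, helped: bool, comment: str = "") -> None:
        # The "helped" / "didn't help" checkbox plus the optional comment box.
        if helped:
            self.helped += 1
        else:
            self.did_not_help += 1
        if comment:
            self.comments.append(comment)

    def needs_attention(
        self,
        now: datetime,
        stale_after: timedelta = timedelta(days=180),  # assumed threshold
        bounce_seconds: float = 30.0,  # assumed threshold
    ) -> List[str]:
        """Return the reasons, if any, this page deserves a human look."""
        reasons = []
        if not self.views:
            reasons.append("never read")
        else:
            last_read = max(v.opened_at for v in self.views)
            if now - last_read > stale_after:
                reasons.append("not read recently")
            if all(v.seconds_open < bounce_seconds for v in self.views):
                reasons.append("only ever opened and closed")
        if self.last_reviewed is None or now - self.last_reviewed > stale_after:
            reasons.append("review overdue")
        if self.did_not_help > self.helped:
            reasons.append("mostly unhelpful")
        return reasons
```

A page that comes back with “never read” or “mostly unhelpful” is a candidate for deletion or a rewrite, while one that is read constantly and always helps is probably a candidate for automation.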

I found this chapter to be a great example of how you can approach playbooks in a more consistent way.

Chapter 12 - The Importance of a Management Interface - Salim Virji

This chapter provides a high-level overview of why you need to keep the serving of user traffic separate from the control and administration traffic and data flows. In short, it’s so you can actually affect the system during events like a surge in client requests. Otherwise, you sit in the same queue as everyone else, waiting for the fix to be processed.

There are some great quotes:

  • During an outage, you care more about being able to control the system than about the system answering all user-facing requests.
  • To implement this separation for software that’s already built, such as third-party applications, you may need to add a separate service that, like a sidecar, attaches to the core software and, through a common interface such as an HTTP server, provides an endpoint for the administrative API.
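
To picture that separation, here’s a minimal sketch using only the Python standard library: one listener for user traffic and a second, separate listener for the administrative API. The `/shed/on` and `/shed/off` routes, the `shed_load` flag and the helper names are all hypothetical, and a real sidecar would run as its own process rather than sharing one with the service.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Shared state that the admin endpoint controls (a hypothetical knob).
state = {"shed_load": False}


class JsonHandler(BaseHTTPRequestHandler):
    def reply(self, code, payload):
        body = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


class UserHandler(JsonHandler):
    """Serves user traffic; knows nothing about admin routes."""

    def do_GET(self):
        if state["shed_load"]:
            self.reply(503, {"error": "shedding load"})
        else:
            self.reply(200, {"hello": "user"})


class AdminHandler(JsonHandler):
    """Control-plane routes, only reachable on the admin listener."""

    def do_POST(self):
        if self.path == "/shed/on":
            state["shed_load"] = True
        elif self.path == "/shed/off":
            state["shed_load"] = False
        else:
            self.reply(404, {"error": "unknown admin action"})
            return
        self.reply(200, dict(state))


def start(handler, port=0):
    """Start a listener on localhost in a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def http_status(url, method="GET"):
    """Tiny client helper for trying the sketch out; returns the HTTP status code."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # skip proxies
    data = b"" if method == "POST" else None
    request = urllib.request.Request(url, data=data, method=method)
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
```

Calling `start(UserHandler)` and `start(AdminHandler)` gives you two independent listeners, and a POST to `/shed/on` on the admin one flips the behaviour of the user one. In this toy version both still share a process, so it only shows the shape of the idea; the real benefit comes when the admin path has its own port, queue and resources that user load can’t starve.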

You often learn the merits of this separation approach when your system becomes popular, or as it’s known, “The worst possible moment.” It’s interesting to compare this isolation approach against systems like RDBMSs, which reserve connections for root but still share the same connection point with everyone else.
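
For contrast, here’s a rough sketch of the reserved-connection style: everyone arrives at the same front door, but a few slots are held back for admins. PostgreSQL’s `superuser_reserved_connections` setting is one real example of this approach; the class and parameter names below are my own invention and not any database’s API.

```python
import threading


class ConnectionLimiter:
    """Admission control with a few slots held back for admins (hypothetical)."""

    def __init__(self, max_connections, reserved_for_admin):
        if not 0 <= reserved_for_admin <= max_connections:
            raise ValueError("reserved_for_admin must be between 0 and max_connections")
        self._max = max_connections
        self._reserved = reserved_for_admin
        self._in_use = 0
        self._lock = threading.Lock()

    def try_acquire(self, is_admin=False):
        """Return True if a connection slot was granted."""
        with self._lock:
            # Ordinary users may only fill the unreserved part of the pool.
            limit = self._max if is_admin else self._max - self._reserved
            if self._in_use >= limit:
                return False
            self._in_use += 1
            return True

    def release(self):
        with self._lock:
            if self._in_use > 0:
                self._in_use -= 1
```

With `ConnectionLimiter(3, 1)`, the third ordinary user is turned away while an admin can still get in. The catch, compared with a separate management interface, is that the admin still has to reach the shared listener in the first place.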

If you want to know more about control planes, AWS re:Invent 2018: Close Loops & Opening Minds: How to Take Control of Systems, Big & Small (ARC337) by @colmmacc was a great watch.

Chapter 63 - Effecting SRE Cultural Changes in Enterprises - Vanessa Yiu

For a short piece, pretty much every paragraph had something I wanted to quote or highlight. If you’re introducing SRE to a company, I suspect this one is worth a quarterly re-read until the practices are established.

It dives right in with great advice if you’re just beginning your journey to introduce SRE:

“To avoid this, focus initially on the few most critical behaviours to adapt. In other words, find the key blockers to successful implementation of SRE at your workplace. If a shared responsibility model does not exist between developers and SRE, for instance, then perhaps start here, because that is foundational to getting SRE right.”

I still struggle with this, so it’s reassuring to know other people consider it essential.

Some of the quotables that most hit home with me are:

  • “Bringing on SRE means overcoming inertia and requiring a substantial investment of time to educate as well as continuous reinforcement of practices and behaviours.” As always it’s not just about the technology.
  • “It is important to identify where the gaps are and then build a clear roadmap to lay the required foundations first.”
  • “Providing transparency and identifying the correct incentives are critical to the success of any large-scale change program. People need to see and believe the value for changes to stick. Be thoughtful about which results matter and which indicators reflect successful changes in behaviour.”

I found myself nodding along to pretty much this entire chapter, as I’ve now lived through it twice at the same organisation, and this is a great summary of my experience too. Some of its points might seem clichéd, but there’s a reason they became clichés.

Chapter 79 - Why Training Matters to an SRE Practice and SRE Matters to Your Training Program - Jennifer Petoff

It’d be criminal of me not to bring one of the acronyms used in this chapter to the forefront of my notes - “So where should you begin? I have one acronym for you: ASSBAT”

A great mnemonic for an important concept. “ASSBAT stands for a student should be able to. ASSBATs are learning objectives focused on behaviours you want to drive and observe. Understand the $foo service is a bad ASSBAT”. By focusing on more user- and task-specific chunks, you can build competency and confidence in the things you value, and those things will actually be useful to engineers when they start working with the system. While having a deep understanding of your data flows is important, knowing which command runs the loadshedder and which dashboard to review is more immediately useful as a canned response at 3am.

Chapter 32 - Bootstrapping SRE in Enterprises - Vanessa Yiu

Enterprises are internally very different beasts to startups, and coercing them into changing needs a different approach. This chapter frames the subject well, and if I’d highlighted all the sections I nodded along to, the page would have been a magnificent rainbow.

Some excellent quotables include:

  • “Standalone systems are rare in enterprises - the service you operate will probably depend on a range of other upstream and downstream services, so your success is directly correlated with your ability to navigate, influence, and deliver across the organisation.”
  • “services with higher service level maturity likely have well-established operational practices in place already but perhaps more instability due to usage growth (i.e., capacity constraints) and increasing complexity over time compared to newer services.”
  • “Take the time to do research; interview product managers and on-call engineers to understand common challenges and leverage data sets (e.g., problem management database, incident postmortems) where available to confirm trends and remove recency bias.”
  • “At the outset, focus on solving a few key issues where SRE can demonstrate the most impact in the short term. These early successes build trust with the stakeholders mentioned earlier, demonstrate how SRE adds value to those unfamiliar with the discipline”

Vanessa also states “Check in regularly on progress, more frequently (e.g., weekly) at the start of any engagement until things reach steady state.” I’d like to add: if you’re being checked in on, assume good intent and that people care how things are going, rather than treating it as a sign of distrust. While you may be responsible for the work getting done, the stakeholders accountable for it will need a little time to build trust before they can increase your scope of autonomy. Upfront communication, building rapport, and showing successful delivery pay off over time for both sides.