SLO Adoption and Usage in SRE - Reading Notes - UnixDaemon: In search of (a) life

I recently read SLO Adoption and Usage in SRE, a free book of two halves. The first provides a brief introduction to SLIs, SLOs and Error Budgets that could be given to an impatient but interested co-worker. The second part is an analysis of the responses from the ‘SLO Adoption and Usage in SRE’ survey. If you like the DORA State of DevOps Reports you’ll also enjoy this.

Summary

“SRE is an emerging IT Service Management framework” and should be treated in the same way as ITIL, distrusted but pillaged for the good bits.

“Nearly 54% of the respondents do not currently use SLOs, but half of those respondents plan to do so at some point” - This is a good indication of how different companies see the role and responsibility of Site Reliability Engineering. Even the self-selecting crowd that responded to this survey are not all in.

“Of the 46% of companies that use SLOs, 40% have had them in place for one year or less.” This combination of immaturity and enthusiasm probably explains why every time $work mentions we’re adopting SLIs / SLOs we’re inundated with questions. Everyone’s in the new boat together.

There’s no mention of anyone wrapping run cost behind an SLO. I think it’d be interesting to have it as an additional tracked dimension.

The Calculus of Service Availability is linked and is a good article.

Chapter 1

This chapter provides a good introduction to the language and an overview of why you should be interested in SLOs in a handful of pages.

“SLIs tell you good events from bad events. SLOs tell you what proportion of good/bad events is acceptable”
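To make that split concrete, here is a minimal sketch of the idea: the SLI is a ratio of good events to valid events, and the SLO is the target that ratio is compared against. The function names and the numbers are my own illustrations, not anything taken from the book.

```python
def sli(good_events: int, valid_events: int) -> float:
    """Return the fraction of valid events that were good."""
    if valid_events == 0:
        # Assumption: with no traffic, nothing was bad, so report a perfect SLI.
        return 1.0
    return good_events / valid_events


def meets_slo(good_events: int, valid_events: int, target: float) -> bool:
    """Compare the measured SLI against the SLO target (e.g. 0.999)."""
    return sli(good_events, valid_events) >= target


if __name__ == "__main__":
    # Hypothetical counts for one measurement window.
    print(sli(99_950, 100_000))
    print(meets_slo(99_950, 100_000, target=0.999))
```

The SLI function knows nothing about the target, which mirrors the quote: one piece classifies and counts events, the other decides how many bad ones are acceptable.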

Your SLOs should be more restrictive than your SLAs to provide a warning buffer in which you can take action.

“SLOs establish thresholds for acceptable levels of reliability that SREs must protect and maintain—this is their primary responsibility and drives their priorities and daily tasks. Defending SLOs is also a core competency of Site Reliability Engineers.” to me this is why SRE is an optimisation of the operations role and may not be suitable for every organisation. Someone still has to do the other stuff.

Chapter 2 - Summary of the data

The bulk of this chapter explains the data and how the responses are broken down. If you want to compare your own practices against the respondents this is the chapter for you.

“We surveyed a cohort of professionals with an interest in development and operations topics. We received responses from 572 industry professionals around the world.” I wonder how many of those were from Google itself.

“The majority of the survey respondents work in North America (42%) and Europe (35%).”

Relevant to me is that about 7% of the respondents work in Government. As a larger, older, set of organisations they probably already have something in this space even if it’s older and not as optimised for modern practices.

People with the title of “Site Reliability Engineers” represented 7% of the respondents.

Larger companies are more likely to have SREs as they can afford to hire them.

SRE practices are not uniformly adopted. Only 40% of respondents do blameless postmortems and just 34% develop SLOs (page 18). If you’re looking to change from being an SRE in one company to another you should prepare some focused questions for the “Do you have any questions for us?” part of the interview.

Despite the common mantra that SLOs should be devised with management involvement, “70% selected a combination of options that did not include ‘Chosen by management.’”

“90% of respondents take some action when they miss SLO targets.” I assume the other 10% live in blissful ignorance. (page 26)

12% of people review their SLOs after exceeding their error budget. I’d love to know how many of those reviews result in lowering the SLO. (graph page 27)

Durability is the least used factor in SLOs. I assume this is because it’s really hard to do well.

Chapter 3

If you find yourself getting bogged down in defining SLOs, the book advises asking three questions:

Who are my users?
What do they want/expect from the system?
What level of reliability will they be happy with?

Building good SLOs - CRE life lessons is linked to in the footnotes.

It’s fine to start your SLOs with numbers that represent the currently running state, but you should review them to ensure you understand the why behind those numbers and that they are actually bringing you some value.

When working with latency SLOs you will often have multiple targets, one to capture the bulk of the requests and one for the long tail. (page 38)
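A rough sketch of what a two-target latency SLO could look like in code. The thresholds (200 ms and 1000 ms) and the required fractions (90% and 99%) are hypothetical values I picked for illustration; the book does not prescribe them.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical targets: (latency threshold in ms, required fraction of requests).
# The first covers the bulk of requests, the second the long tail.
LATENCY_TARGETS: List[Tuple[float, float]] = [(200, 0.90), (1000, 0.99)]


def fraction_within(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests that completed within the threshold."""
    if not latencies_ms:
        return 1.0  # Assumption: no requests means nothing was slow.
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)


def check_latency_slo(
    latencies_ms: Sequence[float],
    targets: Sequence[Tuple[float, float]] = LATENCY_TARGETS,
) -> Dict[float, bool]:
    """Return, per threshold, whether the required fraction was met."""
    return {
        threshold: fraction_within(latencies_ms, threshold) >= required
        for threshold, required in targets
    }


if __name__ == "__main__":
    sample = [100] * 90 + [500] * 9 + [5000]
    print(check_latency_slo(sample))
```

Checking each target separately matters because a service can look healthy for the bulk of requests while the long tail quietly degrades, or the other way round.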

Chapter 4

“For example, if your website loads slowly for users, they do not care whether your database went down or your load balancer sent requests to bad backends.” I see this a lot in literature about SLOs but no one discusses where you should be surfacing those lower-level metrics in conjunction with this approach. Something cares about those metrics. Do you just gather them and make it the operator’s problem when an issue comes up?

The SLI Specifications section on pages 47-48 is useful when writing your first SLIs. It will help focus you on what kind of information you need to consider.

“Which of the requests the system serves are valid for the SLI?” is a very overlooked aspect. You probably don’t want your health check requests being considered in the same SLI as customer latency. (p49)
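As an illustration of that validity question, here is a small sketch that drops health check traffic before computing an availability SLI. The `Request` shape, the excluded paths and the "status below 500 counts as good" rule are all assumptions of mine, not definitions from the book.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Request:
    """Hypothetical, minimal view of one request log entry."""
    path: str
    status: int
    latency_ms: float


# Hypothetical paths that should not count towards a customer-facing SLI.
EXCLUDED_PATHS = {"/healthz", "/metrics"}


def is_valid(request: Request) -> bool:
    """A request is valid for the SLI if it is not internal/probe traffic."""
    return request.path not in EXCLUDED_PATHS


def availability_sli(requests: Iterable[Request]) -> Optional[float]:
    """Good valid requests divided by all valid requests, or None if no valid traffic."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None
    # Assumption: any response without a 5xx status counts as a good event.
    good = sum(1 for r in valid if r.status < 500)
    return good / len(valid)


if __name__ == "__main__":
    log = (
        [Request("/healthz", 200, 1.0)] * 50
        + [Request("/checkout", 200, 120.0)] * 9
        + [Request("/checkout", 503, 30.0)]
    )
    print(availability_sli(log))
```

Without the filter, the fifty cheap and always-successful health checks in that sample would push the ratio far above what customers actually experienced.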

If you’re bootstrapping SLIs, request logs are a good starting point - “Request logs are also well-suited for organizations that are establishing SLOs for the first time. You can often process request logs retroactively to backfill SLI data. Use the historical performance data to determine a baseline level of performance from which you can derive an SLI.”

“Rolling windows better align with the experience of your users. If implementing a rolling window, define the period as an integral number of weeks so that it always contains the same number of weekends.” (page 55)
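The whole-weeks advice is easy to check with a few lines of date arithmetic. This is only a sketch; the function names and the default of four weeks are my own choices.

```python
from datetime import date, timedelta
from typing import Tuple


def rolling_window(end: date, weeks: int = 4) -> Tuple[date, date]:
    """Return the inclusive (start, end) dates of a window that is a whole number of weeks."""
    if weeks < 1:
        raise ValueError("weeks must be at least 1")
    start = end - timedelta(weeks=weeks) + timedelta(days=1)
    return start, end


def weekend_days(start: date, end: date) -> int:
    """Count Saturdays and Sundays in the inclusive date range."""
    total_days = (end - start).days + 1
    return sum(
        1
        for offset in range(total_days)
        if (start + timedelta(days=offset)).weekday() >= 5
    )


if __name__ == "__main__":
    # Slide a 4-week window forward one day at a time; the count never changes.
    first_end = date(2024, 1, 7)
    for offset in range(7):
        start, end = rolling_window(first_end + timedelta(days=offset))
        print(start, end, weekend_days(start, end))
```

A 28-day window always contains exactly eight weekend days wherever it ends, whereas a 30-day window contains between eight and ten depending on the day it ends on, so traffic patterns shift the SLI even when nothing about the service has changed.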

“The SLO Adoption and Usage Survey finds that the majority (54%) of respondents with SLOs do not regularly reevaluate them and that 9% never review them.” - how heavily can you be relying on something you never review?

Chapter 5 - Error Budgets

This is the part of the book I am most sceptical about. The chapter is very well written but it’s such a cultural change I’m not as bought in to this aspect of the process.

A worked example - “if your SLO says that 99.9% of requests should be successful in a given quarter, your error budget is 0.1%. If you look at this in terms of time, this equates to 43 minutes of downtime per month.” - how quickly does your monitoring flag an issue? What does your on-call policy say your response time to a call should be? Could you ever have more than one instance of downtime a month if you had these numbers and your current systems?
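The arithmetic behind that quote is worth having to hand. The sketch below assumes a 30-day month, which is where the roughly 43 minutes comes from (0.1% of 30 days is 43.2 minutes); the 20-minute incident length is a number I made up to stand in for detection, response and fix time.

```python
from datetime import timedelta


def error_budget(slo_target: float) -> float:
    """The error budget is whatever the SLO target leaves over, e.g. 0.999 -> 0.001."""
    return 1.0 - slo_target


def allowed_downtime(slo_target: float, window: timedelta) -> timedelta:
    """Translate the error budget into full-outage time over the window."""
    return window * error_budget(slo_target)


def incidents_affordable(
    slo_target: float, window: timedelta, incident_duration: timedelta
) -> int:
    """How many full outages of the given length fit inside the budget."""
    return allowed_downtime(slo_target, window) // incident_duration


if __name__ == "__main__":
    month = timedelta(days=30)  # Assumption: a 30-day month.
    print(allowed_downtime(0.999, month))
    # Hypothetical incident: 20 minutes from alert firing to recovery.
    print(incidents_affordable(0.999, month, timedelta(minutes=20)))
```

Plugging in your own alerting delay and on-call response time for the incident duration gives a quick answer to the questions above about how many outages a month you can actually afford.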

This should be in 32 point bold and in red: “Error budgets align product teams and SREs on incentives, which removes tensions between teams and makes it easier to decide the rate of releases versus production risk. However, you will realize this alignment on incentives only if all teams and leadership support an error budget culture.” (page 62)

You should have an error budget policy: “Specifically outline the actions that should be taken if the error budget is over spent, and provide a clear escalation path if there is a disagreement about the calculation or about whether to take the agreed upon actions.” - I thought it meant move all effort to resilience #snark

Chapter 6 - Case Studies

The first case study has a good section on what to do when the error budget is breached. It’s a big quote but worth reading - “When services meet their SLOs, they continue to release in two week sprints. When an SLI indicates that the SLO has not been met and the service exhausts or comes close to exhausting the error budget, the team implements a feature freeze for 30 days. In those 30 days, the solutions team has specific policies that delineate whether the feature freeze applies to the entire service or just to a particular component. Solutions teams must dedicate resources to determine the cause of the issue and decide how to improve reliability. Only reliability fixes are promoted to production and turned out in those 30 days. The 30-day feature freeze policy allows the team to address additional reliability issues from the backlog to avoid future freezes. This backlog can include fixes to software, increased telemetry (SLIs) to improve serviceability, or improvements to automation to reduce recovery time.” (page 70)

It’s a little bit of a downer note to end on but “The SLO Adoption and Usage Survey finds that, although respondents have adopted many SRE best practices, implementing SLOs and SLIs are the SRE practices they are least likely to adopt.” is an important thing to note. You don’t have to go all in to SRE to see some of the benefits and you should evaluate each practice to see which ones make the most sense to adopt at each point.

Closing notes

This book has two audiences - for the impatient that want to understand the basics of SLOs / Error Budgets it provides a very clear introduction to the practice, some of the mechanics and the benefits it can bring. Other parts of the book are more interesting to people already involved in SRE and provide some rare insights into what other places are doing with the role. For such a short, and free, book I found it well worth the reading time.