Lessons learned from two decades of Site Reliability Engineering

I eat words@group.lt · 1 year ago

I eat words@group.lt · 5 months ago

Reread today again, with some highlights:

Lessons Learned from Twenty Years of Site Reliability Engineering

Metadata

Author: sre.google
Category: article
URL: https://sre.google/resources/practices-and-processes/twenty-years-of-sre-lessons-learned/

Highlights

The riskiness of a mitigation should scale with the severity of the outage

We, here in SRE, have had some interesting experiences in choosing a mitigation with more risks than the outage it’s meant to resolve.

We learned the hard way that during an incident, we should monitor and evaluate the severity of the situation and choose a mitigation path whose riskiness is appropriate for that severity.
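
A minimal sketch of that idea, not anything from the article: the `Mitigation` class, the 1 to 5 risk scale, and the example mitigations are all hypothetical. It only filters out mitigations that are riskier than the outage is severe and lists the rest safest first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mitigation:
    name: str
    risk: int  # hypothetical scale: 1 = very safe, 5 = very risky


def allowed_mitigations(severity, options):
    """Return options whose risk does not exceed the outage severity, safest first."""
    safe_enough = [m for m in options if m.risk <= severity]
    return sorted(safe_enough, key=lambda m: m.risk)


# Hypothetical menu of mitigations for illustration only.
options = [
    Mitigation("drain the whole datacenter", risk=5),
    Mitigation("restart one replica", risk=1),
    Mitigation("roll back the last release", risk=2),
]

# For a severity-2 incident, the datacenter drain is ruled out.
for m in allowed_mitigations(severity=2, options=options):
    print(m.name)
```

In practice both numbers are judgment calls that change as the incident develops, so the comparison would be re-evaluated rather than computed once.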

Recovery mechanisms should be fully tested before an emergency

An emergency fire evacuation in a tall city building is a terrible opportunity to use a ladder for the first time.

Testing recovery mechanisms has a fun side effect of reducing the risk of performing some of these actions. Since this messy outage, we’ve doubled down on testing.

We were pretty sure that it would not lead to anything bad. But pretty sure is not 100% sure.

A “Big Red Button” is a unique but highly practical safety feature: it should kick off a simple, easy-to-trigger action that reverts whatever triggered the undesirable state to (ideally) shut down whatever’s happening.
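
One way to picture a Big Red Button in code, as a rough sketch only: the `BigRedButton` class and the registered actions below are made up for illustration and are not how Google implements it. The point is that pressing it is one call, and one failing action does not stop the others from running.

```python
class BigRedButton:
    """Hypothetical registry of simple stop/revert actions that run on one call."""

    def __init__(self):
        self._actions = []

    def register(self, name, action):
        # action is any zero-argument callable that stops or reverts something.
        self._actions.append((name, action))

    def press(self):
        # Run every action; record failures instead of aborting halfway.
        results = {}
        for name, action in self._actions:
            try:
                action()
                results[name] = "ok"
            except Exception as exc:
                results[name] = f"failed: {exc}"
        return results


# Illustrative state and actions (assumptions, not from the article).
state = {"rollout_running": True}


def halt_rollout():
    state["rollout_running"] = False


def broken_action():
    raise RuntimeError("boom")


button = BigRedButton()
button.register("halt rollout", halt_rollout)
button.register("broken action", broken_action)
results = button.press()
print(results)
```

Following the earlier highlight about testing recovery mechanisms, a button like this is only useful if pressing it is rehearsed before an emergency.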

Unit tests alone are not enough - integration testing is also needed

This lesson was learned during a Calendar outage in which our testing didn’t follow the same path as real use, resulting in plenty of testing… that didn’t help us assess how a change would perform in reality.

Teams were expecting to be able to use Google Hangouts and Google Meet to manage the incident. But when 350M users were logged out of their devices and services… relying on these Google services was, in retrospect, kind of a bad call.

It’s easy to think of availability as either “fully up” or “fully down” … but being able to offer a continuous minimum functionality with a degraded performance mode helps to offer a more consistent user experience.
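
A small sketch of a degraded mode, under my own assumptions: `fetch_with_fallback`, the dict-based cache, and the `"full"`/`"degraded"` labels are hypothetical names. When the live backend fails, the caller gets the last known value, clearly marked, instead of an error.

```python
def fetch_with_fallback(fetch_live, cache, key):
    """Return (value, mode): fresh data if possible, else the last cached value."""
    try:
        value = fetch_live(key)
        cache[key] = value  # remember the last good answer
        return value, "full"
    except Exception:
        if key in cache:
            # Backend is down: serve stale data rather than nothing.
            return cache[key], "degraded"
        raise  # nothing cached, so there is no minimum functionality to offer


# Illustrative backends (assumptions): one healthy, one failing.
def healthy_backend(key):
    return f"fresh:{key}"


def failing_backend(key):
    raise ConnectionError("backend unreachable")


cache = {}
first = fetch_with_fallback(healthy_backend, cache, "calendar")
second = fetch_with_fallback(failing_backend, cache, "calendar")
print(first, second)
```

Returning the mode alongside the value lets the UI tell users that what they see may be stale.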

This next lesson is a recommendation to ensure that your last-line-of-defense system works as expected in extreme scenarios, such as natural disasters or cyber attacks, that result in loss of productivity or service availability.

Another useful activity is sitting your team down and working through how some of these scenarios could theoretically play out, tabletop game style. This can also be a fun opportunity to explore those terrifying “What Ifs”, for example, “What if part of your network connectivity gets shut down unexpectedly?”

In such instances, you can reduce your mean time to resolution (MTTR) by automating mitigating measures that are currently done by hand. If there’s a clear signal that a particular failure is occurring, then why can’t that mitigation be kicked off in an automated way? Sometimes it is better to use an automated mitigation first and save the root-causing for after user impact has been avoided.
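
To make "clear signal, automated mitigation" concrete, here is a hedged sketch. The error-rate signal, the 0.5 threshold, and the three-consecutive-samples rule are assumptions I picked for illustration; the article does not prescribe any of them.

```python
def should_mitigate(error_rates, threshold=0.5, consecutive=3):
    """True if the last `consecutive` samples all exceed `threshold`."""
    recent = error_rates[-consecutive:]
    return len(recent) == consecutive and all(r > threshold for r in recent)


def auto_mitigate(error_rates, mitigate, threshold=0.5, consecutive=3):
    """Run the mitigation if the signal is clear; return whether it ran."""
    if should_mitigate(error_rates, threshold, consecutive):
        mitigate()
        return True
    return False


# Illustrative mitigation (assumption): just records that it was triggered.
actions_taken = []


def roll_back():
    actions_taken.append("rolled back last release")


# A single spike is not a clear signal, so nothing happens.
ran_on_spike = auto_mitigate([0.01, 0.9, 0.02], roll_back)
# Three bad samples in a row trigger the rollback.
ran_on_outage = auto_mitigate([0.01, 0.8, 0.9, 0.95], roll_back)
print(ran_on_spike, ran_on_outage, actions_taken)
```

Requiring several consecutive bad samples is one simple way to avoid firing on noise; root-causing still happens afterwards, once user impact has stopped.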

Having long delays between rollouts, especially in complex, multiple component systems, makes it extremely difficult to reason about the safety of a particular change. Frequent rollouts, with the proper testing in place, lead to fewer surprises from this class of failure.

Having only one particular model of device to perform a critical function can make for simpler operations and maintenance. However, it means that if that model turns out to have a problem, that critical function is no longer being performed.

Latent bugs in critical infrastructure can lurk undetected until a seemingly innocuous event triggers them. Maintaining a diverse infrastructure, while incurring costs of its own, can mean the difference between a troublesome outage and a total one.