I've noticed an interesting phenomenon in postmortem review meetings: whenever someone suggests an additional manual step or bureaucratic approval that might have helped, the suggestion is always accepted. When it's patently ridiculous, the action item will just languish in a queue until it's purged in the next ticketing system migration. But in the moment, the whole room's sentiment is "yes, of course, obviously we should have been doing that, how irresponsible of us." No one wants to talk about costs, proportionality, error budgets, etc. I've stopped bringing it up for fear of seeming reckless. But it's not like we're the space program.
Anyone seen this in their postmortem culture? How did you deal with it?
There is a quote from Jeff Bezos: "Good intentions never work, you need good mechanisms to make anything happen." Hence at Amazon, ideally, any suggestion that a manual step or bureaucratic approval could improve a process is naturally followed up with "But what happens when good intentions fail? We need a plan for a mechanism." At least that is the theory, and it is how Amazon and AWS avoid accruing manual steps with no plan to automate them.
I think that can be OK. If a slow bureaucratic process is created, some metric of how fast the team can deliver will take a hit. As the team looks for ways to keep cycle times short, they will automate the step away.
Or, when the manual step itself fails, people realize that an automated test is better, and the next postmortem action item is to automate it.
But you can raise those concerns. They are valid, and in many cases I think they're worth the discussion.
No one likes red tape, and I liken the allure of “being more professional in all our operations” to adding green tape to the system. It turns out that if you add green tape every time, much of it turns into red tape. Asking the question at all is shorthand for “is the benefit of this newly proposed tape greater than the cost?”
On a related topic: I think few companies should design for greater than 99.95% availability. Giving yourself that generous an error budget (sounds irresponsible, doesn’t it?) gives you a concrete target to measure against that isn’t 100.000%. If you’re consistently hitting 99.97%, you can push back against some of the stickier pieces of new tape.
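For anyone who hasn't run the numbers, the arithmetic is worth seeing once. A minimal sketch in Python, using the 99.95% and 99.97% figures above (the 30-day budgeting window is my assumption, not something from this thread):

    # Error-budget arithmetic for a 99.95% availability target.
    # The 30-day window is an assumed convention, not from the thread.
    SLO = 0.9995                  # availability target (99.95%)
    WINDOW_MIN = 30 * 24 * 60     # minutes in a 30-day window

    # Error budget: downtime you can "spend" before breaching the SLO.
    budget = (1 - SLO) * WINDOW_MIN
    print(f"Budget: {budget:.1f} min / 30 days")   # ~21.6 min

    actual = 0.9997               # availability actually achieved
    used = (1 - actual) * WINDOW_MIN
    print(f"Used:   {used:.1f} min")               # ~13.0 min
    print(f"Left:   {budget - used:.1f} min")      # ~8.6 min

That leftover ~8.6 minutes per month is the concrete slack you can point to when arguing that a new manual gate isn't buying you anything.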
Read John Allspaw’s doc (the first reference in the article).
It’s absolutely critical to create a culture where, when Kirin makes an error, they have no fear and every incentive to explain what they did, when they did it, why they thought it was the right course of action, what happened when they did it, etc.