1. Describe the root cause and what you failed at.
2. Blame the stuff you are using or other people (the clouds you use).
3. Just say nothing and try to forget what happened.
No need to blame it on anything, and you don't necessarily need to go into the fine details either.
You can simply make a statement that goes something like this:
"We've completed our investigation of the outage and found that it was caused by both technical and procedural errors in the way we deploy our code and monitor the environment.
We gathered all the information we required and have made improvements based on it that will help prevent these issues, and other issues with similar causes, from occurring.
While we do apologize for any inconvenience the outage may have caused, we want to stress that because of the lessons learned from it our service will grow to be more robust and reliable in the future."
That's it, simple even if generic. Having to read 3 pages of technical details isn't really helpful to anyone; if anything, the more "suspicious" people might see that as an attempt to mask the real cause of the issues.
But overall, when you go into specifics you also give people the ability to focus their frustration and disapproval on a specific subject, which is never good.
After reading this, what I "feel" at first glance is that the fault lies with the engineers who monitored the environment, so the engineers are incapable of performing their duties. Now I feel like the hiring and management processes in that company are not working well if they let "unqualified" engineers in.
This is how a minor outage blows up into a specific complaint or negative bias towards a company, and you can easily avoid it by giving enough "reassuring" information but not enough for anyone to actually sink their teeth into.
Overall, a generic positive statement is more likely to be accepted as "well, it sucks, but shit melts down sometimes and sometimes people make mistakes."
A more technical statement might be received as "well, why did you hire bob in the first place?" or "why the fuck are you using this_framework_i_dont_like?"
I think 1 is correct, but it's a question of how much resolution you give on the issue.
"Asana had an outage for 45 minutes yesterday. This was due to an issue with a deploy that was pushed the night prior. We apologize for the inconvenience and are undertaking a thorough review of processes to ensure that similar events don't occur in the future. Please be assured everything is back in working order now. Thank you for your patience and continued patronage."
Big detailed postmortems like this should remain internal documents unless they describe a complex or rare technical failure, news and/or discussion of which will actually benefit the larger community.
It's maybe a level-of-detail thing. "A bad deploy went unnoticed, causing a cascading failure. We identified how that happened and have new checks in place to prevent it in the future."
Two lines, with the same information someone not very technically literate would take away from the OP. I agree with being transparent, but I also believe in not unnecessarily scaring and/or confusing customers.
(Pretty soon they'll just start outing individual engineers...)
The fact that you think an engineer can be "outed" is a culture problem.
The process failed the engineer. Testing, deployment, and monitoring infrastructure were not up to the task of supporting human beings. That it happened to be triggered by engineer X instead of engineer Y is entirely coincidental.
The audience of the post mortem matters. When I see the two line summary, I have no idea whether that's a CYA whitewash, or a sincere part of a process of improvement. When I see the full PM, it builds more trust.
If you're not an engineer capable of understanding the details, it may have a different effect. And if you're part of a corporate culture of politics, shaming, and status chasing, it must feel totally alien.
I think you do need to at least acknowledge the problem, with a clear non-technical explanation of it in the first paragraph. The rest should go into real technical details of the result of the investigation, not the investigation itself.
Why is it confusing? I did not find it confusing. I like such excruciatingly detailed postmortem analyses; they make for great reading, and my respect for a company that does this increases when I read them.
1. Describe the root cause and what you failed at. 2. Blame the stuff you are using or other people (the clouds you use). 3. Just say nothing and try to forget what happened.
What do you think is best?