As a reductio ad absurdum, suppose that Amazon was secretive beyond all dreams of a CIA director, but never ever failed. Would people care about their communications strategy?
Likewise, if they could telepathically beam status updates during a pattern of approximately half-hourly failures, would they be feted or avoided?
The updates probably have a lawyerly feel because when the engineers are asked for a status update by management, they get a "piss off, we're working on it, don't bug me". Indeed a smart manager would refrain from indulging the instinct to constantly interrupt work to receive status updates. Perhaps Amazon has such managers.
> As a reductio ad absurdum, suppose that Amazon was secretive beyond all dreams of a CIA director, but never ever failed. Would people care about their communications strategy?
This is not the same. No one cares about active communication when things go as expected. However, failures and unexpected outcomes are supposed to be communicated properly.
> The updates probably have a lawyerly feel because when the engineers are asked for a status update by management, they get a "piss off, we're working on it, don't bug me".
The relationship between a boss and an employee is different from that between a customer and a company. Moreover, a boss asking for an update is not analogous to a company providing one.
When I (as an employee, and like most of my coworkers) work on a bug fix that makes things inconvenient for others (including my bosses), I try to actively communicate the status of the issue and the progress of my efforts, even if they don't ask for it, because I believe this is the professional thing to do.
OP's complaint is that he's not getting a ringside seat. There's a difference between asking for minute technical updates (where I made the analogy with interfering management) and getting general, high-level status reports.
I'm sorry but have you ever worked in a large enterprise IT organization?
The JOB of the manager is to stay abreast of the issues his team are working on and provide REGULAR updates as to progress. Even if that means "We are still working on problem X, expected resolution is Y"
I managed IT for an entire division in Lockheed. You think that a company continually facing the onslaught of Chinese hackers employs "smart managers that refrain from indulging ... to receive status updates"?
WTF are you thinking?
The error in logic on your part is that, sure, it may not be best to provide minute-by-minute updates to your customers - but be it known, you MUST provide accurate and regular status internally when you have one of the MOST VISIBLE SERVICES on the freaking Internet.
So, that would tell us that there was likely either internal chaos, a group with no idea what the root cause was OR a failure of the externally facing IT communications channels -- but to say that it is good management to stay effectively out of the loop is false.
Now, I'll give you some credit in your statement: You did say "constantly interrupt" -- but if your interactions with your team are such that your inclusion in situational awareness is equivalent to an interruption, then there is something much more wrong with the structure of your team/organisation/ability to manage.
> The JOB of the manager is to stay abreast of the issues his team are working on and provide REGULAR updates as to progress. Even if that means "We are still working on problem X, expected resolution is Y"
This is pretty much what Amazon has provided, and what OP's link complained about. Management-level updates: "we're working on it", not a line-by-line log of technical hypotheses.
I would hope that in a critical situation they'd realise most customers want the service restored, not to know about why it's broken in exhaustive detail.
I agree - they could have definitely handled customer communication much better.
Until we get more information on the exact details of the EBS failure (it may be available, let me go look, nothing yet.), I would guess that some of the following is true:
We know that this was a "network error" that caused "EBS to begin to replicate"
What we don't know is whether this was the result of:
Bug in unknown device layer that caused network problems
Bug in Network gear that caused EBS device problems
Network routing problem due to fat-fingered config change
Unforeseen design flaw brought to light via new spiffy routing change.
What does appear obvious is that it is either still unknown/not understood on the part of Amazon -- or it is so severe that we are not being given any information because:
Either they have all hands on deck and can't give good updates
They really don't know how to handle communications
They have some serious damage control to figure out (this could be an attack or an exploit on their system and they can't let word get out yet)
There are a lot of possible reasons for the poor communication from them - but my bet is just that they had some device/firmware build/configuration explode and they don't quite know how to fix it just yet - or their architecture was so dependent on the particular thing that failed that they have to figure out a really big/hard problem really fast.