> The JOB of the manager is to stay abreast of the issues his team are working on and provide REGULAR updates as to progress. Even if that means "We are still working on problem X, expected resolution is Y"
This is pretty much what Amazon have provided, that OP's link complained about. Management-level updates: "we're working on it", not a line by line log of technical hypotheses.
I would hope that in a critical situation they'd realise most customers want the service restored, not to know about why it's broken in exhaustive detail.
I agree - they could have definitely handled customer communication much better.
Until we get more information on the exact details of the EBS failure (I went looking; nothing yet), I would guess that some of the following is true:
We know that this was a "network error" that caused "EBS to begin to replicate".
What we don't know is whether this was the result of:
A bug in an unknown device layer that caused network problems
A bug in network gear that caused EBS device problems
A network routing problem due to a fat-fingered config change
An unforeseen design flaw brought to light by a spiffy new routing change
What does appear obvious is that the cause is either still unknown or not understood on Amazon's part, or it is so severe that we are not being given any information because:
They have all hands on deck and can't give good updates
They really don't know how to handle communications
They have some serious damage control to figure out (this could be an attack or an exploit on their system and they can't let word get out yet)
There are a lot of possible reasons for the poor communication from them - but my bet is just that they had some device/firmware build/configuration explode and they don't quite know how to fix it just yet - or their architecture was so dependent on the particular thing that failed that they have to figure out a really big/hard problem really fast.
edit: removed snark.