My company is in the same boat, so I'm definitely sympathetic, but I think the criticism is a bit harsh. First of all, transparency is easy for a startup because no one is listening (the majority of your prospective customers haven't heard of you yet) and you don't have that much to lose anyway. It also pays dividends because early adopters are the people who appreciate that transparency the most. When you're a big corporation the economics change, and lawyers and PR call the shots; it's not because they're disingenuous, it's just because they have a lot more to lose.
The updates from Amazon have been okay in my opinion. They could have been better, sure, but do we want the engineers working on this to stop every 20 minutes to write up obscure details that don't yet paint a cohesive picture? The bottom line is better updates won't help them fix the problem faster, and actually could distract from the resolution.
If they don't post any more details even after everything's back up and running, then there will be something to complain about.
> do we want the engineers working on this to stop every 20 minutes to write up obscure details that don't yet paint a cohesive picture?
In a company the size of Amazon, I'd think they could afford to hire at least one person who had all the knowledge of a programmer, but who did no coding of his/her own—whose sole purpose would instead be to watch over the shoulders of the people doing the programming at times like this, and push-notify management and/or PR with the progress and difficulties in an accessible fashion. A technical stenographer, or code-bard, if you will.
> In a company the size of Amazon, I'd think they could afford to hire at least one person who had all the knowledge of a programmer, but who did no coding of his/her own
You're assuming such a person could actually produce any answers. I think the big problem in situations like this is that internally they actually don't know for sure what is going wrong. You could probably talk to 5 different engineers and get 5 answers with viable theories. They have ideas, good ideas, but they aren't 100% sure; the answers are being determined by experiment. I think having a person running around communicating lots of possibly incorrect ideas about what is going on could easily make things worse.
One big issue, I think, is that Amazon's own status page appears to have been incorrect: it has had a lot of green ticks against things that people have reported as being completely down. There's going to need to be some follow-up on this sub-issue down the track, on top of everything else.
I really don't envy Amazon at all in this situation. In many ways they are doing things with AWS that nobody has ever done before, and they are probably hitting classes of problem that nobody ever anticipated. Of course, they promised more than they could deliver, so it's their fault in the end, but I still think the nature of what they are doing is underappreciated.
You'd be wrong. Even though it is a huge company, budgets are managed in really small groups, and no manager would "waste" a headcount when they only have a handful to begin with.
I've worked at a large FI where there is an entire team devoted solely to this task: "communicate with/between stakeholders and techies when something doesn't work." I have no idea what they do, or are qualified to do, the rest of the time. They are 10 deep with 3 middle managers and a senior, out of a 100ish department devoted to the FI's .com and online services front-ends. There is another team for customer support.
Although it is definitely overkill, and understandable in an organization that doesn't particularly value efficiency, it goes some way to illustrate the importance of these communication channels in mission-critical environments.
Such a person, though, if kept by anyone, would be budgeted under a high-ranking executive or "internal affairs" group that would want to know what's going on in order to optimize their requests, not by some group wishing to publish updates and to thereby be optimized. I'd imagine that this person would be constantly reassigned to whichever project was currently experiencing the most emergency-like conditions (if they weren't given free rein to find these projects themselves; hell, they're in the absolute best position to hear the word on the street about new emergencies).
I have, and you're right, it doesn't seem to help. But funnily enough, I also found that with so many people watching over my shoulders I was forced to think cleanly and logically about the process I was going through to diagnose the faults and bring the systems back online. Since I was explaining each step of the process to a group of shoulder-surfing engineers and managers, I managed to get a clarity which I may not have had if left to my own devices. It's like the "Inflatable Engineer" scenario, except I had several real ones available instead :-)
> It also pays dividends because early adopters are the people who appreciate that transparency the most. When you're a big corporation the economics change, and lawyers and PR call the shots; it's not because they're disingenuous, it's just because they have a lot more to lose.
Ultimately I think "legacy" PR based on control of the message is obsolete.
I wonder if everyone here is using the same definition of "working on"?
Perhaps if the engineers exchanged updates with each other every 20 minutes, the coherent picture would emerge more quickly?
Hopefully they aren't doing the same "jiggle it until it works" that is being discussed elsewhere on HN right now; so, what are they doing, precisely?
I know we don't know because they aren't saying, but what do people who work in teams on huge, important systems generally do when those systems fail? What should they be doing?
Since Amazon is large and established enough to take public relations and marketing seriously, it's reasonable to expect them to provide timely communication on a large outage like this one. I think they failed in this, and some people at Forrester Research seem to agree with me.