> If there's one small flaw in the system, it seems that the whole thing begins to fail.
I wonder if there is selection bias underlying that judgment? Reading all of Amazon's post-mortems of big events, it does seem that the whole thing is fragile. What I suspect is more likely true is that AWS suffers thousands of small failures every month and most are contained as designed, with no (or minuscule) customer impact. It's the ones that turn into highly visible failures that we read about.
That said, I agree with you that EBS in particular seems to have more downtime than I'd expect. (And the fact that other services like ELB depend on it makes failures cascade in a way that makes it hard to design highly available systems.)
> What I suspect is more likely true is that AWS suffers thousands of small failures every month and most are contained as designed, with no (or minuscule) customer impact.
Isn't that the whole point of moving to The Cloud? There's supposed to be some magical system in place such that hardware failures are routed around and don't interrupt service. Of course you can roll this yourself with your own hardware, but this is done for you.
It should be no surprise that a system complicated enough to appear magical has some crazy complexity behind the scenes, and that accidental dependencies can result in catastrophic failure.