"Even if you had the state of the art microservice architecture running on a kub...

computerex · on Feb 16, 2017

You can plan all you'd like; failures happen not necessarily due to poor planning but because in real life, shit happens. Pokemon Go, for instance, experienced something like 50x the traffic they planned for.

Secondly, software companies like Microsoft, Google and IBM might know a thing or two about running data centers. Due to economies of scale, these companies are inherently in a better position to supply hardware at scale.

> If entire region is down do you think other regions can handle the load. If you think so you're kidding yourself

Netflix routinely does just this to test the resilience of their systems. They pick a random AWS region and evacuate it. All traffic is proxied to the other regions, and eventually DNS routes it entirely to the surviving regions. Users experience no interruption of service.

Here's a visualization of Netflix simulating a failure of the us-east-1 region and failing over to us-west-1/us-west-2:

https://www.youtube.com/watch?v=KVbTjlZ0sfE

The top right node is the one that fails. As the error rate climbs, traffic starts getting proxied over to the surviving nodes, until a DNS switch redirects all traffic to the surviving nodes. Netflix does this monthly, in production. They also run https://github.com/Netflix/SimianArmy in production.
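To make the two phases concrete, here's a rough toy model of the sequence above: traffic is first proxied away in step with the error rate, then a DNS switch moves everything. The function name `route_traffic`, the request rates, and the proportional split across survivors are all my own assumptions for illustration, not Netflix's actual mechanism.

```python
def route_traffic(load, failed, error_rate, dns_switched):
    """Return per-region load after shifting traffic away from `failed`.

    load: dict of region -> requests/sec under normal routing (made-up numbers).
    error_rate: fraction (0..1) of the failing region's requests that get
        proxied to other regions before the DNS switch.
    dns_switched: once True, no traffic reaches the failing region at all.
    """
    moved_fraction = 1.0 if dns_switched else error_rate
    moved = load[failed] * moved_fraction

    survivors = [region for region in load if region != failed]
    survivor_total = sum(load[region] for region in survivors)

    result = {failed: load[failed] - moved}
    for region in survivors:
        # Assumption: moved traffic is split in proportion to each
        # survivor's normal load.
        result[region] = load[region] + moved * load[region] / survivor_total
    return result


# Hypothetical steady-state load, in requests/sec.
normal = {"us-east-1": 100.0, "us-west-1": 50.0, "us-west-2": 50.0}

# Phase 1: half of us-east-1's requests are failing and get proxied away.
during_proxying = route_traffic(normal, "us-east-1", 0.5, dns_switched=False)

# Phase 2: DNS now points everyone at the surviving regions.
after_dns_switch = route_traffic(normal, "us-east-1", 0.0, dns_switched=True)
```

The total load is conserved in both phases; it just lands on fewer regions, which is exactly where the capacity question comes in.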

The cloud enables fault tolerance, resiliency and graceful degradation.

hueving · on Feb 17, 2017

I think you missed the point: Netflix evacuating a region is not the same thing as that region failing. If the whole region goes down, their (AWS's) total capacity just took a major hit, and unless they have obscenely over-provisioned (they haven't), shit is going to hit the fan when people start spinning up stuff in the remaining regions to make up for the loss.
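A back-of-the-envelope version of that capacity argument, with entirely made-up numbers (I have no idea what AWS's real capacity or utilization figures are, and `utilization_after_loss` is just a name I picked):

```python
def utilization_after_loss(capacity, demand, lost):
    """Utilization of the surviving regions if every customer of the
    `lost` region tries to move its demand onto them.

    capacity, demand: dicts of region -> arbitrary units (hypothetical).
    A result above 1.0 means the survivors cannot absorb the load.
    """
    surviving_capacity = sum(c for region, c in capacity.items() if region != lost)
    total_demand = sum(demand.values())
    return total_demand / surviving_capacity


# Three equally sized regions, each running at 70% utilization.
capacity = {"region-a": 100, "region-b": 100, "region-c": 100}
demand = {"region-a": 70, "region-b": 70, "region-c": 70}

overloaded = utilization_after_loss(capacity, demand, "region-a")

# The same regions at 60% utilization would just about cope.
lighter = {region: 60 for region in capacity}
coping = utilization_after_loss(capacity, lighter, "region-a")
```

With three equal regions at 70% each, losing one pushes the other two to 210/200, i.e. over capacity. In this toy setup the provider would have to keep every region below about 67% to survive a full regional loss.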

hueving · on Feb 17, 2017

> The cloud enables fault tolerance, resiliency and graceful degradation

No, tooling to fail over and spin up new instances does that. An enterprise with 3 data centers can do that.

"The cloud" is just doing it on someone else's hardware.