So in all seriousness, how do folks deal with this?
In this case, it ended up being a multi-region failure, so your only real solution is to spread it across providers, not just regions.
But I imagine it's a similar issue to scaling across regions, even within a provider. We can spin up machines in each region to provide fault tolerance, but we're at the mercy of our Postgres database. What do others do?
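(To make the Postgres part concrete: assuming a streaming-replication standby in a second region, a client built against libpq 10+ can at least fail over to whichever node currently accepts writes. A minimal sketch with made-up hostnames; it says nothing about how the standby actually gets promoted.)

```python
# Minimal client-side sketch, not a full HA setup: libpq 10+ accepts a
# comma-separated host list and connects to the node that currently takes
# writes, which helps once a standby in another region has been promoted.
import psycopg2

conn = psycopg2.connect(
    host="pg-us-east.example.internal,pg-eu-west.example.internal",  # hypothetical hosts
    port="5432,5432",
    dbname="app",
    user="app",
    password="secret",
    target_session_attrs="read-write",  # only accept a node that accepts writes
    connect_timeout=5,                  # give up quickly on an unreachable region
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery()")
    print(cur.fetchone())  # (False,) on the writable primary
```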
Most people just deal with it and accept that their site will go down for 20 minutes every 3-4 years or so, even when hosting on a major cloud, because:
1) the cost of mitigating that risk is much higher than the cost of just eating the outage, and
2) their high traffic production site is routinely down for that long anyway, for unrelated reasons.
If you really, really can't bear the business costs of an entire provider ever going down, even that rarely (e.g. you're doing life support, military systems, big finance), then you just pay a lot of money to rework your entire system into a fully redundant infrastructure that runs on multiple providers simultaneously.
There really aren't any other options besides these two.
I will add that if you can afford the time and effort, it's worth designing your system from the beginning to work on multiple providers without much friction. That means trying as hard as you can to use as few provider-specific services as possible (RDS, DynamoDB, SQS, BigTable, etc.). In most cases, pjlegato's 1) will still apply.
But you get a massive side-benefit (main benefit, I think) in cost. There are huge bidding wars between providers and if you're a startup and know how to play them off each other, you could even get away with not having to pay hosting costs for years. GC, AWS, Azure, Rackspace, Aliyun, etc, etc are all fighting for your business. If you've done the work to be provider-agnostic, you could switch between them with much less effort and reap the savings.
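To make the "use as few provider-specific things as possible" point concrete, here's a rough sketch of hiding a message queue behind your own interface so the AWS/GCP SDK calls stay in one module; the queue URL, project and topic names are placeholders:

```python
# Sketch of provider-agnostic design: application code only sees MessageQueue,
# so swapping SQS for Pub/Sub (or a self-hosted broker) is a config change,
# not a rewrite.
from abc import ABC, abstractmethod

class MessageQueue(ABC):
    @abstractmethod
    def publish(self, body: str) -> None: ...

class SqsQueue(MessageQueue):
    def __init__(self, queue_url: str):
        import boto3                      # AWS-specific bits stay in here
        self._sqs = boto3.client("sqs")
        self._queue_url = queue_url

    def publish(self, body: str) -> None:
        self._sqs.send_message(QueueUrl=self._queue_url, MessageBody=body)

class PubSubQueue(MessageQueue):
    def __init__(self, project: str, topic: str):
        from google.cloud import pubsub_v1  # GCP-specific bits stay in here
        self._publisher = pubsub_v1.PublisherClient()
        self._topic = self._publisher.topic_path(project, topic)

    def publish(self, body: str) -> None:
        self._publisher.publish(self._topic, data=body.encode("utf-8")).result()

def make_queue(provider: str) -> MessageQueue:
    # Placeholder identifiers, purely illustrative.
    if provider == "aws":
        return SqsQueue("https://sqs.us-east-1.amazonaws.com/123456789012/orders")
    return PubSubQueue("my-project", "orders")
```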
If you are doing life support, military systems, or HA big finance, then you are quite likely to be running on dedicated equipment, with dedicated circuits, and quite often highly customized/configured non-stop hardware/operating systems.
You are unlikely to be running such systems on AWS or GCE.
Yep. Those were originally Itanium-only, so their success was somewhat… limited, compared to IBM's "we're backwards compatible to punch cards" mainframes.
Only recently did Intel start porting mission-critical features like CPU hot-swap over to Xeons so they can finally let the Itanic die; hopefully we'll start seeing more x86 systems with mainframe-like capabilities.
Hosting on anything/anywhere, really. Even if one builds clusters with true 100% reliability, running on nuclear power and buried 100 feet underground, you still have to talk to the rest of the world through a network that can fall apart for a variety of reasons. If most of your users are on their mobile phones, they might not even notice your outages.
At some point adding an extra 9 to the service availability can no longer be justified for the associated cost.
Depends entirely on your business, but what I do is just tolerate the occasional 15-minute outage. There's increasing cost to getting more 9's on your uptime, and for me, engineering a system that has better uptime than Google Cloud does, by doing multi-cloud failover, is way out of favorable cost/benefit territory.
It is impossible to ensure 100% uptime, and it gets increasingly hard to even approach it as you put more separate entities between yourself and the client. The thing is, you'll be blamed for problems that aren't in your control and aren't really related to your service but to the customer's side: local connectivity, phone data service, misbehaving DNS resolvers, packet-mangling routers, misconfigured networks, mis-advertised network routes, etc. Every single one of those can happen before the customer's traffic even gets to the internet, much less to wherever your servers are housed.
All you can do is accept that there will be problems attributed to your service, rightly or not, work to mitigate and reduce the ones you can, and learn that it's not the end of the world.
The article quotes Ben Treynor as saying that Google aimed for and hit 99.95% uptime, which works out to about 4.4 hours of downtime per year.
My guess is that, despite cloud outages being painful, many applications are probably going to meet their cost/SLO goals anyway. Going up from 4 9s starts to get very expensive very quickly.
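For reference, the downtime budgets behind those numbers are easy to work out (the 99.95% figure is roughly 4.4 hours over a 365-day year):

```python
# Back-of-the-envelope downtime budgets (365-day year), to put the 99.95%
# figure and the "each extra 9 costs more" point in context.
HOURS_PER_YEAR = 365 * 24  # 8760

for availability in (0.999, 0.9995, 0.9999, 0.99999):
    downtime_h = HOURS_PER_YEAR * (1 - availability)
    print(f"{availability:.5%} uptime -> {downtime_h * 60:7.1f} minutes/year")

# 99.90000% uptime ->   525.6 minutes/year  (~8.8 hours)
# 99.95000% uptime ->   262.8 minutes/year  (~4.4 hours)
# 99.99000% uptime ->    52.6 minutes/year
# 99.99900% uptime ->     5.3 minutes/year
```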
Most people and customers are tolerant of 15 minutes downtime here and there once or twice a year. Sure, there will be loudmouths who claim to be losing thousands of dollars in sales or business, but they're usually lying and/or not customers you want to have. They'll probably leave to save $1 per month with another vendor.
It sucks but the days of "ZOMG EBAY/MICROSOFT/YAHOO DOWN!!11!" on the cover/top of slashdot and CNET are gone. Hell, slashdot and CNET are basically gone.
IMHO, the next wave is likely multi-cloud. Enterprises that require maximum uptime will likely run infrastructure that spans multiple cloud providers (and optionally one or more company controlled data centers).
OneOps (http://oneops.com) from WalmartLabs enables a multi-cloud approach. Netflix Spinnaker also works across multiple cloud providers.
DataStax (i.e. Cassandra) enables a multi-cloud approach for persistent storage.
DynomiteDB (disclaimer: my project) enables a multi-cloud approach for cache and low latency data.
Combine the above with microservices that are either stateless or use the data technologies listed above and you can easily develop, deploy and manage applications that continue to work even when an entire cloud provider is offline.
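As one concrete (and hand-wavy) illustration of the storage side, a Cassandra cluster can span two providers by treating each cloud as its own "datacenter" and writing at LOCAL_QUORUM so each side keeps serving if the other cloud goes dark. The DC names, contact points, and keyspace below are placeholders, not anyone's actual setup:

```python
# Sketch of a Cassandra client pinned to its local cloud's "datacenter" while
# the keyspace replicates into both providers.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy

profile = ExecutionProfile(
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="aws_us_east"),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,  # survives losing the other DC
)
cluster = Cluster(
    ["10.0.0.11", "10.0.0.12"],  # contact points in the local cloud (placeholders)
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect()

# Replicate every row into both providers' datacenters.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS orders
    WITH replication = {'class': 'NetworkTopologyStrategy',
                        'aws_us_east': 3, 'gcp_us_central': 3}
""")
```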
Get enough things running on multi-cloud, and you could potentially see multi cloud rolling failures, caused by (for example) a brief failure in service A leading to a gigantic load-shift to service B...
This assumes that all the software your stack uses, and all the software you deploy, is completely bug-free. While rare, a bug that has been sitting in production for a long time can surface only when you hit certain conditions. Now all your services are down. 100% is impossible.
Also, if there is a problem with a component of your stack that could have run as a managed cloud service, chances are Google or Amazon will fix your edge case much quicker than you would.
By pretending it's still the golden age of the internet and using physical servers in those locations. You might have to hire some admins, though ;).