So in all seriousness, how do folks deal with this?
In this case, it ended up being a multi-region failure, so your only real solution is to spread it across providers, not just regions.
But I imagine it's a similar issue to scaling across regions, even within a provider. We can spin up machines in each region to provide fault tolerance, but we're at the mercy of our Postgres database. What do others do?
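(To make the Postgres part concrete: assuming a streaming-replication standby in a second region, a client built against libpq 10+ can at least fail over to whichever node currently accepts writes. A minimal sketch with made-up hostnames; it says nothing about how the standby actually gets promoted.)

```python
# Minimal client-side sketch, not a full HA setup: libpq 10+ accepts a
# comma-separated host list and connects to the node that currently takes
# writes, which helps once a standby in another region has been promoted.
import psycopg2

conn = psycopg2.connect(
    host="pg-us-east.example.internal,pg-eu-west.example.internal",  # hypothetical hosts
    port="5432,5432",
    dbname="app",
    user="app",
    password="secret",
    target_session_attrs="read-write",  # only accept a node that accepts writes
    connect_timeout=5,                  # give up quickly on an unreachable region
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery()")
    print(cur.fetchone())  # (False,) on the writable primary
```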
Most people just deal with it and accept that their site will go down for 20 minutes every 3-4 years or so, even when hosting on a major cloud, because:
1) the cost of mitigating that risk is much higher than the cost of just eating the outage, and
2) their high traffic production site is routinely down for that long anyway, for unrelated reasons.
If you really, really can't bear the business costs of an entire provider ever going down, even that rarely (e.g. you're doing life support, military systems, big finance), then you just pay a lot of money to rework your entire system into a fully redundant infrastructure that runs on multiple providers simultaneously.
There really aren't any other options besides these two.
I will add that if you can afford the time and effort, it's worth designing your system from the beginning to work on multiple providers without much friction. That means trying as hard as you can to use as few provider-specific services as possible (RDS, DynamoDB, SQS, BigTable, etc.). In most cases, pjlegato's 1) will still apply.
But you get a massive side-benefit (main benefit, I think) in cost. There are huge bidding wars between providers and if you're a startup and know how to play them off each other, you could even get away with not having to pay hosting costs for years. GC, AWS, Azure, Rackspace, Aliyun, etc, etc are all fighting for your business. If you've done the work to be provider-agnostic, you could switch between them with much less effort and reap the savings.
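To make the "use as few provider-specific things as possible" point concrete, here's a rough sketch of hiding a message queue behind your own interface so the AWS/GCP SDK calls stay in one module; the queue URL, project and topic names are placeholders:

```python
# Sketch of provider-agnostic design: application code only sees MessageQueue,
# so swapping SQS for Pub/Sub (or a self-hosted broker) is a config change,
# not a rewrite.
from abc import ABC, abstractmethod

class MessageQueue(ABC):
    @abstractmethod
    def publish(self, body: str) -> None: ...

class SqsQueue(MessageQueue):
    def __init__(self, queue_url: str):
        import boto3                      # AWS-specific bits stay in here
        self._sqs = boto3.client("sqs")
        self._queue_url = queue_url

    def publish(self, body: str) -> None:
        self._sqs.send_message(QueueUrl=self._queue_url, MessageBody=body)

class PubSubQueue(MessageQueue):
    def __init__(self, project: str, topic: str):
        from google.cloud import pubsub_v1  # GCP-specific bits stay in here
        self._publisher = pubsub_v1.PublisherClient()
        self._topic = self._publisher.topic_path(project, topic)

    def publish(self, body: str) -> None:
        self._publisher.publish(self._topic, data=body.encode("utf-8")).result()

def make_queue(provider: str) -> MessageQueue:
    # Placeholder identifiers, purely illustrative.
    if provider == "aws":
        return SqsQueue("https://sqs.us-east-1.amazonaws.com/123456789012/orders")
    return PubSubQueue("my-project", "orders")
```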
If you are doing life support, military systems, or HA big finance, then you are quite likely to be running on dedicated equipment, with dedicated circuits, and quite often highly customized/configured non-stop hardware/operating systems.
You are unlikely to be running such systems on AWS or GCE.
Yep. Those were originally Itanium-only, so their success was somewhat… limited, compared to IBM's "we're backwards compatible to punch cards" mainframes.
Only recently did Intel start porting mission-critical features like CPU hot-swap over to Xeons so they can finally let the Itanic die; hopefully we'll start seeing more x86 systems with mainframe-like capabilities.
Hosting on anything/anywhere, really. Even if one builds clusters with true 100% reliability, running on nuclear power and buried 100 feet underground, you still have to talk to the rest of the world through a network that can fall apart for a variety of reasons. If most of your users are on their mobile phones, they might not even notice your outages.
At some point adding an extra 9 to the service availability can no longer be justified for the associated cost.
Depends entirely on your business, but what I do is just tolerate the occasional 15-minute outage. There's increasing cost to getting more 9's on your uptime, and for me, engineering a system that has better uptime than Google Cloud does, by doing multi-cloud failover, is way out of favorable cost/benefit territory.
It is impossible to ensure 100% uptime, and it gets increasingly hard to even approach it as you put more separate entities between yourself and the client. The thing is, you'll be blamed for problems that aren't in your control and aren't really related to your service but to the customer's side: local connectivity, phone data service, misbehaving DNS resolvers, packet-mangling routers, misconfigured networks, mis-advertised network routes, etc. Every single one of those can happen before the customer's traffic even gets to the internet, much less to wherever your servers are housed.
All you can do is accept that there will be problems attributed to your service, rightly or not, work to mitigate and reduce the ones you can, and learn that it's not the end of the world.
The article quotes Ben Treynor as saying that Google aimed for and hit 99.95% uptime, which works out to about 4.4 hours of downtime per year.
My guess is that, despite cloud outages being painful, many applications are probably going to meet their cost/SLO goals anyway. Going up from 4 9s starts to get very expensive very quickly.
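For reference, the downtime budgets behind those numbers are easy to work out (the 99.95% figure is roughly 4.4 hours over a 365-day year):

```python
# Back-of-the-envelope downtime budgets (365-day year), to put the 99.95%
# figure and the "each extra 9 costs more" point in context.
HOURS_PER_YEAR = 365 * 24  # 8760

for availability in (0.999, 0.9995, 0.9999, 0.99999):
    downtime_h = HOURS_PER_YEAR * (1 - availability)
    print(f"{availability:.5%} uptime -> {downtime_h * 60:7.1f} minutes/year")

# 99.90000% uptime ->   525.6 minutes/year  (~8.8 hours)
# 99.95000% uptime ->   262.8 minutes/year  (~4.4 hours)
# 99.99000% uptime ->    52.6 minutes/year
# 99.99900% uptime ->     5.3 minutes/year
```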
Most people and customers are tolerant of 15 minutes downtime here and there once or twice a year. Sure, there will be loudmouths who claim to be losing thousands of dollars in sales or business, but they're usually lying and/or not customers you want to have. They'll probably leave to save $1 per month with another vendor.
It sucks but the days of "ZOMG EBAY/MICROSOFT/YAHOO DOWN!!11!" on the cover/top of slashdot and CNET are gone. Hell, slashdot and CNET are basically gone.
IMHO, the next wave is likely multi-cloud. Enterprises that require maximum uptime will likely run infrastructure that spans multiple cloud providers (and optionally one or more company controlled data centers).
OneOps (http://oneops.com) from WalmartLabs enables a multi-cloud approach. Netflix Spinnaker also works across multiple cloud providers.
DataStax (i.e. Cassandra) enables a multi-cloud approach for persistent storage.
DynomiteDB (disclaimer: my project) enables a multi-cloud approach for cache and low latency data.
Combine the above with microservices that are either stateless or use the data technologies listed above and you can easily develop, deploy and manage applications that continue to work even when an entire cloud provider is offline.
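As one concrete (and hand-wavy) illustration of the storage side, a Cassandra cluster can span two providers by treating each cloud as its own "datacenter" and writing at LOCAL_QUORUM so each side keeps serving if the other cloud goes dark. The DC names, contact points, and keyspace below are placeholders, not anyone's actual setup:

```python
# Sketch of a Cassandra client pinned to its local cloud's "datacenter" while
# the keyspace replicates into both providers.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy

profile = ExecutionProfile(
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="aws_us_east"),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,  # survives losing the other DC
)
cluster = Cluster(
    ["10.0.0.11", "10.0.0.12"],  # contact points in the local cloud (placeholders)
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect()

# Replicate every row into both providers' datacenters.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS orders
    WITH replication = {'class': 'NetworkTopologyStrategy',
                        'aws_us_east': 3, 'gcp_us_central': 3}
""")
```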
Get enough things running on multi-cloud, and you could potentially see multi cloud rolling failures, caused by (for example) a brief failure in service A leading to a gigantic load-shift to service B...
This assumes that all the software your stack uses, and all the software you deploy, is completely bug-free. While rare, a bug that has been sitting in production for a long time can surface only when you hit certain conditions. Now all your services are down. 100% is impossible.
Also, if there is a problem with a component of your stack that could have run as a managed cloud service, chances are Google or Amazon will fix your edge case much quicker than you would.
By pretending it's still the golden age of the internet and using physical servers in those locations. You might have to hire some admins, though ;).