
Correct.

Like many others, I discovered this when something broke in one of the golden regions. In my case it was CloudFront and ACM.

Realistically you can’t trust one provider at all if you have high availability requirements.

The justification is apparently that the cloud takes all this responsibility away from people, but from personal experience running two cages of kit at two datacenters, the TCO was lower and the reliability and availability higher. Possibly the largest cost is navigating the Harry-Potter-esque pricing and automation rules. The only gain is scaling past those two cages.

Edit: I should point out, however, that an advantage of the cloud is being able to click a couple of buttons and get rid of two cages' worth of DC equipment instantly if your product or idea doesn't work out!




> you can’t trust one provider at all

The hard part with multi-cloud is that you're just increasing your risk of being impacted by someone's failure. Sure, if you're all-in on AWS and AWS goes down, you're all-out. But if you're on [AWS, GCP] and GCP goes down, you're down anyway; even though AWS is up, you're down because Google went down. And if you're on [AWS, GCP, Azure] and Azure goes down, it doesn't matter that AWS and GCP are up... you're down because Azure is down. The only way around that is architecting your business so it can run on any one of those vendors alone, which means you're paying 3x more than you need to 99.99999% of the time.

The probability that at least one of [AWS, Azure, GCP] is down is way higher than the probability that any single one of them is down. And the probability that your two cages in your datacenter are down is way higher than the probability that any one of the hyperscalers is down.
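
To put rough numbers on that (a back-of-the-envelope sketch, assuming independent failures and an illustrative 99.99% availability per vendor, not any real SLA):

    # Illustrative only: independent failures, identical availability per vendor.
    availability = 0.9999
    p_down = 1 - availability

    # Depending on all N vendors at once: you're down if ANY one is down.
    for n in (1, 2, 3):
        print(f"{n} vendor(s), all required:     P(outage) = {1 - availability**n:.6f}")

    # Fully redundant across N vendors: you're down only if ALL are down.
    for n in (1, 2, 3):
        print(f"{n} vendor(s), any one suffices: P(outage) = {p_down**n:.2e}")

The first loop is the "spread across vendors, need them all" case; the second is the "any single vendor can carry everything" case that costs the extra money.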


> which means you're paying 3x more than you need to 99.99999% of the time.

This would be a poor decision. If you assume AWS, GCP, and Azure fail independently, you can pay 1.5x: each of the three services is scaled to take 50% of your traffic, so if any one fails, the other two can still handle 100%. This is a common way to structure applications. Assuming independence, more replicas means less overprovisioning: with two replicas you need to provision 2x, while with five independent replicas you only need 1.25x to be resilient against one failure, since each replica carries 25% of the traffic.

In general, N replicas need N/(N-1) overprovisioning to be resilient against one replica failing.
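
A quick sketch of that formula, just to show the numbers (plain arithmetic, nothing vendor-specific):

    # N equal replicas, each sized for 1/(N-1) of total traffic, so that
    # losing any single replica still leaves 100% of capacity available.
    def overprovision_factor(n_replicas: int) -> float:
        return n_replicas / (n_replicas - 1)

    for n in range(2, 6):
        print(f"{n} replicas -> provision {overprovision_factor(n):.2f}x capacity")
    # 2 -> 2.00x, 3 -> 1.50x, 4 -> 1.33x, 5 -> 1.25x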


I disagree. It’s about mitigating the risk of a single provider’s failure. Single providers go down all the time. We’ve seen it from all three major cloud vendors.


You disagree with what? That relying on three vendors increases your risk of being impacted by one? That's just statistics. You can disagree with it, but that doesn't make it incorrect.

Or do you disagree that planning for a total failure of one vendor and running redundant workloads on the others increases your costs 99.99999% of the time? Because that's a fairly standard SLA from each of the major vendors. Let's even reduce it to EC2's SLA, 99.99%. So 99.99% of the time you're paying 3x as much as you need to, just to keep your services up for an extra hour or so per year. Again, you can disagree with that, but that doesn't make it incorrect.

Some businesses might need that extra hour; the cost of the extra services might be cheaper than the cost of an hour of downtime per year. But you're not going to find many businesses like that. Either you're running completely redundant workloads, paying 3x as much for an extra hour per year, or you're going to be taken offline whenever any one of the three goes down independently of the others.
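
For reference, the availability-to-downtime arithmetic (an illustrative sketch, not tied to any vendor's actual SLA terms):

    # Annual downtime implied by an availability percentage.
    HOURS_PER_YEAR = 365.25 * 24

    def downtime_hours(availability: float) -> float:
        return (1 - availability) * HOURS_PER_YEAR

    for a in (0.999, 0.9995, 0.9999):
        print(f"{a:.2%} -> ~{downtime_hours(a):.1f} hours/year of downtime")
    # 99.90% -> ~8.8h, 99.95% -> ~4.4h, 99.99% -> ~0.9h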

Single providers go down, yes. And across three providers, an outage somewhere hits you roughly three times as often as with one. Either you're massively overspending or you're tripling your risk of downtime. If multi-cloud worked, you'd hear people talking about it and their success stories would fill the front page of Hacker News. They don't, because it doesn't.


How do you test failover from provider-wide outages?

I’ve never heard of an untested failover mechanism that worked. Most places are afraid to invoke such a thing, even during a major outage.


That’s fairly simple: regular scenario planning, drills, and suitable chaos engineering tooling.

Being afraid of failures is a massive sign of problems. I’ve worked in those sorts of places before.
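
As a simplified example of what such a drill can look like: if traffic is split between providers with weighted DNS, a drill is just draining one side and checking that the other still serves. A rough sketch assuming Route 53 weighted records; the zone ID, hostnames and weights below are hypothetical placeholders:

    # Drill sketch: drain the AWS-hosted endpoint by zeroing its DNS weight,
    # verify the service still answers, then restore the weight.
    import boto3
    import requests

    route53 = boto3.client("route53")

    def set_aws_weight(weight: int) -> None:
        route53.change_resource_record_sets(
            HostedZoneId="ZEXAMPLE123",  # hypothetical zone
            ChangeBatch={
                "Comment": "failover drill",
                "Changes": [{
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.example.com.",
                        "Type": "CNAME",
                        "SetIdentifier": "aws",
                        "Weight": weight,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "aws-origin.example.com."}],
                    },
                }],
            },
        )

    set_aws_weight(0)    # drain AWS for the drill
    assert requests.get("https://app.example.com/health", timeout=5).ok
    set_aws_weight(100)  # restore afterwards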


ACM and CloudFront being sticky to us-east-1 is particularly annoying. I’m happy not being multi-regional (I don’t have that level of DR requirements), but these types of services force me to take on all the multi-region complexity anyway.
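
For anyone who hasn't hit this yet: CloudFront only accepts ACM certificates issued in us-east-1, so even a stack that lives entirely in another region ends up with a region-pinned client for that one step. A minimal sketch (the domain name is a placeholder):

    # CloudFront is a global service, but its ACM certificates must live in us-east-1.
    import boto3

    acm_use1 = boto3.client("acm", region_name="us-east-1")  # pinned on purpose

    cert = acm_use1.request_certificate(
        DomainName="www.example.com",  # placeholder domain
        ValidationMethod="DNS",
    )
    print(cert["CertificateArn"])  # reference this ARN in the distribution's ViewerCertificate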


Harry-Potter-esque pricing?

Is that a reference to the difficulty of calculating the cost of visiting all the rides at Universal? That's my best guess...


It's more a stab at the inconsistency of rules around magic.

"Well this pricing rule only works on a Tuesday lunch time if you're standing on one leg with a sausage under each arm and a traffic cone on your head"

And there are a million of those to navigate.



