The phrase "as long as the same overall level of reliability is achieved" is log...

joshuamorton · 2024-11-03T22:42:28 1730673748

Justify that claim.

In my experience, the set of issues that would affect two buildings close to each other, but not two buildings a mile apart, is vanishingly small: usually just last-mile fiber cuts or power issues (which are rare and mitigated by having multiple independent providers), plus building fires (which are exceedingly rare; we know of perhaps two of notable impact in more than a decade across the big three cloud providers).

Everything else is done at the zone level no matter what (onsite repair work, rollouts, upgrades, control plane changes, etc.) or can impact an entire region (non-last-mile fiber or power cuts, inclement weather, regional power starvation, etc.).

There is a potential gain from physical zone isolation, but it protects against a relatively small set of issues. Is it really better to invest in that, or to invest the resources in other safety improvements?

anewplace · 2024-11-04T00:37:14 1730680634

I think you're understating the seriousness of a physical event like a fire. Even if the likelihood of these things is "vanishingly small", the impact is so large that it more than offsets it. Taking the OVH data center fire as an example, multiple companies completely lost their data and are effectively dead now. When you're talking about a company-ending event, many people would consider even just two examples per decade a completely unacceptable failure rate. And it's more than just fires: we're also talking about tornadoes, floods, hurricanes, terrorist attacks, etc.

Google even recognizes this, and suggests that for disaster recovery planning, you should use multiple regions. AWS on the other hand does acknowledge some use cases for multiple regions (mostly performance or data sovereignty), but maintains the stance that if your only concern is DR, then a single region should be enough for the vast majority of workloads.

There's more to the story though, of course. GCP makes it easier to use multiple regions, including things like dual-region storage buckets, or just making more regions available for use. For example GCP has ~3 times as many regions in the US as AWS does (although each region is comparatively smaller). I'm not sure if there's consensus on which is the "right" way to do it. They both have pros and cons.

retinaros · 2024-11-03T23:50:35 1730677835

What happened in the GCP Paris region, then?

joshuamorton · 2024-11-04T00:18:40 1730679520

One of the vanishingly small set of issues I mentioned.

It is true, and obvious, that GCP and AWS and Azure use different architectures. It does not obviously follow that any of those architectures are inherently more reliable. And even if it did, it doesn't obviously follow that any of the platforms are inherently more reliable due to a specific architectural decision.

Like, all cloud providers still have regional outages.

belter · 2024-11-04T10:06:24 1730714784

I think you should have started this discussion by disclosing you work at Google...

> One of the vanishingly small set of issues

At your scale, this attitude is even more concerning since the rare event at scale is not rare anymore.

joshuamorton · 2024-11-04T22:53:43 1730760823

I think you're abusing the saying "at scale, rare events aren't rare" (https://longform.asmartbear.com/scale-rare/ etc.) here. It is true that when you run enough machines, individually rare events happen often, but that only becomes relevant when you have thousands, hundreds of thousands, or millions of things (https://www.backblaze.com/cloud-storage/resources/hard-drive...).

That concept is useful when the number of things you have is on the same order of magnitude as the inverse of the failure rate. But we clearly don't have that here, because even at scale, these events aren't common. Like I said, there have been, across all cloud providers, fewer than a handful over a decade.

Like, you seem to be proclaiming that these kinds of events are common and, well, no, they aren't. That's why they make the top of HN when they do happen.
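
To make the scale argument concrete, here is a rough sketch of the arithmetic. Every number in it is a made-up illustration, not a measured failure rate: the fleet sizes, the per-unit annual rates, and the function names `expected_events` and `prob_at_least_one` are all assumptions, and it treats failures as independent (a Poisson approximation), which real incidents often are not.

```python
import math


def expected_events(units: int, annual_rate: float, years: float = 1.0) -> float:
    """Expected number of failures: units * per-unit annual rate * years."""
    return units * annual_rate * years


def prob_at_least_one(units: int, annual_rate: float, years: float = 1.0) -> float:
    """Chance of at least one failure, assuming independent (Poisson) events."""
    return 1.0 - math.exp(-expected_events(units, annual_rate, years))


if __name__ == "__main__":
    # HYPOTHETICAL numbers, chosen only to show the two regimes.
    scenarios = {
        # Many units, rate ~1/100: count and inverse rate are comparable or
        # the count is larger, so "rare" failures are routine.
        "drives (100,000 units, 1% per year)": (100_000, 0.01),
        # Few units, rate ~1/2000: the count is well below the inverse rate,
        # so events stay uncommon in any given year.
        "buildings (500 units, 0.05% per year)": (500, 1 / 2000),
    }
    for name, (units, rate) in scenarios.items():
        per_year = expected_events(units, rate)
        per_decade = expected_events(units, rate, years=10)
        print(f"{name}: {per_year:.2f} expected per year, "
              f"{per_decade:.1f} per decade")
```

Under these assumed inputs the first scenario yields a steady stream of failures, while the second yields only a few per decade. That is the distinction being drawn: the saying applies when the unit count is comparable to or larger than the inverse failure rate. It does not settle the other side of the argument, which is about impact per event rather than frequency.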