
What if you have dozens of big data centers?



To reinforce your point:

The scale of cloud data centres reflects the scale of their customer base, not the size of the basket for each individual customer.

Larger data centres actually improve availability through several mechanisms: having more power components such as generators means the failure of any one costs only a few percent of capacity instead of causing a total blackout. You can also partition core infrastructure like routers and power rails into more fault domains and update domains.

Some large clouds have two update domains and five fault domains on top of three zones that are more than 10 km apart. You can't beat ~30 individual partitions with your own data centres at a reasonable cost!
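
To make that arithmetic concrete, here's a back-of-the-envelope sketch in Python (the domain counts are just the ones quoted above, not any provider's published figures):

    # Rough sketch: how update domains, fault domains, and zones multiply
    # into independent partitions, and what losing any one partition costs.
    update_domains = 2
    fault_domains = 5
    zones = 3

    partitions = update_domains * fault_domains * zones  # 2 * 5 * 3 = 30
    blast_radius = 1 / partitions                        # share of capacity lost per failure

    print(f"{partitions} partitions; one failure takes out ~{blast_radius:.1%} of capacity")
    # -> 30 partitions; one failure takes out ~3.3% of capacity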


I provided three different references. Despite the massive downvotes on my comment, I guess by Google engineers treating it as a troll... :-) I take comfort in the fact that nobody was able to produce a reference to prove me wrong.


AWS Zone is sort-roughly-kinda a GCP Region. It sounds like you want multi-region: https://cloud.google.com/compute/docs/regions-zones


> It sounds like you want multi-region

If you use Google Cloud... with the 100 ms of latency that will add to every interaction...


You haven't actually made an argument.

It is true that the nomenclature "AWS Availability Zone" has a different meaning than "GCP Zone" when discussing the physical separation between zones within the same region.

It's unclear why this is inherently a bad thing, as long as the same overall level of reliability is achieved.


The phrase "as long as the same overall level of reliability is achieved" is logically flawed when discussing physically co-located vs. geographically separated infrastructure.


Justify that claim.

In my experience, the set of issues that would affect two buildings close to each other but not two buildings a mile apart is vanishingly small: mostly last-mile fiber cuts or power issues (which are rare and mitigated by having multiple independent providers), plus things like building fires (which are exceedingly rare; we know of perhaps two of notable impact in more than a decade across the big three cloud providers).

Everything else is either done at the zone level no matter what (onsite repair work, rollouts, upgrades, control plane changes, etc.) or can impact an entire region (non-last-mile fiber or power cuts, inclement weather, regional power starvation, etc.).

There is a potential gain from physical zone isolation, but it protects against a relatively small set of issues. Is it really better to invest in that, or to invest the resources in other safety improvements?


I think you're downplaying the seriousness of a physical event like a fire. Even if the likelihood of these things is "vanishingly small", the impact is so large that it more than offsets it. Taking the OVH data center fire as an example, multiple companies completely lost their data and are effectively dead now. When you're talking about a company-ending event, many people would consider even just two examples per decade a completely unacceptable failure rate. And it's more than just fires: we're also talking about tornadoes, floods, hurricanes, terrorist attacks, etc.
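
To put rough numbers on that likelihood-vs-impact trade-off, here's an illustrative expected-loss sketch (all figures are invented for illustration, not measured rates):

    # Illustrative only: why a rare, catastrophic event can dominate the
    # expected loss even when its probability is tiny. Numbers are made up.
    def expected_annual_loss(annual_probability, impact_usd):
        return annual_probability * impact_usd

    routine_outage = expected_annual_loss(0.50, 50_000)       # frequent, recoverable blip
    building_loss = expected_annual_loss(0.001, 50_000_000)   # e.g. a fire destroying a facility

    print(f"routine outages:     ${routine_outage:,.0f}/year expected loss")
    print(f"building-level loss: ${building_loss:,.0f}/year expected loss")
    # Even at 1-in-1000 odds per year, the catastrophic (potentially
    # company-ending, effectively unbounded) impact dominates.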

Google even recognizes this, and suggests that for disaster recovery planning, you should use multiple regions. AWS on the other hand does acknowledge some use cases for multiple regions (mostly performance or data sovereignty), but maintains the stance that if your only concern is DR, then a single region should be enough for the vast majority of workloads.

There's more to the story though, of course. GCP makes it easier to use multiple regions, including things like dual-region storage buckets, or just making more regions available for use. For example, GCP has ~3 times as many regions in the US as AWS does (although each region is comparatively smaller). I'm not sure if there's consensus on which is the "right" way to do it. They both have pros and cons.
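
As an aside, the dual-region bucket mentioned above is straightforward to set up with the google-cloud-storage Python client. A minimal sketch, assuming application-default credentials are configured, the predefined NAM4 dual-region, and a hypothetical bucket name:

    # Minimal sketch: create a dual-region GCS bucket for DR-style redundancy.
    # "example-dr-bucket" is a hypothetical name; NAM4 is the predefined
    # us-central1 + us-east1 dual-region location code.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("example-dr-bucket")
    bucket.storage_class = "STANDARD"
    client.create_bucket(bucket, location="NAM4")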


What happened in the GCP Paris region then?


One of the vanishingly small set of issues I mentioned.

It is true, and obvious, that GCP and AWS and Azure use different architectures. It does not obviously follow that any of those architectures are inherently more reliable. And even if it did, it doesn't obviously follow that any of the platforms are inherently more reliable due to a specific architectural decision.

Like, all cloud providers still have regional outages.


I think you should have started this discussion by disclosing you work at Google...

> One of the vanishingly small set of issues

At your scale, this attitude is even more concerning since the rare event at scale is not rare anymore.


I think you're abusing the saying "at scale, rare events aren't rare" (https://longform.asmartbear.com/scale-rare/ etc.) here. It is true that when you are running thousands of machines, events that happen rarely happen often, but that scale usually becomes relevant at thousands, or hundreds of thousands, or millions of things (https://www.backblaze.com/cloud-storage/resources/hard-drive...).

That concept is useful when the scale of things you have is the same order of magnitude as the rate of failure. But we clearly don't have that here, because even at scale, these events aren't common. Like I said, there have been, across all cloud providers, less than a handful over a decade.
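
A quick sketch of the underlying math (the per-unit rate here is illustrative, not Backblaze's actual figure): the chance of seeing at least one event somewhere in a fleet of N units is 1 - (1 - p)^N, which only becomes significant once N approaches 1/p.

    # Illustrative: when does a "rare" per-unit event become likely fleet-wide?
    p_annual = 1e-4  # made-up per-unit annual event rate (1 in 10,000)

    for n in (10, 1_000, 100_000):
        p_any = 1 - (1 - p_annual) ** n
        print(f"N={n:>7,}: P(at least one event per year) = {p_any:.1%}")
    # N=     10: 0.1%    N=  1,000: 9.5%    N=100,000: ~100.0%
    # The effect only kicks in when fleet size is comparable to 1/p,
    # i.e. the same order of magnitude as the failure rate, as noted above.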

Like, you seem to be proclaiming that these kinds of events are common and, well, no, they aren't. That's why they make the top of HN when they do happen.



