
My biggest takeaway from this is "I certainly hope this person isn't responsible for actually designing the bare metal storage infrastructure meant to underlie something important". They seem to be operating from the premise that data replication at the 'cloud' level can solve all their problems.



Not sure what to say, but this is how it works on all the large systems I'm familiar with.

Imagine you have two servers, each with two 1TB disks (4TB physical storage). And you have two distinct services with 1TB datasets, and want some storage redundancy.

One option is to put each pair of disks in a RAID-1 configuration, so each RAID instance holds one dataset. This protects against a disk failure, but not against server or network failures.

Your other option is to put one copy of the dataset on each server. Now you are protected from the failure of any single disk, server, or the network to that server.

In both cases you have 2TB of logical storage available.
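
A minimal sketch of that arithmetic, using the numbers from the example above (purely illustrative):

    # Two servers, two 1 TB disks each: 4 TB physical either way.
    DISK_TB = 1
    DISKS_PER_SERVER = 2
    SERVERS = 2

    physical_tb = DISK_TB * DISKS_PER_SERVER * SERVERS             # 4 TB raw

    # Option A: RAID-1 inside each server (each disk pair mirrors one dataset).
    # Survives a disk failure, but not the loss of a server or its network.
    raid1_logical_tb = SERVERS * (DISK_TB * DISKS_PER_SERVER / 2)  # 2 TB

    # Option B: no local mirroring; each server holds a full copy of both datasets.
    # Survives any single disk, server, or network failure.
    replicated_logical_tb = physical_tb / 2                        # 2 TB

    print(physical_tb, raid1_logical_tb, replicated_logical_tb)    # 4 2.0 2.0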


You're putting yourself at risk of split-brain, though (or of downtime due to failover, or the lack of it).

In either case what you're describing isn't really the 'cloud' alternative.


Yeah, that's a complication/cost of HA that a significant portion of the industry has long accepted. Everywhere I've been in the last 5 years has had this assumption baked in across all distributed systems.


Where possible, my organization tries to have services deployed in sets of three (and to require a quorum) to reduce/eliminate split-brain situations. And we're very small scale.
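
As a minimal sketch of why three-plus-quorum helps (illustrative only, not our actual tooling): with 3 replicas, at most one side of a network partition can hold a majority, so only one side keeps accepting writes.

    # Majority quorum check: only a strict majority of replicas may serve writes.
    def has_quorum(reachable_replicas: int, total_replicas: int = 3) -> bool:
        return reachable_replicas >= total_replicas // 2 + 1

    print(has_quorum(2))  # True  -> majority side keeps accepting writes
    print(has_quorum(1))  # False -> minority side must stop, avoiding split-brain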


I think you're giving him too little credit there.

The parent's point is really spot on though: most websites aren't at a scale where every stateful API has redundant implementations. But the author's point does have merit: inevitably, something goes wrong with all systems - and when it does, your system goes down if all you did was trust your HA config to work. If you actually did go for redundant implementations, your users likely aren't even gonna notice anything went wrong.

It's, however, kinda unsustainable to maintain unless you have a massive development budget.


But why is his layer the only correct layer for redundancy?

He's also doing wasted work if his redundancy has bugs that a consumer has to worry about working around.


We're currently quite off-topic, if I'm honest.

I think the author was specifically talking about RAID-1 redundancy and is advocating that you can leave your systems with RAID-0 (so no redundant drives in each server), as you're gonna need multiple nodes in your cluster anyway... so if any of your system's disks break, you can just let it go down and replace the disk while the node is offline.

But despite being off-topic: redundant implementations are - from my experience - not used in a failover way. They're active at all times, and load is spread across them if you can do that, so you'd likely find the inconsistencies in the integration-test layer.


> not used in a failover way.

Aurora works like that. Read replicas are on standby and become writable when the writable node dies or is replaced. They can be used for reads, of course, but so can other standby streaming replicas.


I'm not the original author, but I used to work with him and now work in storage infrastructure at Google. As others pointed out, what the author, Kris, writes kind of implies/requires a certain scale of infrastructure to make sense. Let me try to provide at least a little bit of context:

The larger your infrastructure, the smaller the relative efficiency win that's worth pursuing (duh, I know, engineering time costs the same, but the absolute savings numbers from relative wins go up). That's why an approach along the lines of "redundancy at all levels" (raid + x-machine replication + x-geo replication etc) starts becoming increasingly worth streamlining.

Another, separate consideration is the type of failure you have to plan for: an availability incident (temporary unavailability) vs. a durability incident (permanent data loss). And then it's worth considering that in the limit of long durations, an availability incident becomes the same as a durability incident. This is contextual: to pick an obvious/illustrative example, if your Snapchat messages are offline for 24h, you might as well have lost the data instead.

Now, machines fail, of course. Doing physical maintenance (swapping disks) happens on significant, human time scales. It's not generally tolerable for your data to be offline for that long. So local RAID barely helps at all. Instead, you're going to want to make sure your data stays available despite a certain rate of machine failure.

You can now make similar considerations for different, larger domains: network, power, building, city/location, etc. They have vastly different failure probabilities and also different failure modes (network devices failing is likely an availability concern; a mudslide into your DC is a bit less likely to be recoverable). Depending on your needs, you might accept some of these but not others.

The most trivial way to deal with this is to simply make sure you have a replica of each chunk of data in multiple of each of these kinds of failure zones. A replica each on multiple machines (pick the amount of redundancy you need based on a statistical model from component failure rates), a replica each on machines under different network devices, on different power, in different geographies, etc.
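
A minimal sketch of that placement idea (machine and zone names are invented for illustration): pick each replica from a distinct failure domain, so no single domain failure takes out every copy.

    from itertools import groupby

    # machine -> failure domain (rack, power feed, site, ...); hypothetical names
    machines = {
        "m1": "rack-a", "m2": "rack-a",
        "m3": "rack-b", "m4": "rack-c",
    }

    def place_replicas(machines: dict, replicas: int = 3) -> list:
        # At most one machine per failure domain.
        grouped = groupby(sorted(machines, key=machines.get), key=machines.get)
        picks = [next(group) for _domain, group in grouped][:replicas]
        if len(picks) < replicas:
            raise RuntimeError("not enough distinct failure domains")
        return picks

    print(place_replicas(machines))  # ['m1', 'm3', 'm4'], one machine per rack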

That's expensive. The next, more efficient option is to use the same concept as RAID (erasure codes) and apply it across a wider scope. So you basically get RAID, but you use your clever model of failure zones for placement.

This gets a bit complicated in practice. Most folks stick to replicas. (E.g., last I looked, HDFS/Hadoop only supported replication, but it did use knowledge of network topology for placing data.)

The reason you don't want to do this in your application is that it's really kinda complicated. You're far more likely to have many applications than many storage technologies (or databases).

Now, at some point of infrastructure size or complexity or team size it may make sense to separate your storage (the stuff I'm talking about) from your databases as well. But as Kris argues, many common databases can be made to handle some of these failure zones.

In any case, that's the extremely long version of an answer to your question of why you'd handle redundancy in this particular layer. The short answer is: below this layer, the scope is too small or there's too little meta information. But doing it higher in the stack fails to exploit a prime opportunity to abstract away some really significant complexity. I think we all know how useful good encapsulation can be! You avoid doing this in multiple places simply because it's expensive.

(Everything above is common knowledge among storage folks, nothing is Google specific or otherwise it has been covered in published articles. Alas, the way we would defend against bugs in storage is not public. Sorry.)


Erasure codes are indeed a great illustration of the scale concept. If you're designing to tolerate 2 simultaneous failures with 3 drives, you need RAID-1 with N=3, which is 3x the cost of a single copy. If you have 15 drives, you can do 12-of-15 erasure coding, which is only 1.25x the cost of a single copy.
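
The overhead arithmetic, spelled out (plain arithmetic, no particular storage system assumed):

    # N-way replication stores N full copies of the data.
    def replication_overhead(copies: int) -> float:
        return float(copies)

    # k-of-n erasure coding stores n shards for every k shards of data.
    def erasure_overhead(data_shards: int, total_shards: int) -> float:
        return total_shards / data_shards

    print(replication_overhead(3))   # 3.0  -> RAID-1 with N=3
    print(erasure_overhead(12, 15))  # 1.25 -> 12-of-15 erasure coding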



