
>are distributed systems naturally more resilient?

All else being equal: Yes.

It's like asking if a RAID1 is more resilient than a single drive.




RAID1 is mirrored. That is not what I would call a typical distributed system. It is a very redundant system. Like a cluster.

A distributed system without redundancy would be more like data striped across disks without parity.

And that actually makes it less resilient: the failure of any one component can bring down the whole system, and the more components there are, the higher the odds that at least one of them fails.
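To put rough numbers on that (a sketch assuming independent disk failures, which is optimistic since drives from the same batch tend to fail together; the per-disk failure probability is made up):

    # Chance the array dies within some period, given each disk
    # independently fails with probability p in that period.
    p = 0.05  # hypothetical per-disk failure probability

    # Striping without parity (RAID0): any one disk failing kills the array.
    raid0 = 1 - (1 - p) ** 2   # ~0.0975 -- worse than a single disk

    # Mirroring (RAID1): every disk must fail to lose the array.
    raid1 = p ** 2             # 0.0025 -- far better than a single disk

    print(p, raid0, raid1)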


When I think of distributed systems, the RAID1 analogy seems much more applicable than RAID0.

The term "distributed" has been traditionally applied to the original design of the TCP/IP protocol, various application-layer protocols like NNTP, IRC, etc., with the common factor being that each node operates as a standalone unit, but with nodes maintaining connectivity to each other so the whole system approaches a synchronized state -- if one node fails. the others continue to operate, but the overall system might become partitioned, with each segment diverging in its state.

The "RAID0" approach might apply to something like a Kubernetes cluster, where each node of the system is an autonomous unit, but each node performs a slightly different function, so that if any one node fails, the functionality of the overall system is blocked.


The CAP theorem comes to mind: the first approach maintains availability but risks consistency; the second maintains consistency but risks availability. But the second approach seems like a variant implementation strategy for what is still effectively a centralized system -- the overall solution still exists only as a single instance -- so I usually think of the first approach when something is described as "distributed".
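As a toy sketch of that trade-off (purely illustrative Python; the node classes and the partition flag are made up, and real consensus protocols are far subtler):

    # AP-style node: stays available during a partition, risks divergence.
    class APNode:
        def __init__(self):
            self.data = {}
        def write(self, key, value, partitioned):
            self.data[key] = value  # accepts the write even when cut off
            return "ok"

    # CP-style node: refuses writes it can't replicate, risks unavailability.
    class CPNode:
        def __init__(self):
            self.data = {}
        def write(self, key, value, partitioned):
            if partitioned:
                return "error: no quorum"
            self.data[key] = value
            return "ok"

    a1, a2 = APNode(), APNode()
    a1.write("x", 1, partitioned=True)
    a2.write("x", 2, partitioned=True)
    print(a1.data, a2.data)  # {'x': 1} {'x': 2} -- divergent segments

    print(CPNode().write("x", 1, partitioned=True))  # rejected, stays consistent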


You're assuming a stateful system where the state is distributed throughout the components of the system. For a stateless component of a distributed system, you don't need redundancy to recover from an outage.

>likelihood of failure is statistically higher because of the higher number of components

Yes, absolutely true, but a distributed system's resiliency is not necessarily like your example of data striped without parity, unless we're specifically talking about distributed storage.
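A minimal sketch of that distinction (hypothetical handlers, not from any particular framework):

    # Stateless: everything needed is in the request, so "recovery"
    # after an outage is just starting a fresh process -- no redundant
    # copy of state is needed, because there is no state to lose.
    def stateless_handler(request):
        return request["a"] + request["b"]

    # Stateful: the counter lives only in this process, so surviving
    # an outage requires replicating it somewhere else first.
    class StatefulHandler:
        def __init__(self):
            self.counter = 0
        def handle(self, request):
            self.counter += 1
            return self.counter

    print(stateless_handler({"a": 2, "b": 3}))  # 5, correct even on the
                                                # first request after a restart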


To the GP's point - if you lose the RAID controller, you've lost a whole lot more than you would to a single drive failure.


The controller isn't stateful; it's just an interface to the disks. If the controller fails, but the disks haven't, then all you've lost is the time it takes to plug the disks into a new controller.

With RAID1, there's also nothing specific to the RAID configuration baked into the way the data is encoded on the disk. You might have to carefully replicate your configuration to access the filesystem from a failed RAID0 array, but you can just pull an individual disk out of a RAID1 array and use it normally as a standalone disk.
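A byte-level toy model of why that works (purely illustrative; real arrays also carry controller- or mdadm-specific metadata, which this ignores):

    data = b"hello, filesystem!"

    # RAID1: each member disk holds a complete, ordinary copy.
    mirror_a, mirror_b = data, data
    assert mirror_a == data  # a pulled mirror disk is usable standalone

    # RAID0: data is interleaved across members in fixed-size chunks.
    chunk = 4
    stripe_a = b"".join(data[i:i + chunk] for i in range(0, len(data), 2 * chunk))
    stripe_b = b"".join(data[i + chunk:i + 2 * chunk] for i in range(0, len(data), 2 * chunk))

    # Neither stripe alone is readable; reassembly needs both disks
    # plus the exact chunk size and disk order.
    assert stripe_a != data and stripe_b != data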


Yes, RAID isn't a backup, but it is resilient.

You will have a better chance at uptime with a RAID than with a single drive, so you hopefully don't have to climb up ventilation ducts, walk across broken glass, and kill anyone sent to stop you on your quest to reconnect those cables that were cut.



