Clever trick. Having dealt with very similar things using Cassandra, I'm curious how this setup will react to a failure of a local NVMe disk.

They say that GCP will kill the whole node, which is probably a good thing if you can be sure it does that quickly and consistently.

If it doesn't (or not fast enough), you'll have a slow node amongst faster ones, creating a big hotspot in your database. Cassandra doesn't handle that well, and in early versions I remember cascading effects when a few nodes had slowdowns.




That's a good observation. We've spent a lot of time on our control plane, which handles the various RAID1 failure modes. For example, when a RAID1 degrades due to a failed local SSD, we force-stop the node so that it doesn't continue to operate as a slow node. Wait for part 2! :)
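
To sketch the idea (a simplified illustration, not our actual control-plane code; it assumes Linux md RAID1 and uses a plain systemctl stop of the scylla-server unit as a stand-in for the real action):

    import re
    import subprocess

    def degraded_md_arrays(mdstat_path="/proc/mdstat"):
        """Return names of md arrays that are missing at least one member.

        A healthy RAID1 status line ends with "[UU]"; a degraded one shows
        an underscore per missing member, e.g. "[U_]".
        """
        degraded, current = [], None
        with open(mdstat_path) as f:
            for line in f:
                m = re.match(r"^(md\d+)\s*:", line)
                if m:
                    current = m.group(1)
                elif current and re.search(r"\[U*_+U*\]", line):
                    degraded.append(current)
        return degraded

    if __name__ == "__main__":
        bad = degraded_md_arrays()
        if bad:
            # Stand-in for the real control-plane action: stop the database
            # so the node can't keep serving traffic as a slow replica.
            print(f"degraded arrays: {bad}; force-stopping node")
            subprocess.run(["systemctl", "stop", "scylla-server"], check=False)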


That's always a risk when using local drives and needing to rebuild when a node dies, but I guess they can over-provision to cover one node failure in the cluster until the cache is warmed up.

Edit: Just wanted to add that because they are using Persistent Disks as the source of truth, and depending on the network bandwidth, it might not be that big of a problem to restore a node to a working state if it's using a quorum for reads and RF >= 3.
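
To make the quorum arithmetic concrete (this is just the standard Cassandra/ScyllaDB majority math, nothing specific to their setup):

    def quorum(replication_factor: int) -> int:
        """Quorum is a strict majority of replicas: floor(RF / 2) + 1."""
        return replication_factor // 2 + 1

    # With RF = 3, a QUORUM read needs 2 replicas, so one node can be down
    # (or rebuilding from the persistent disk) without failing reads.
    for rf in (3, 5):
        print(f"RF={rf}: quorum={quorum(rf)}, tolerates {rf - quorum(rf)} replica(s) down")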

Restoring a node from zero in case of disk failure will always be bad.

They could also have another caching layer on top of the cluster to further mitigate the latency issue until the node gets back to health and finishes all the hinted handoffs.
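
For illustration, a read-through cache layer on top of the cluster is roughly this shape (a toy sketch; the cache class and the read_from_db callback are stand-ins for e.g. a Redis client and a quorum read against the cluster):

    import time

    class TtlCache:
        """Toy in-memory stand-in for a real cache tier (Redis, memcached, ...)."""
        def __init__(self):
            self._data = {}
        def get(self, key):
            value, expires = self._data.get(key, (None, 0.0))
            return value if time.time() < expires else None
        def set(self, key, value, ttl_s):
            self._data[key] = (value, time.time() + ttl_s)

    def read_through(key, cache, read_from_db, ttl_s=60):
        """Serve from cache; only fall back to the cluster on a miss."""
        value = cache.get(key)
        if value is None:
            value = read_from_db(key)  # e.g. a QUORUM read
            cache.set(key, value, ttl_s)
        return value

    # Usage with a fake "database" read:
    cache = TtlCache()
    print(read_through("user:42", cache, lambda k: f"row-for-{k}"))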


We have two ways of re-building a node under this setup.

We can either re-build the node by wiping its disks and letting it stream in data from other replicas, or we can re-build by simply re-syncing the pd-ssd to the nvme.
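
For the second path, the mechanics amount to "re-add the fresh local disk to the mirror and let it copy back from the persistent disk". A rough sketch of what that can look like with Linux md (illustrative only; device names and exact commands are placeholders, not our production tooling):

    import subprocess

    # Hypothetical layout: md0 mirrors the GCP persistent disk (source of
    # truth) with a local NVMe SSD acting as the fast member.
    ARRAY = "/dev/md0"
    NEW_NVME = "/dev/nvme0n1"  # replacement local SSD on the restarted VM

    def resync_local_ssd():
        """Re-add the local SSD to the RAID1 so md copies every block back
        from the persistent disk; no streaming from other nodes is needed."""
        subprocess.run(["mdadm", "--manage", ARRAY, "--add", NEW_NVME], check=True)
        # Progress shows up in /proc/mdstat; the node should stay stopped
        # (or at least not serve reads) until the resync completes.

    if __name__ == "__main__":
        resync_local_ssd()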

Node failure is a regular occurrence, not a "bad" thing, and recovery is something we intend to fully automate. A node should be able to fail and recover without anyone noticing.


Thanks for the info, that's exactly what I thought the recovery process should be. Node failure isn't bad until it's a cascading catastrophe :)

But by "bad" I meant the case where a node is based purely on local disks, which is what Cassandra and ScyllaDB usually recommend.

Depending on the time between snapshots, the restore-from-snapshot process (if there are even snapshots...) can be problematic.

Bootstrapping nodes from zero, depending on the cluster size (big data nodes, while not recommended, are pretty common), could take days in Cassandra because the streaming implementation was (maybe still is?) very bad.


ScyllaDB is luckily much faster. We can rebuild a ~5TB node in a matter of hours.
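
As a back-of-envelope check (the effective streaming rate below is an assumed number for illustration, not a measurement from this cluster):

    # Rough arithmetic only; 10 Gb/s effective throughput is an assumption.
    data_tb = 5
    effective_gbit_per_s = 10

    seconds = (data_tb * 1e12 * 8) / (effective_gbit_per_s * 1e9)
    print(f"~{seconds / 3600:.1f} hours to move {data_tb} TB at {effective_gbit_per_s} Gb/s")
    # -> ~1.1 hours as a theoretical floor; real rebuilds also pay for
    #    compaction and throttling, so "a matter of hours" is plausible.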


Doesn't streaming affect the node's network? Isn't that an issue, or do you have a dedicated NIC for inter-node communication? Or do you just limit the streaming bandwidth?

Thanks.


No dedicated NICs; the link speeds are fast enough that we don't really need to worry about this.

It's also worth mentioning that in a cluster of N nodes, to recover one node, each of its N - 1 neighbors only needs to stream roughly 1/(N - 1) of that node's data. So when you look at the cluster as a whole, the extra strain on each node that is UP and serving traffic is insignificant.
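
Putting rough numbers on that (the node size and cluster size here are assumptions for illustration):

    # Illustrative only: node data size and cluster size are assumed.
    node_data_tb = 5
    cluster_size = 12

    per_peer_tb = node_data_tb / (cluster_size - 1)
    print(f"Each of the {cluster_size - 1} surviving nodes streams ~{per_peer_tb:.2f} TB")
    # -> ~0.45 TB per peer, a small fraction of what each node already holds.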



