Clever trick. Having dealt with very similar things using Cassandra, I'm curious how this setup will react to a failure of a local Nvme disk.
They say that GCP will kill the whole node, which is probably a good thing if you can be sure it does that quickly and consistently.
If it doesn't (or not fast enough) you'll have a slow node amongst faster ones, creating a big hotspot in your database. Cassandra doesn't work very well if that happens and in early versions I remember some cascading effects when a few nodes had slowdowns.
That's good observation. We've spent a lot of time on our control plane which handles the various RAID1 failure modes, e.g. when a RAID1 degrades due to failed local SSD, we force stop the node so that it doesn't continue to operate as a slow node. Wait for part 2! :)
that's always a risk when using local drives and needing to rebuild when a node dies but I guess they can over provision in case of one node failure in cluster until the cache is warmed up
Edit:
Just wanted to add that because they are using Persistent Disks as the source of truth and depending on the network bandwidth it might not be that big of a problem to restore a node to a working state if it's using a quorum for reads and RP >= 3.
Resorting a Node from zero in case of disk failure will always be bad.
They could also have another caching layer on top of the cluster to further mitigated the latency issue until the nodes gets back to health and finishes all the hinted handoffs.
We have two ways of re-building a node under this setup.
We can either re-build the node by simply wiping its disks, and letting it stream in data from other replicas, or we can re-build by simply re-syncing the pd-ssd to the nvme.
Node failure is a regular occurrence, it isn't a "bad" thing, and something we intend to fully automate. Node should be able to fail and recover without anyone noticing.
Thanks for the info that's exactly what I thought the recovery process should be.
Node failure isn't bad until it's a cascading catastrophe :)
But by bad I meant when a Node is based on local disks, that Cassandra and ScyllaDB usually recommends.
Depending on the time between snapshots and the restore from snapshot process (if there are even snapshots...) can be problematic.
Bootstrapping nodes from zero depending on the cluster size (Big data nodes while not recommended are pretty common) could take days in Cassandra because the streaming implements was (maybe still is?) very bad
Doesn't streaming affect the node network?
Isn't that an issue or do you have a dedicated NIC for intra node communications? or do you just limit the streaming bandwidth?
No dedicated NICs, the link speeds are fast enough to not really worry about this.
It's also worth mentioning that in a cluster of N, to recover a node, it simply needs to stream 1/(N - 1) of the data from its neighboring nodes. So when you look at the cluster as a whole, and the strain on each node that is UP and serving traffic, it's insignificant.
They say that GCP will kill the whole node, which is probably a good thing if you can be sure it does that quickly and consistently.
If it doesn't (or not fast enough) you'll have a slow node amongst faster ones, creating a big hotspot in your database. Cassandra doesn't work very well if that happens and in early versions I remember some cascading effects when a few nodes had slowdowns.