> failover as STONITH (Shoot The Other Node In The Head) What functional consens...

grogers · on April 17, 2023

If your consensus protocol requires that, it is probably broken. If you can't rely on a node to shut itself down, then you almost certainly can't rely on an external trigger to do it. Paxos, Raft, etc. work just fine as long as failures are non-Byzantine. Achieving non-Byzantine failures is definitely not always possible (e.g. someone hacking your server and reprogramming it to subvert the protocol), but checksums on disk and network go most of the way.
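
To illustrate the checksum point, here is a minimal sketch of how a checksum can turn silent corruption into a plain "record is missing" failure, which is the kind of fault Paxos and Raft are designed to tolerate. The `frame` and `unframe` helpers and the 4-byte CRC32 prefix layout are assumptions made for illustration, not part of any particular consensus implementation. CRC32 only catches accidental corruption; it does nothing against a deliberately malicious node.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 so corruption can be detected later."""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def unframe(record: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    if len(record) < 4:
        raise ValueError("record too short to hold a checksum")
    (expected,) = struct.unpack(">I", record[:4])
    payload = record[4:]
    if zlib.crc32(payload) != expected:
        # Hypothetical policy: a corrupted record is treated as lost,
        # so the node never acts on (or forwards) bad data.
        raise ValueError("checksum mismatch: treat the record as lost")
    return payload
```

A node that reads a record from disk or the network would call `unframe` and, on `ValueError`, behave as if the record never arrived rather than passing garbage to its peers.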

olluk · on April 17, 2023

Perhaps the multi-master approach is an example of a system where incoherence does not mean terminal illness.

remram · on April 18, 2023

Most consensus algorithms assume that misbehaving nodes exhibit only some subset of possible behaviors. The algorithms that don't make that assumption are called "Byzantine" fault tolerant and are a very short list; they cover, e.g., the situation where a node can lie and maliciously try to misinform other nodes about the state of the system.

If you can tell that a node failed, there are usually other opportunities for circuit-breaking than shooting it, such as at the hypervisor, load-balancer, or even clients.
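
As a rough sketch of circuit-breaking at the client, the following shows one way a client could stop sending requests to a node it has seen fail, without anyone having to power that node off. The `CircuitBreaker` class, its thresholds, and the injectable `clock` parameter are all hypothetical names chosen for this example.

```python
import time


class CircuitBreaker:
    """Stops calls to a node after repeated failures, then retries after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def allow(self) -> bool:
        """Return True if a request to the node may be attempted now."""
        if self.opened_at is None:
            return True
        # Once the cool-down has elapsed, requests are let through as probes;
        # a failed probe re-opens the circuit via record_failure().
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        """Close the circuit and forget past failures."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure and open the circuit once the threshold is reached."""
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

A client would check `allow()` before each request and call `record_success()` or `record_failure()` afterwards; the same idea applies at a load balancer that tracks health per backend.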