Oh boy! I had a similar cascading failure situation once with a Nagios "cluster" I inherited. The previous engineer distributed the work between a master and 3 slave nodes, with a backup mechanism such that if any of the slaves died, its load would go to the master. This was fine when he first created it, but as more slaves were added, the master ended up running at capacity just dealing with the incoming data. So with each additional slave node, the probability of one of them failing and sending its load to overwhelm the master increased. Sometimes a poorly designed distributed system is worse than a single big server.
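To put a rough number on that, here is a small sketch of how the chance of at least one slave being down grows with the node count. It assumes failures are independent and that every slave has the same failure probability, which is a simplification; the 1% figure and the function name `p_any_failure` are made up for illustration and are not from the real cluster.

```python
def p_any_failure(n_slaves: int, p_fail: float) -> float:
    """Probability that at least one of n_slaves is down.

    Assumes independent failures with the same per-node probability
    p_fail (an illustrative assumption, not measured data).
    """
    return 1.0 - (1.0 - p_fail) ** n_slaves


if __name__ == "__main__":
    # Hypothetical per-node failure probability of 1%.
    for n in (3, 6, 12):
        print(f"{n} slaves: {p_any_failure(n, 0.01):.4f}")
```

The risk climbs with every node added, while the master's spare capacity to absorb a failed node's load shrinks at the same time.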

I ended up using Consul to do leader election (only for the alerting bit) and to monitor the health of all the nodes in the cluster. If one of them failed, its load was redistributed equally among the remaining nodes.
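The redistribution part can be sketched roughly like this. It is not the code I actually ran, and it deliberately leaves out the Consul side: the list of healthy nodes is assumed to come from whatever health lookup you use, and the names `redistribute`, `node-a`, `check-00` and so on are hypothetical.

```python
from typing import Dict, List


def redistribute(checks: List[str], healthy_nodes: List[str]) -> Dict[str, List[str]]:
    """Spread checks as evenly as possible over the healthy nodes.

    healthy_nodes is assumed to be supplied by an external health
    monitor (hypothetical input; no Consul calls are made here).
    """
    if not healthy_nodes:
        raise ValueError("no healthy nodes left to take the load")
    # Sort both lists so every node computes the same assignment.
    nodes = sorted(healthy_nodes)
    assignment: Dict[str, List[str]] = {node: [] for node in nodes}
    # Round-robin: node sizes differ by at most one check.
    for i, check in enumerate(sorted(checks)):
        assignment[nodes[i % len(nodes)]].append(check)
    return assignment


# Hypothetical example: 12 checks, then node-c drops out.
checks = [f"check-{i:02d}" for i in range(12)]
before = redistribute(checks, ["node-a", "node-b", "node-c", "node-d"])
after = redistribute(checks, ["node-a", "node-b", "node-d"])
```

The point of the design is that a dead node's load gets split across all survivors instead of landing on a single machine. A plain round-robin like this reshuffles more checks than strictly necessary when membership changes, so something like consistent hashing might be a better fit if moving checks around is expensive.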