> The failures should be relatively rare; when I say relatively I mean on the level of natural node failure.
And exactly how rare do you believe this to be?
In my experience, node failures at the scale of hundreds to thousands of nodes happen monthly to weekly, if not daily. Generally speaking, stability is spread fairly evenly across node age: young, new instances experience failure rates similar to old instances. If you have any sort of maximum node lifetime (for example, a week) or scale dynamically on a daily basis, then you'll see a lot of failures.
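As a rough back-of-the-envelope sketch of that last point (all numbers below are assumptions for illustration, not from the thread): with a 1,000-node fleet and a one-week maximum lifetime you relaunch on the order of 140 nodes a day, so even a small per-launch failure probability makes failures a near-daily event.

```python
# Hypothetical numbers: how often you see at least one node failure when a
# fleet churns daily. Assumes an independent per-launch failure probability,
# which real fleets only approximate.

fleet_size = 1_000          # nodes in the fleet (assumed)
max_lifetime_days = 7       # maximum node lifetime (assumed)
p_fail_per_launch = 0.005   # chance a freshly launched node fails (assumed)

launches_per_day = fleet_size / max_lifetime_days      # ~143 new nodes/day
expected_failures_per_day = launches_per_day * p_fail_per_launch

# Probability of seeing at least one launch failure on a given day.
p_at_least_one = 1 - (1 - p_fail_per_launch) ** launches_per_day

print(f"launches/day:            {launches_per_day:.0f}")
print(f"expected failures/day:   {expected_failures_per_day:.2f}")
print(f"P(>=1 failure in a day): {p_at_least_one:.0%}")
```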
Which still means you could implement a hard limit of one automatic failover per hour and only allow further replacements with manual intervention. With a thousand nodes, having several, let alone hundreds, fail within a few hours is so unlikely that you're probably better off suppressing automatic failover in those cases.
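A minimal sketch of that kind of gate, not tied to any particular failover tool (class name and thresholds are assumptions):

```python
import time

class FailoverGate:
    """Allow at most N automatic failovers per time window; beyond that,
    require manual intervention."""

    def __init__(self, max_per_window=1, window_seconds=3600):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.recent = []  # timestamps of recent automatic failovers

    def allow_automatic_failover(self, now=None):
        now = time.time() if now is None else now
        # Drop events that have aged out of the window.
        self.recent = [t for t in self.recent if now - t < self.window_seconds]
        if len(self.recent) >= self.max_per_window:
            # Over the limit: a human has to approve further replacements.
            return False
        self.recent.append(now)
        return True

gate = FailoverGate()
if gate.allow_automatic_failover():
    print("proceed with automatic failover")
else:
    print("limit reached: page an operator for manual intervention")
```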
But that generally mirrors my experience that automatic failover for stable software tends to cause more issues than it solves. A well-built (i.e. redundant hardware and software) PostgreSQL server is so unlikely to fail that a false failure detection, or a cascading problem triggered by automatic failover, is more likely than the failover ever being genuinely needed.
I think you're looking at it the wrong way. A server is never just Postgres or memcached; there's always other stuff running, and it's that other stuff that causes problems. Maybe you're patching the fleet and a node fails to come back up, or a misconfiguration lets the disk fill up.
I'd argue that stable systems are actually worse for operational stability: you get complacent and comfortable, and when shit hits the fan you're unprepared.