
> And you can’t “automate” away the rare things, even the technical ones. By their nature they’re difficult to define, hence difficult to monitor, and difficult to repair without the forensic skills of a human engineer.

There are two ways to think about automating response to technical problems: a reactive way and a proactive way. Reactive automation looks to diagnose, repair, or work around system faults as they happen, within the constraints of the design of the system. The proactive approach happens early, at design or architecture time, in designing and building the system in such a way that it is resilient to rare failures and designed to be automatically fixed.

For example, think about a standard primary-failover database system with asynchronous replication. Reactive automation would monitor the primary, ensure it stayed down once it was deemed unhealthy, promote the secondary, provision a new secondary, and handle the small window of data loss. This works great for "fail stop" failures of the primary, and can often recover systems within seconds. Where it becomes much more difficult is when the primary is just slower than usual, or if there's packet loss, or if it's not clear whether it's going to come back up. Just the step of "make sure this primary doesn't come up thinking it's still the primary" can be very tricky.
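The reactive loop described above can be sketched roughly as follows. This is a minimal illustration, not a real failover controller: the `Node` interface (`is_healthy`, `fence`, `promote`) is hypothetical, and the sketch deliberately handles only the easy "fail stop" case, requiring several consecutive failed health checks before acting.

```python
class FailoverController:
    """Sketch of reactive failover for a primary/secondary pair.

    Assumes hypothetical node objects exposing is_healthy(), fence(),
    and promote(). Real systems must also cope with the ambiguous
    cases in the text: slow primaries, packet loss, and nodes whose
    fate is unknown.
    """

    def __init__(self, primary, secondary, max_failures=3):
        self.primary = primary
        self.secondary = secondary
        self.max_failures = max_failures  # consecutive failed checks before failing over
        self.failures = 0

    def check(self):
        """Run one health-check cycle; return True if a failover occurred."""
        if self.primary.is_healthy():
            self.failures = 0
            return False
        self.failures += 1
        if self.failures < self.max_failures:
            return False  # not yet confident the primary is really down
        # The tricky step: ensure the old primary stays down and cannot
        # come back believing it is still the primary.
        self.primary.fence()
        self.secondary.promote()
        self.primary, self.secondary = self.secondary, None
        self.failures = 0
        return True
```

Note that everything hard lives in `fence()`: deciding a node is unhealthy is easy compared to guaranteeing it can never again act as primary.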

A more proactive approach would look at those hard problems and try to prevent them by design. Could we design the replication protocol in such a way that the old primary is automatically fenced out? Could we balance load in a way that slower servers automatically get less traffic? Should we prefer synchronous replication, or even active-active, to avoid the hard question of what to do with lost data or slow primaries?
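One common design-time answer to the fencing question is to stamp every write with a monotonically increasing epoch (or fencing token) granted at promotion time. The sketch below is illustrative, with hypothetical names; the point is that a stale primary's writes are rejected by the protocol itself, rather than cleaned up reactively after the fact.

```python
class ReplicaStore:
    """Sketch of epoch-based fencing built into the replication protocol.

    A new primary is granted a higher epoch at promotion, and replicas
    reject any write stamped with a lower epoch. An old primary that
    "comes back" is fenced out automatically, by design.
    """

    def __init__(self):
        self.epoch = 0  # highest epoch this replica has seen
        self.data = {}

    def grant_epoch(self):
        """Called during promotion: the new primary receives a fresh epoch."""
        self.epoch += 1
        return self.epoch

    def write(self, epoch, key, value):
        """Accept a write only if its epoch is at least the current one."""
        if epoch < self.epoch:
            return False  # stale primary: write rejected by protocol
        self.epoch = epoch
        self.data[key] = value
        return True
```

With this in place, "make sure the old primary stays down" stops being an operational scramble: even if the old primary wakes up and tries to write, its stale epoch makes the write a no-op.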

Looking through a certain lens, the answers to all those questions just add complexity to something simple. But, from another perspective, they take the hardest things (weird at-scale failures) and turn them into something much simpler to handle. It's an explicit trade-off between system simplicity and operational simplicity. Folks without experience running actual systems tend to have poor intuition about this trade-off. The only way I've seen to help people build good intuition, and make the right decisions, is to get the same folks who are building large-scale systems deeply involved in the operations of those systems.



