Among experienced teams, most failures aren't caused by single-node/single-service errors. They've already designed & tested for those cases, and the ability to handle them is baked into the architecture.

The interesting failures are caused by a cascade of errors: someone writes an innocent-looking bug, which causes a single-node fault, which exercises some pathway in the fault recovery code that has unintended side effects, which results in an unexpected condition elsewhere in the system.