or the dev who spent a few nights and weekends rescuing the system after one of those 1% failures, which the customer, as it turns out, has no patience for at all
Disaster recovery is just one of many things that are much simpler in non-distributed systems.
You seem to be confusing a system that produces bad results 1% of the time with a system that's down 1% of the time. If you can only write the first kind of non-distributed system, you're in for a bad trip if you try to write a distributed equivalent.