Prevention is a big part of SRE, but an equally big part is formalizing a process to learn from the inevitable outages that come with running a large, complex, distributed system built by fallible humans.
You figure out what went wrong and fix it, of course, but more importantly, you figure out where your existing systems and processes (failover, monitoring, incident response, etc.) did and didn't work, and you improve them for the next time.