This is a well-written post-mortem for public reading. I encourage people to read through it.

As someone who currently works in the payments space and relies on gateways, I have gone through several similar outages: we detected a gateway issue causing an outage, notified the gateway, who ack’d... and then we waited. More than once, like Monzo, we built a workaround on our end before the gateway provider could even mitigate the outage.

Hats off to the Monzo team, who clearly have a solid on-call and incident mitigation strategy in place. They detected the outage within 4 minutes, then built the best workaround they could and deployed it within 2 hours, while it took the gateway provider 9 hours just to mitigate the change that caused the issue in the first place. Granted, the issue seemed complex, but this is still slow.

Unfortunately, in cases like this, the best one can do is make sure there is a clear SLA in place with the third party, backed by a contract stating financial liability if the third party fails to meet it. Monzo will not tell us much about this part, but I suspect the gateway will have to pay a hefty fee to Monzo, as the gateway's availability dropped to under 99% for this month, which under a well-written contract should trigger payments or fee reductions from the third party. It is good to see they are pushing the third party to do a proper post-mortem and take preventive actions, as well as holding them accountable.
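
For anyone wondering where the "under 99%" figure comes from, here is a rough back-of-the-envelope check. It is only a sketch under my own assumptions: that the gateway counts as fully unavailable for the whole 9 hours it took to mitigate, and that the month has 30 days. The function name and constants are mine, not anything from the post-mortem or a real SLA contract.

```python
# Rough availability estimate.
# Assumptions (mine, not from the post-mortem): the gateway is treated as
# fully down for the entire 9-hour mitigation window, in a 30-day month.
HOURS_IN_MONTH = 30 * 24  # 720 hours


def monthly_availability(outage_hours: float,
                         hours_in_month: float = HOURS_IN_MONTH) -> float:
    """Return availability over the month as a percentage."""
    return 100.0 * (1.0 - outage_hours / hours_in_month)


gateway_availability = monthly_availability(9)
print(f"{gateway_availability:.2f}%")  # 98.75%
```

By the same arithmetic, a 99% monthly target leaves roughly 7.2 hours of downtime budget in a 30-day month, so a single 9-hour incident would use it all up. If the outage was only partial, the real number would be higher, depending on how the contract measures availability.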

Nice work!