That was nuts. Interested to read the post-mortem on this one. Our site went down as well. What could cause a sudden all-region meltdown like that? Aren't regions supposed to be more isolated to prevent this type of thing?
Seems to have only been down for about 10 minutes, so I'm thinking some sort of misconfiguration that got deployed everywhere... they were working to fix a VPN issue in a specific region right before it went down...
Our website was down as well, for 16 minutes. My guess is that it was a bad route that got pushed out everywhere simultaneously (probably not intentionally). It happened once before, sometime last year, if I remember correctly. We'll have to wait and see what the definitive cause was, though.
You get an AS number, and announce your own IP space. DNS failover only sort-of works.
Or you subscribe to a "GSLB" service where they do this for you for a significant fee. Or you use a "man-in-the-middle as a service" system like Cloudflare, who do it for a very reasonable price, or even for free.
Of course, you still have to deal with the risk of route leaks, BGP route flapping/dampening, and other things which can take your IP addresses offline despite the fact you are multihoming with different carriers in different locations.
So perhaps you set up IP addresses on different ASNs and use both DNS- and IP-based failover.
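For what it's worth, the DNS half of that is just a health check that flips a low-TTL record to a backup address when the primary stops answering. Here's a minimal sketch in Python, where `set_dns_target()` stands in for whatever your DNS provider's update API actually looks like (it and the IPs are hypothetical):

```python
import time
import urllib.request

# Hypothetical addresses for the primary and backup deployments.
PRIMARY_IP = "192.0.2.10"
BACKUP_IP = "198.51.100.10"
HEALTH_PATH = "/healthz"  # assumed health-check endpoint


def is_healthy(ip: str) -> bool:
    """Return True if the instance at `ip` answers its health check."""
    try:
        with urllib.request.urlopen(f"http://{ip}{HEALTH_PATH}", timeout=3) as resp:
            return resp.status == 200
    except OSError:
        return False


def set_dns_target(ip: str) -> None:
    """Placeholder: point the low-TTL A record at `ip` via your DNS provider's API."""
    print(f"would update A record -> {ip}")


current = PRIMARY_IP
while True:
    # Fail over to the backup when the primary looks dead, fail back when it recovers.
    desired = PRIMARY_IP if is_healthy(PRIMARY_IP) else BACKUP_IP
    if desired != current:
        set_dns_target(desired)
        current = desired
    time.sleep(10)
```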
But then you find a bug somewhere in your software stack which makes all of this redundancy completely ineffective. So you just take your ball, go home and cry.
You put it in all your clouds, with low TTL DNS entries pointing at all those instances (or the closest one geographically maybe). Then if you're really paranoid you use redundant DNS providers as well.
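Concretely, the "low TTL pointing at all those instances" part is just a round-robin A record set with a short TTL. A sketch using Route 53 via boto3 as one possible provider (zone ID, hostname, and addresses below are placeholders):

```python
import boto3

# Placeholder values: your hosted zone, hostname, and per-cloud instance IPs.
HOSTED_ZONE_ID = "ZEXAMPLE123"
RECORD_NAME = "www.example.com."
INSTANCE_IPS = ["192.0.2.10", "198.51.100.20", "203.0.113.30"]

route53 = boto3.client("route53")

# Publish one A record per instance with a 60 second TTL, so clients that
# actually honor the TTL can move to the surviving instances quickly.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "round-robin across clouds, low TTL",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "A",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": ip} for ip in INSTANCE_IPS],
                },
            }
        ],
    },
)
```

For the "redundant DNS providers" part you'd push the same record set through a second provider's API as well.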
And then you discover that there are a LOT of craptastic DNS resolvers, middle boxes, AND ISP DNS servers out there that happily ignore or rewrite TTLs. With a high-volume web service, you can have a 1 minute TTL, change your A records, and still see a lovely long tail of traffic hitting the old IP for HOURS.
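You can at least see the resolver side of this by asking a handful of public resolvers for the same name and comparing the TTL they hand back against what you publish. A rough sketch with dnspython (the resolver list and hostname are just examples, and this won't catch clients or middleboxes that simply ignore the TTL):

```python
import dns.exception
import dns.resolver  # pip install dnspython

NAME = "www.example.com"   # example hostname
AUTHORITATIVE_TTL = 60     # the TTL you actually publish
RESOLVERS = {              # a few well-known public resolvers
    "Google": "8.8.8.8",
    "Cloudflare": "1.1.1.1",
    "Quad9": "9.9.9.9",
}

for label, server in RESOLVERS.items():
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    resolver.lifetime = 5
    try:
        answer = resolver.resolve(NAME, "A")
    except dns.exception.DNSException as exc:
        print(f"{label}: lookup failed ({exc})")
        continue
    ttl = answer.rrset.ttl
    ips = ", ".join(rr.address for rr in answer)
    # Cached answers count the TTL down, so anything *above* your published
    # value is a hint that the resolver is rewriting or clamping TTLs.
    flag = "" if ttl <= AUTHORITATIVE_TTL else "  <-- TTL inflated?"
    print(f"{label}: {ips} (ttl={ttl}){flag}")
```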
The point was that adding another potential point of failure still won't reduce the chance of failure... it's just something else that can and will break.
In any case, failures happen, and most systems are better off being as simple as possible and accepting the unforeseen failures than trying to add complexity to overcome them.