Does anyone know how often an AZ experiences an issue as compared to an entire region? AWS sells the redundancy of AZs pretty heavily, but it seems like a lot of the issues that happen end up being region-wide. I'm struggling to understand whether I should be replicating our service across regions or whether the AZ redundancy within a region is sufficient.
I've been naively setting up our distributed databases in separate AZs for a couple of years now, sometimes paying thousands of dollars per month in data-replication egress fees. As far as I can remember I've never seen an AZ go down, and the only region that has gone down has been us-east-1.
There was an AZ outage in Oregon a couple of months back. For production workloads that need to be highly available, you should go multi-AZ without hesitation. You can easily lose a system permanently in a single-AZ setup if it isn't ephemeral.
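For RDS at least, multi-AZ is a single flag at creation time (or on a later modify), so there isn't much excuse to skip it. A minimal boto3 sketch, with placeholder identifiers and credentials rather than anything from a real setup:

```python
# Minimal sketch: Multi-AZ RDS Postgres via boto3. All names and values are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-west-2")

rds.create_db_instance(
    DBInstanceIdentifier="prod-postgres",   # hypothetical identifier
    Engine="postgres",
    DBInstanceClass="db.m6g.large",         # pick whatever fits your workload
    AllocatedStorage=100,
    MasterUsername="dbadmin",
    MasterUserPassword="change-me",         # use Secrets Manager in practice
    MultiAZ=True,                           # standby in a second AZ with automatic failover
)
```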
The stuff that's exclusively hosted in us-east-1 is, to my knowledge, mostly things that maintain global uniqueness: CloudFront distributions, Route53, S3 bucket names, IAM roles, and similar, i.e. singular control planes. Other than that, regions are about as isolated as it gets, except for specific features on top.
Availability zones are supposed to be another fault boundary, and things are generally pretty solid, but every so often problems spill over when they shouldn't.
The general impression I get is that us-east-1's issues tend to stem from it being singularly huge.
If I recall correctly, there was a point in time when the control plane for all regions was in us-east-1. I seem to recall an outage where the other regions were up, but you couldn't change any resources because the management API was down in us-east-1.
Literally all our AWS resources are in EU/UK regions - and they all continued functioning just fine - but we couldn't sign in to our AWS console to manage said resources.
Thankfully the outage didn't impact our production systems at all, but our inability to access said console was quite alarming to say the least.
It would probably be clearer that the regional console endpoints exist if the console redirected to the regional URL when you switched regions.
STS, S3, etc. have regional endpoints too, which have kept working when us-east-1 has broken in the past, and the various AWS clients can be configured to use them, though sadly they don't tend to do so by default.
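For what it's worth, here's roughly what pinning boto3 to a regional STS endpoint looks like; the region is just an example:

```python
# Sketch: force STS calls to a regional endpoint instead of the default global one.
import boto3

sts = boto3.client(
    "sts",
    region_name="eu-west-2",
    endpoint_url="https://sts.eu-west-2.amazonaws.com",  # regional STS endpoint
)
print(sts.get_caller_identity()["Arn"])

# Alternatively, set AWS_STS_REGIONAL_ENDPOINTS=regional (or
# sts_regional_endpoints = regional in ~/.aws/config) and let the SDK
# resolve sts.<region>.amazonaws.com on its own.
```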
AWS has been getting a pass on their stability issues in us-east-1 for years now because it's their "oldest" region. Maybe they should invest in fixing it instead of inventing new services to sell.
I certainly wouldn't describe it as “a pass” given how commonly people joke about things like “friends don't let friends use us-east-1”. There's also a reporting bias: because many places only use us-east-1, you're more likely to hear about it even if it only affects a fraction of customers, and many of those companies blame AWS publicly because that's easier than admitting that they were only using one AZ, etc.
These big outages are noteworthy because they _do_ affect people who correctly architected for reliability — and they're pretty rare. This one didn't affect one of my big sites at all; the other was affected by the S3 / Fargate issues but the last time that happened was 2017.
That certainly could be better, but so far it hasn't been enough to be worth the massive cost increase of using multiple providers, especially if you can have some basic functionality served by a CDN when the origin is down (true for the kinds of projects I work on). GCP and Azure have had their share of extended outages too, so most of the major providers tend to be careful not to cast stones about reliability, and it's _much_ better than what the median IT department can offer.
I agree with you, but my services are actually in Canada (Central). There's only one region in Canada, so I don't really have an alternative. AWS justifies it by saying there are three AZs (distinct data centres) within Canada (Central), but I get scared when I see these region-wide issues. If the AZs were truly independent, you wouldn't see region-wide issues.
Take DynamoDB as an example. The AWS managed service takes care of replicating everything to multiple AZs for you, which is great: you're very unlikely to lose your data. But the DynamoDB team is running a mostly-regional service. If they push bad code or fall over, it's likely going to be a regional issue. Probably only the storage nodes are truly zonal.
If you wanted to deploy something similar, like Cassandra across AZs or even regions, you're welcome to do that. But now you're on the hook for the availability of the system. Are you going to get higher availability running your own Cassandra deployment than the DynamoDB team does? Maybe. DynamoDB had a pretty big outage in 2015, I think. But that's a lot more work than just using DynamoDB, IMO.
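To make the contrast concrete: creating a DynamoDB table exposes no AZ or replication knobs at all, the service handles placement itself, whereas with self-managed Cassandra the zone-aware replication strategy is entirely your problem. A small sketch (the table name is made up):

```python
# Sketch: a DynamoDB table via boto3. Note there is nothing to configure about
# AZs or replication -- the service spreads the data across zones on its own.
import boto3

ddb = boto3.client("dynamodb", region_name="us-east-1")

ddb.create_table(
    TableName="orders",  # hypothetical table
    AttributeDefinitions=[{"AttributeName": "order_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "order_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)
```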
> But, the DynamoDB team is running a mostly-regional service.
This is both more and less true than you might think. For most regional endpoints, teams use load balancers that are scoped zonally, such that ip0 points at instances in zone A, ip1 points at instances in zone B, and so on. Similarly, teams who operate "regional" endpoints will generally deploy "zonal" environments, so that in the event of a bad code deploy they can fail that zone away for customers.
That being said, these mitigations still don't stop regional poison pills and the like from infecting other AZs unless the service is architected to be zonal internally.
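You can at least see the multiple addresses behind a regional endpoint from the outside, even though which zone sits behind each IP isn't visible to you. A quick sketch:

```python
# Sketch: list the A records behind a regional endpoint. Per the description above,
# each address typically fronts a zonally scoped load balancer, though the mapping
# from IP to AZ isn't observable from outside AWS.
import socket

host = "dynamodb.us-east-1.amazonaws.com"  # any regional endpoint works here
addrs = sorted({info[4][0] for info in socket.getaddrinfo(host, 443, socket.AF_INET)})
for ip in addrs:
    print(ip)
```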
Yeah, teams go to a lot of effort to have zonal environments/fleets/deployments... but there are still many, many regional failure modes. For example, even in a foundational service like EC2 most of their APIs touch regional databases.
It can be a bit hard to know, since the AZ identifiers are randomized per account: if you think you have problems in us-west-1a, I can't check on my side, because my us-west-1a is probably a different physical zone. You can pull the AZ IDs out of your account to de-randomize things so we can compare notes, but people rarely bother, for whatever reason.
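For anyone who does want to compare notes, the mapping from your account's randomized AZ names to the stable zone IDs is one API call away (the region here is just an example):

```python
# Sketch: map this account's AZ names (us-west-1a, ...) to stable zone IDs (usw1-az1, ...).
import boto3

ec2 = boto3.client("ec2", region_name="us-west-1")
for az in ec2.describe_availability_zones()["AvailabilityZones"]:
    print(f'{az["ZoneName"]} -> {az["ZoneId"]}')
```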
Over two years I think we saw about 2-3 AZ issues, but only one that I would consider an outage.
Usually it was high network error rates, which were enough to make RDS Postgres fail over if the primary was in the impacted AZ.
The only real "outage" was DNS having extremely high error rates in a single us-east-1 AZ, to the point that most things there were barely working.
Lack of instance capacity, especially for spot, was also common for CI (it used ASGs for builder nodes). It was pretty common for a single AZ to run out of a spot instance type, especially the NVMe-backed ones (the [a-z]#d types).
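In case it helps anyone hitting the same thing: spreading the ASG across more AZs and a few interchangeable NVMe-backed instance types usually widens the spot pool enough. A rough boto3 sketch, with placeholder names, subnets, and launch template:

```python
# Sketch: an ASG that can fall back across several AZs and instance types when
# spot capacity for one NVMe type dries up in a zone. All names are placeholders.
import boto3

asg = boto3.client("autoscaling", region_name="us-east-1")

asg.create_auto_scaling_group(
    AutoScalingGroupName="ci-builders",                    # hypothetical group
    MinSize=0,
    MaxSize=20,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb,subnet-ccc",  # one subnet per AZ
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "ci-builder",        # hypothetical template
                "Version": "$Latest",
            },
            # Several interchangeable NVMe-backed types widen the spot pool.
            "Overrides": [
                {"InstanceType": "m5d.2xlarge"},
                {"InstanceType": "m5dn.2xlarge"},
                {"InstanceType": "r5d.2xlarge"},
            ],
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 0,      # all spot above baseline
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)
```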