
Does anyone know how often an AZ experiences an issue as compared to an entire region? AWS sells the redundancy of AZs pretty heavily, but it seems like a lot of the issues that happen end up being region-wide. I'm struggling to understand whether I should be replicating our service across regions or whether the AZ redundancy within a region is sufficient.


I've been naively setting up our distributed databases in separate AZs for a couple of years now, sometimes paying thousands of dollars per month in data replication bandwidth fees. As far as I can remember, I've never seen an AZ go down, and the only region that has gone down has been us-east-1.


There was an AZ outage in Oregon a couple of months back. You should definitely go multi-AZ without hesitation for production workloads that need to be highly available. You can easily lose a system permanently in a single-AZ setup if it isn't ephemeral.


AZs definitely go down. It's usually due to a physical reason like fire or power issues.


> I've never seen an AZ go down, and the only region that has gone down has been us-east-1.

Doesn't the region going down mean that _all_ its AZs have gone down? Or is my mental model of this incorrect?


No. See https://aws.amazon.com/about-aws/global-infrastructure/regio...

A region is a geographic and networking boundary. An AZ is a group of one or more data centers, and a region contains several AZs (typically three) in the same metro area, more or less.

If a region goes down or is otherwise impacted, all of its AZs are unavailable or degraded.

If a single AZ goes down, only the VMs you have in those data centers are directly disrupted.

It's the difference between loss of service and actual data loss.


Is that separate AZs within the same region, or AZs across regions? I didn't think there were any bandwidth fees between AZs in the same region.


It's $0.01/GB for cross-AZ transfer within a region.


In reality it's more like $0.02/GB. You pay $0.01 on sending and $0.01 on receiving. I have no idea why ingress isn't free.
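As a rough back-of-the-envelope sketch, using the $0.01/GB on each side quoted above (the 50 TB/month replication volume is just an illustrative assumption):

    # Rough cross-AZ replication cost, charged on both the sending and
    # receiving side. The traffic volume is a made-up example.
    gb_per_month = 50 * 1024          # ~50 TB of replication traffic
    rate_out = 0.01                   # $/GB billed to the sender
    rate_in = 0.01                    # $/GB billed to the receiver
    print(f"${gb_per_month * (rate_out + rate_in):,.2f}/month")  # ~$1,024/month

That lines up with the "thousands of dollars per month" figure mentioned upthread once you replicate a few databases.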


Plus the support percentage, don't forget.


That is incorrect. Cross-AZ fees are steep.


The main issue is that a lot of AWS internal components live in us-east-1; it's also the oldest region.

So when failures happen in that region (and they happen there more often than in others, due to its age, scale, and complexity), they can be globally impacting.


The stuff that's exclusively hosted in us-east-1 is, to my knowledge, mostly things that maintain global uniqueness: CloudFront distributions, Route53, S3 bucket names, IAM roles and similar, i.e. singular control planes. Other than that, regions are about as isolated as it gets, except for specific features on top.

Availability zones are supposed to be another fault boundary, and things are generally pretty solid, but every so often problems spill over when they shouldn't.

The general impression I get is that us-east-1's issues tend to stem from it being singularly huge.

(Source: Work at AWS.)


If I recall correctly, there was a point in time when the control plane for all regions was in us-east-1. I seem to recall an outage where the other regions were up, but you couldn't change any resources because the management API was down in us-east-1.


This was our exact experience with this outage.

Literally all our AWS resources are in EU/UK regions - and they all continued functioning just fine - but we couldn't sign in to our AWS console to manage said resources.

Thankfully the outage didn't impact our production systems at all, but our inability to access said console was quite alarming to say the least.


The default region for global services, including https://console.aws.amazon.com, is us-east-1, but there are usually regional alternatives. For example: https://us-west-2.console.aws.amazon.com

It would probably be clearer that they exist if the console redirected to the regional URL when you switched regions.

STS, S3, etc. have regional endpoints too, which have continued to work when us-east-1 has been broken in the past. The various AWS clients can be configured to use them, which they sadly don't tend to do by default.
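As a minimal sketch of what that looks like with boto3 (the region and endpoint URL here are just examples, not a recommendation of any particular region):

    import boto3

    # The global STS endpoint (sts.amazonaws.com) is served out of us-east-1;
    # pointing the client at a regional endpoint keeps token/identity calls
    # working when us-east-1 is having a bad day.
    sts = boto3.client(
        "sts",
        region_name="us-west-2",
        endpoint_url="https://sts.us-west-2.amazonaws.com",
    )
    print(sts.get_caller_identity()["Account"])

    # For S3, specifying the region is enough to route to the regional endpoint.
    s3 = boto3.client("s3", region_name="us-west-2")

Newer SDKs also honor AWS_STS_REGIONAL_ENDPOINTS=regional (or sts_regional_endpoints = regional in ~/.aws/config) if you'd rather set it once instead of per client.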


AWS has been getting a pass on their stability issues in us-east-1 for years now because it's their "oldest" region. Maybe they should invest in fixing it instead of inventing new services to sell.


I certainly wouldn't describe it as “a pass” given how commonly people joke about things like “friends don't let friends use us-east-1”. There's also a reporting bias: because many places only use us-east-1, you're more likely to hear about it even if it only affects a fraction of customers, and many of those companies blame AWS publicly because that's easier than admitting that they were only using one AZ, etc.

These big outages are noteworthy because they _do_ affect people who correctly architected for reliability — and they're pretty rare. This one didn't affect one of my big sites at all; the other was affected by the S3 / Fargate issues but the last time that happened was 2017.

That certainly could be better, but so far it hasn't been enough to be worth the massive cost increase of using multiple providers, especially if you can have some basic functionality provided by a CDN when the origin is down (true for the kinds of projects I work on). GCP and Azure have had their share of extended outages, too, so most of the major providers tend to be careful not to cast stones about reliability, and it's _much_ better than the median IT department can offer.


From the original outage thread:

“If you're having SLA problems I feel bad for you son I got two 9 problems cuz of us-east-1”


If your availability depends on a single geographic availability zone, that's your own fault.


I agree with you, but my services are actually in Canada (Central). There's only one region in Canada, so I don't really have an alternative. AWS justifies it by saying there are three AZs (distinct data centres) within Canada (Central), but I get scared when I see these region-wide issues. If the AZs were truly independent, you wouldn't have region-wide issues.


Take DynamoDB as an example. The AWS managed service takes care of replicating everything to multiple AZs for you, which is great: you're very unlikely to lose your data. But, the DynamoDB team is running a mostly-regional service. If they push bad code or fall over, it's likely going to be a regional issue. Probably only the storage nodes are truly zonal.

If you wanted to deploy something similar yourself, like Cassandra across AZs or even regions, you're welcome to do that. But now you're on the hook for the availability of the system. Are you going to get higher availability running your own Cassandra deployment than the DynamoDB team does? Maybe. DynamoDB had a pretty big outage in 2015, I think. But that's a lot more work than just using DynamoDB, IMO.
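To make the "you're on the hook" part concrete, here's a hedged sketch of the self-managed route (assumes the cassandra-driver package and a reachable cluster; the seed IP, keyspace, and data-center names are illustrative and have to match whatever your snitch actually reports):

    from cassandra.cluster import Cluster

    # Connect to a hypothetical seed node of a self-managed Cassandra cluster.
    cluster = Cluster(["10.0.0.10"])
    session = cluster.connect()

    # NetworkTopologyStrategy places replicas per "data center"; with an
    # EC2-aware snitch those map to regions (and racks to AZs), so this
    # keeps three replicas in each of two regions, spread across AZs.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS orders
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'us-east-1': 3,
            'us-west-2': 3
        }
    """)

All of the capacity planning, repairs, failover, and upgrades are now yours to operate, which is exactly the trade-off being described.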


> But, the DynamoDB team is running a mostly-regional service.

This is both more and less true than you might think. For most regional endpoints, teams use load balancers that are scoped zonally, such that ip0 points at instances in zone a, ip1 points at instances in zone b, and so on. Similarly, teams who operate "regional" endpoints will generally deploy "zonal" environments, so that in the event of a bad code deploy they can fail away that zone for customers.

That being said, these mitigations still don't stop regional poison pills and the like from infecting other AZs unless the service is architected to be zonal internally.


Yeah, teams go to a lot of effort to have zonal environments/fleets/deployments... but there are still many, many regional failure modes. For example, even in a foundational service like EC2 most of their APIs touch regional databases.


Multiple AZs are more for earthquakes, fires[1], and similar disasters than for software issues.

[1] https://www.reuters.com/article/us-france-ovh-fire-idUSKBN2B...


Good news: a new region is coming to western Canada[0], ETA 2023/24.

[0]: https://aws.amazon.com/blogs/aws/in-the-works-aws-canada-wes...


It can be a bit hard to know, since the AZ names are mapped differently per account: if you think you have problems in us-west-1a, I can't check on my side, because my us-west-1a may be a different physical AZ. You can get the AZ IDs out of your account to de-randomize things so we can compare notes, but people rarely bother, for whatever reason.
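For reference, a small boto3 sketch of pulling that mapping out of an account (the region is just an example):

    import boto3

    # Print each AZ's account-specific name next to its stable AZ ID,
    # e.g. "us-west-1a => usw1-az1", so two accounts can compare notes.
    ec2 = boto3.client("ec2", region_name="us-west-1")
    for az in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(az["ZoneName"], "=>", az["ZoneId"])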


Amazon seems to have stopped randomizing them in newer regions. Another reason to move to us-east-2. ;-)


If you do a lot of VPC endpoints to clients/third parties, you learn the AZ IDs, or you just deploy to all AZ IDs in a region by default.


The eu-central-1 data center fire earlier this year purportedly affected just one AZ, but it took down the entire region for all intents and purposes.

Our SOP is to cut over to a second region the moment we see any AZ-level shenanigans. We've been burned too often.


Over two years I think we saw about two or three AZ issues, but only one that I would consider an outage.

Usually there were high network error rates, which were enough to make RDS Postgres fail over if it was in the impacted AZ.

The only real "outage" was DNS having extremely high error rates in a single us-east-1 AZ, to the point that most things there were barely working.

Lack of instance capacity, especially spot, and especially for the NVMe-backed types, was common for CI (it used ASGs for builder nodes). It was pretty common for a single AZ to run out of spot capacity for specific instance types, especially the NVMe ([a-z]#d) types.
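The usual mitigation, for what it's worth, is to let the ASG diversify across instance types and AZs rather than pinning one type. A hedged boto3 sketch (the ASG name, launch template, subnet IDs, and instance types are all hypothetical):

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Spread builder capacity across several NVMe-backed types and multiple
    # AZs (one subnet per AZ), letting the Spot allocator pick whichever
    # pools currently have capacity.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="ci-builders",
        MinSize=0,
        MaxSize=20,
        VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222,subnet-cccc3333",
        MixedInstancesPolicy={
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": "ci-builder",
                    "Version": "$Latest",
                },
                "Overrides": [
                    {"InstanceType": "m5d.xlarge"},
                    {"InstanceType": "m5ad.xlarge"},
                    {"InstanceType": "r5d.xlarge"},
                ],
            },
            "InstancesDistribution": {
                "OnDemandPercentageAboveBaseCapacity": 0,
                "SpotAllocationStrategy": "capacity-optimized",
            },
        },
    )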


The best bang for your buck isn’t deploying into multiple AZs, but relocating everything into almost any other region than us-east-1.

My system is latency- and downtime-tolerant, but I'm thinking I should move all my Kafka processing over to us-west-2.



