
> You need to be triply redundant on 3 availability zones, (3x) both with the RDS db cluster and app containers (2x) . And then have separate dev/staging/prod envs (3x). That's 18x.

Why would you care about the redundancy on staging / dev?

Just making up things to inflate AWS costs now.



If you're not testing AZ failovers you're probably just wasting money, like with untested backups... But it's true that most people aren't testing them, because they don't know that AZ failovers don't automatically Just Work(tm). The second downside, of course, is asymmetry between the envs and divergence in the IaC etc., resulting in more complexity and engineering work.

(Of course you're probably wasting money anyway, since business-wise you don't actually need better uptime than a single AZ gives you, and your complexity-induced human fumbles will cause far more outages anyway. But this has been a main selling point of the decision to go to AWS, so the requirement needs to be defended.)

Yes, you can build automation to have the redundant stuff up only some of the time, if you eat the engineering effort and complexity in your IaC and build automation... in the general vein of justifying engineering spend to offset AWS operating costs, where running containers is very expensive!

TLDR: either way you end up paying the very high markup on compute prices; it'll just be easier to excuse jumping through expensive hoops to "save money" on it.


If you’re building your own infrastructure in a data center then sure you absolutely want to test your redundancy.

But with AWS it's a checkbox. It's transparent to you and your applications. The infrastructure to host in multiple AZs is already in place. The only real issue with Multi-AZ is that RDS failover, depending on the database engine you use, can take seconds or tens of seconds.
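
For reference, the checkbox is just the MultiAZ flag on the RDS instance. Roughly, as a boto3 sketch (the instance name here is made up):

    import boto3

    rds = boto3.client("rds")

    # Turn an existing single-AZ instance into Multi-AZ; AWS provisions
    # and maintains a synchronous standby in another AZ for you.
    rds.modify_db_instance(
        DBInstanceIdentifier="app-db",   # hypothetical instance name
        MultiAZ=True,
        ApplyImmediately=True,           # otherwise waits for the maintenance window
    )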


IME it's not so in the real world: you'll have accidental state in your distributed system outside the DBs. You'll have some stuff that in practice always runs in one AZ, and your integration partner in another org has whitelisted only its IP. Etc. etc.; everything that is untested will find ways to conspire to rot. Especially if you haven't learned by seeing these bugs, so you can't avoid them.

Also, you won't have a clear experience and understanding of what happens in the failover, and you won't know to avoid failover-breaking mistakes in your VPC configs, security groups, frontend-backend shared state, etc. (And by "you" I mean "your dev team"; it's not enough that one guy gets it.)

Also^2, if you read the news about all the outages, it's very common for failover systems to fail in general, not just on AWS. The general engineering wisdom is: always test your failovers. And there's no substitute for testing it end to end, instead of individually testing each layer/module. (Bad: "we can skip testing db failover"; good: "let's test that the whole system works when there's an AZ failure".)


Dealing with this now for a client. Can't test the Redshift AZ relocation feature because there's no way to simulate an AZ failure. The only safe bet is full multi-region with a DNS switcheroo.


Back in colo days, I saw a lot of post-mortems that read “the thing we thought was redundant wasn’t”, leading me to call them “redundant like an appendix [rather than like a kidney]”.

We instituted quarterly "game day testing" where we forcibly turned off one of the redundant items in all of our systems. It took us about 6 such cycles before these tests stopped turning up outages that had just been waiting for us.

Thinking back on those, it’s hard for me to believe that most cloud hosted companies are prepared by checking a box without actually testing.



> Thinking back on those, it’s hard for me to believe that most cloud hosted companies are prepared by checking a box without actually testing.

We are talking about Multi-AZ, availability zones, not different regions. Setting up redundancy across regions is not easy, but for the majority of people using AWS, a single region with Multi-AZ is good enough.


It's hard to simulate an AZ failing, and when it does, half the internet goes offline, so people don't really complain that much.


About 5 or 6 years ago we had an alert in the middle of the night that our RDS instance died. It failed over in about 15 seconds (SQL Server, so it's a bit slow compared to PostgreSQL), but Multi-AZ worked as advertised. The downside is that AWS never told us why it occurred.


They don't tell you because you're not supposed to care, and there's no human involved in the process to do a post-mortem.

Something on the instance host died, most likely.


I’ve seen a few AWS instance hardware failures, they happen with some regularity. You can handle single instance failure without being multi AZ. Testing an actual AZ failure, as in the whole AZ going offline or getting partitioned from the other AZs, is pretty much impossible.


AZs are connected via normal, user-visible networks, so you can just break those. They even provide examples: https://github.com/awslabs/aws-well-architected-labs/tree/ma...

Those are basic (they don't cover flapping or glacial-speed slowdown degradation modes, cover only some services, etc.), but at least a starting point that can be extended.
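
If I remember those labs right, the AZ-failure one blackholes a zone's subnets with a deny-all network ACL. A rough boto3 sketch of the same idea (VPC and subnet IDs are placeholders, and for a real test you'd save the old associations so you can roll back):

    import boto3

    ec2 = boto3.client("ec2")
    VPC_ID = "vpc-0123456789abcdef0"           # placeholder
    AZ_SUBNETS = ["subnet-aaa", "subnet-bbb"]  # placeholder subnets in the AZ to "fail"

    # A freshly created custom NACL denies all traffic by default
    # (no allow rules added), so it works as a blackhole.
    acl_id = ec2.create_network_acl(VpcId=VPC_ID)["NetworkAcl"]["NetworkAclId"]

    # Swap it onto each subnet in the target AZ, cutting that AZ off
    # from the rest of the VPC and the internet.
    acls = ec2.describe_network_acls(
        Filters=[{"Name": "association.subnet-id", "Values": AZ_SUBNETS}]
    )["NetworkAcls"]
    for acl in acls:
        for assoc in acl["Associations"]:
            if assoc["SubnetId"] in AZ_SUBNETS:
                ec2.replace_network_acl_association(
                    AssociationId=assoc["NetworkAclAssociationId"],
                    NetworkAclId=acl_id,
                )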


Huh, didn't know about aws rds reboot-db-instance --force-failover
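
For anyone who'd rather script it, the boto3 equivalent looks roughly like this (instance name is made up), e.g. for a scheduled game-day job:

    import boto3

    rds = boto3.client("rds")

    # Reboot with a forced failover: promotes the Multi-AZ standby
    # instead of just restarting the current primary.
    rds.reboot_db_instance(
        DBInstanceIdentifier="app-db",  # hypothetical instance name
        ForceFailover=True,
    )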


>But with AWS it's a checkbox. It's transparent to you and your applications. The infrastructure to host in multiple AZs is already in place. The only real issue with Multi-AZ is that RDS failover, depending on the database engine you use, can take seconds or tens of seconds.

Have you actually seen this work on your project in practice? Like, a region goes down, another region picks up automatically, and everything keeps working just because you ticked a checkbox?


Multi-AZ means multiple availability zones, not multiple regions. Distribution over multiple regions is obviously harder than across different zones within the same region.


> But with AWS it’s a checkbox.

And I'd test the checkbox all the same. We learned just this week that one of our setups, which checks the cloud-provider provided box to have its VMs distributed across 3 AZs, is susceptible to the loss of a single AZ. Why? Because the resulting VMs … aren't actually distributed across 3 AZs as requested. (The provider has "reasons" for this, but they're dumb, IMO. It should have been as easy as checking the box.)


Are AZ failovers not done automatically? Where can I read more about that?


From the user's POV they're just separate networks where you can deploy separate copies of services, replicated DB cluster nodes and whatnot. Each service handles (knock on wood) an AZ becoming unreachable/slow/crazy independently. Which can become fun with service interdependencies and a mix of your self-implemented services plus AWS-provided ones.

There's a high-level description at https://aws.amazon.com/about-aws/global-infrastructure/regio...


Even if you don't want it, sometimes you're forced to run multiple AZs (e.g. EKS requires 2). But that 18x figure is nuts. VPCs, AZs, subnets, IAM, etc. are free; the cost comes from what you deploy into them. So separate environments don't have to be as expensive as production. You can scale them to zero, use smaller compute instances, run self-managed versions of expensive stuff (like DBs), or simply run small single instances of RDS instead of large redundant clusters. Non-prod environments are a great place to experiment with aggressive cost scale-downs while observing performance.
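
For example, a nightly job along these lines (boto3 sketch, resource names are made up) can park a staging environment when nobody is using it:

    import boto3

    rds = boto3.client("rds")
    asg = boto3.client("autoscaling")

    # Stop the non-prod database (RDS keeps it stopped for up to 7 days at a time)...
    rds.stop_db_instance(DBInstanceIdentifier="staging-db")   # placeholder name

    # ...and scale the app tier down to zero instances.
    asg.update_auto_scaling_group(
        AutoScalingGroupName="staging-app",   # placeholder name
        MinSize=0,
        DesiredCapacity=0,
    )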


Whilst the VPC itself is free, they get you on the NAT gateway (NATGW), which you probably need along with the VPC in most cases.


I had to use a NATGW temporarily with AWS Lambdas to access the database and make remote HTTP calls. But now it all works without the NATGW. Haven't had any other need for one.


IIRC, you want one per AZ, or you'll get billed for extra cross-AZ traffic when instances in zone B's private subnet use the NAT gateway in zone A.


You don't need a NATGW at all if you use an IPv6 egress-only gateway (which is free, I believe).
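
If your workloads can talk IPv6, the swap looks roughly like this (boto3 sketch, the VPC and route table IDs are placeholders):

    import boto3

    ec2 = boto3.client("ec2")

    # Egress-only internet gateway: outbound-only IPv6, no NAT gateway hourly fee.
    eigw = ec2.create_egress_only_internet_gateway(VpcId="vpc-0123456789abcdef0")
    eigw_id = eigw["EgressOnlyInternetGateway"]["EgressOnlyInternetGatewayId"]

    # Route all outbound IPv6 from the private subnets through it.
    ec2.create_route(
        RouteTableId="rtb-0123456789abcdef0",   # placeholder private route table
        DestinationIpv6CidrBlock="::/0",
        EgressOnlyInternetGatewayId=eigw_id,
    )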


> Why would you care about the redundancy on staging / dev?

If you deploy to multiple regions, then it would make no sense at all to have a single preprod stage, especially for running integration tests.

Also, keep in mind that professional apps support localization, and localization has a significant affinity with regional deployments. I mean, if you support Japanese and have a prod stage in Japan, would you think it was a good idea to run localization tests in a US deployment?


We currently have a deployment in Oregon and CloudFront servicing most of the world, with localisation. No need to deploy in Japan to support Japan. The only tricky one is China because of the firewall, though that hasn't been an issue for the last few years as they haven't been blocking CloudFront completely. (This has been running for 10 years in AWS without issue.)


> We currently have a deployment in Oregon and CloudFront servicing most of the world, with localisation.

So you're serving static assets through CloudFront with a single backing service. Congrats, you managed to have a service that doesn't offer regional services.

Also, you definitely don't support clients in China, or enjoy shooting yourself in the foot.

Most professional applications with a global deployment and paying customers don't have the benefit of uploading HTML and calling it done.


So 10+ years of supporting broadcasters and creative agencies around the world from one region is not supporting those countries. Got it. I mean, there are services in those regions for the tasks performed, but even then it doesn't require the level of testing you're assuming it does.


If you're deploying to multiple regions in AWS, then presumably you'd have to roll out the infrastructure for 3 regions yourself if you weren't? In which case I gotta assume using AWS is probably a lot more straightforward straight off the bat than rolling your own solution in three different data centers?



