Each Availability Zone is being rebooted on a different day. Best practices dict...

eropple · on Sept 25, 2014

One thing I'd add: best practices (IMO) dictate HA clusters as you describe, but you get a big boost to survivability by deciding on only using instance stores. Network issues have screwed EBS in the past; EBS is technically neat but very network-sensitive and it's possible to "lose" part of your EBS volume because part of the network goes away (and then your instance faceplants). Instance stores are your friend, and acutely knowing they can disappear in an eyeblink will make you design a better system. One that can survive you having your instances forcibly retired by AWS. :-)

toomuchtodo · on Sept 25, 2014

You use instance stores for persistent data. That instance disappears. Where are you restoring that data from? Either your backups are stale, or you were replicating the data or its underlying filesystem, which means you're still reliant on the network.

eropple · on Sept 25, 2014

Why would you be restoring data? The other instances in your high-availability datastore should have sufficient redundancy to keep you alive until a replacement can be spun up and brought back up to speed.

If you aren't using a high-availability datastore, I would suggest that you have not sufficiently sussed out how AWS works and probably shouldn't be using it until you do.

michaelt · on Sept 25, 2014

I'd be interested to know more about this as I've been curious for a while about how people do this stuff.

No matter how many instances you have, surely you'll still be hosed if they all go down at the same time? Or if there's rolling downtime taking out instances faster than you can bring the restarted instances up to speed?

So if you're replicated across three availability zones you're not truly prepared for any instance to go down at any time - you're only prepared for two thirds of your instances to go down at a time?

eropple · on Sept 25, 2014

There are lots of ways to set it up. I should note first that most interesting datastores you'll run in the cloud will end up needing instance stores for performance reasons anyway--you want sequential read perf, you know?--and so this is really just extending it to other nodes that, if you're writing twelve-factor apps, should pop back up without a hitch anyway. (If you're not writing twelve-factor apps...why not?)

Straight failover, with 1:1 mirroring on all nodes? You're massively degraded, unless you've significantly overprovisioned in the happy case, but you have all your stuff. Amazon will (once it unscrews itself from the thrash) start spinning up replacement machines in healthy AZs to replace the dead machines, and if you've done it right they can come up and rejoin the cluster, getting synced back up. (Building that part, auto-scaling groups and replacing dead instances, is probably the hardest part of this whole thing, even with a provisioner like Chef or Puppet.) If you're using a quorum for leader election or you're replicating shard data, being in three AZs actually only protects you from a single AZ interruption. Amazon has lost (or partially lost, I wasn't doing AWS at the time so I'm a little fuzzy) two AZs simultaneously before, and so if you're that sensitive to the failure case you want five AZs (quorum/sharding of 3, so you can lose two). I generally go with three, because in my estimation the likelihood of two AZs going down is low enough that I'm willing to roll the dice, but reasonable people can totally differ there.

If Amazon goes down, yes, you're hosed, and you need to restore from your last S3 backup. But while that is possible, I consider that to be the least likely case (though you should have DR procedures for bringing it back, and you should test them). You have to figure out your acceptable level of risk; for mine, "total failure" is a low enough likelihood, and the rest of the Internet likely to be so completely boned that I should have time to come back online.

Thing is, EBS is not a panacea for any of this; I am pretty sure that a fast rolling bounce would leave most people not named Netflix dead on the ground and not really any better off for recovery than somebody who has to restore a backup.

toomuchtodo · on Sept 25, 2014

> (Building that part, auto-scaling groups and replacing dead instances, is probably the hardest part of this whole thing, even with a provisioner like Chef or Puppet.)

Not so much with Zookeeper, Eureka, or etcd.

eropple · on Sept 26, 2014

There are totally ways to do it, but it involves a good bit of work. I like Archaius for feeding into Zookeeper for configs (though to make it work with Play, as I have a notion to do, I have a bunch of work ahead of me...).

toomuchtodo · on Sept 25, 2014

>So if you're replicated across three availability zones you're not truly prepared for any instance to go down at any time - you're only prepared for two thirds of your instances to go down at a time?

Correct.

omni · on Sept 25, 2014

That does seem to be the case, that is a relief.