Hacker News

Was a really helpful read actually.

Currently our dependencies on AWS are:

[-] Route 53

[-] Small EC2 Instance

[-] Standard RDS

[-] S3/CF etc...

We are still in total stealth mode and I'm about to set up the actual architecture for our platform next week.

I was thinking about keeping everything in a single region. I was going to have:

[x] Elastic Load Balancer

[x] Declare one zone as our main zone (this can be 1A for example). Have one medium instance in zone-1A

[x] Have my RDS in zone-1A

[x] Set up another instance in zone-2A, and then let the ELB distribute the requests...

The only problem is that, clearly, I'm running both MySQL and an app instance in zone-1A, so zone-1A is my single point of failure. I would be able to survive an outage in 2A, though.

How should I handle the data locality?




You're doing the best you can with one MySQL server.

One thing you must do is keep a MySQL data dump, as recent as you can afford (at least one per day, more often if you can; note that dumps can have nontrivial performance impacts), in an accessible location. (The easy thing to use is S3; though you're still vulnerable to Amazon outages, my empirical observation over my last three years as an AWS devops guy is that S3 rarely goes down, even when EBS or EC2 is freaking out. But the paranoid person has backups outside Amazon as well.)
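A dry-run sketch of that nightly dump-to-S3 job. The database name, bucket, and paths are illustrative assumptions, not anything from this thread; with `DRY_RUN=1` (the default) it only prints what it would do, so you can sanity-check it before wiring it into cron.

```shell
#!/bin/sh
# Nightly MySQL dump shipped to S3 -- names below are assumptions.
set -eu

DB_NAME="app_production"            # hypothetical database name
S3_BUCKET="s3://example-db-backups" # hypothetical bucket
STAMP=$(date -u +%Y%m%dT%H%M%SZ)
DUMP_KEY="mysql-${DB_NAME}-${STAMP}.sql.gz"

# DRY_RUN=1 only prints the commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

# --single-transaction gives a consistent InnoDB snapshot without locking tables.
run sh -c "mysqldump --single-transaction ${DB_NAME} | gzip > /tmp/${DUMP_KEY}"
run aws s3 cp "/tmp/${DUMP_KEY}" "${S3_BUCKET}/${DUMP_KEY}"
echo "dump key: ${DUMP_KEY}"
```

The timestamped key means old dumps are never overwritten, which is what you want when a bad deploy corrupts data and the most recent dump is already poisoned.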

Then, in an emergency, boot another MySQL server in a different zone (or even region), recover the DB, and go.
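The recovery half, sketched the same way (bucket and key names are assumptions; in a real emergency you'd pass the newest dump key you have):

```shell
#!/bin/sh
# Emergency restore on a freshly booted MySQL box -- a sketch; names are
# assumptions, and DRY_RUN=1 only prints the commands.
set -eu

S3_BUCKET="s3://example-db-backups"
DUMP_KEY="${1:-mysql-app_production-latest.sql.gz}" # pass the newest dump key
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

run aws s3 cp "${S3_BUCKET}/${DUMP_KEY}" /tmp/restore.sql.gz
run sh -c "gunzip -c /tmp/restore.sql.gz | mysql app_production"
# After this, repoint the app tier's DB host at the new instance.
```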

There are lots of problems with this plan. One is that it is wicked slow at the best of times: your downtime will be measured in minutes or hours. The second is that, when AWS is having an outage, it very often becomes difficult or outright impossible to do anything with its API, especially launching new machines. (My hypothesis is that this is often due to the surge of thousands of people just like you, all trying to launch new machines to handle the outage.) So, again, downtime could be hours. But hours is better than days or decades.

For actual HA you must run two MySQL servers at all times, one in each of two availability zones, one of which has a slightly older copy of the data. To make "slightly older" as short as possible, most folks master-master replicate the data between the servers. But you must not write to both DB servers at the same time, so one machine will still be "in charge" of writes at any given moment, and you'll have to have a scheme for swapping that "in charge of writes" status over to the other DB when the first one fails. (I'd suggest the "human logs in with SSH and runs a script" method to start, on the assumption that you don't need HA on the time scale of thirty seconds rather than thirty minutes.)
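The "human logs in with SSH and runs a script" failover can be as small as this sketch: promote the standby, then move an Elastic IP so the app tier follows it. The hostname, EIP, and instance id are all hypothetical, and `DRY_RUN=1` only prints the commands.

```shell
#!/bin/sh
# Manual failover: promote the standby and swing an Elastic IP over to it.
set -eu

STANDBY_HOST="db2.internal"  # hypothetical standby in the other AZ
ELASTIC_IP="203.0.113.10"    # hypothetical EIP the app connects to
STANDBY_INSTANCE="i-0abc123" # hypothetical EC2 instance id of the standby
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

# 1. Stop replicating from the dead master and open the standby for writes.
run mysql -h "$STANDBY_HOST" -e "STOP SLAVE; SET GLOBAL read_only = OFF;"
# 2. Move the Elastic IP so clients reconnect to the promoted server.
run aws ec2 associate-address --public-ip "$ELASTIC_IP" --instance-id "$STANDBY_INSTANCE"
```

The important property is that the old master must never accept writes again without operator intervention; keeping `read_only = ON` everywhere except the current master is the cheap way to enforce that.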

There are several other bits of plumbing involved with replication: Setting up something like Nagios to alert you when replication breaks, learning how to rebuild replication from scratch when one server dies, et cetera. You'll want to check out Percona Toolkit. Or, though I haven't used it yet, you should read up on Percona XtraDB Cluster, which I think does all of the above and comes with the option to buy support from Percona, which has smart folks:

http://www.percona.com/software/percona-xtradb-cluster/
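Most of the monitoring side boils down to parsing `SHOW SLAVE STATUS`. Here's a minimal lag check of the sort a Nagios plugin wraps; the 30-second threshold is an assumption, and it runs against a canned sample so the script works anywhere (in production you'd pipe in live `mysql` output instead):

```shell
#!/bin/sh
# Minimal replication-lag check: extract Seconds_Behind_Master from
# `SHOW SLAVE STATUS\G` output and compare it to a threshold.
set -eu

lag_from_status() {
  awk -F': ' '/Seconds_Behind_Master/ { print $2 }'
}

# In production you would pipe in live output, e.g.:
#   mysql -h db2.internal -e 'SHOW SLAVE STATUS\G' | lag_from_status
# Here a canned sample keeps the script runnable anywhere:
sample="             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 3"
lag=$(printf '%s\n' "$sample" | lag_from_status)

if [ "$lag" -le 30 ]; then
  echo "OK: replication lag ${lag}s"
else
  echo "CRITICAL: replication lag ${lag}s"
fi
```

One caveat worth knowing: `Seconds_Behind_Master` reads `NULL` when replication is broken entirely, so a real check also has to treat a non-numeric value as critical, not as zero lag.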

The next stage, if you've got money, really love setting up new tools, and laugh defiantly in the face of latency, is to try master-master replication across AWS regions using a product like Continuent Tungsten: http://continuent.com/downloads/software . But I'm not sure what it costs. Probably more than you want to spend at this point in your product lifecycle.



