We established multiple datacenters across different providers with active-active DNS, so no single provider or datacenter failure could take us down. It's not really for the faint of heart, though. We spent a fair amount of time getting the right services to communicate correctly across DCs.
Sure. This is a really quick off-the-cuff summary, so I'm gonna say some things that are loose, out of date, or maybe flat-out wrong. Comments welcome. :)
The ideal situation is that both datacenters can handle your total load, so that when one fails, the other doesn't explode under the thundering herd of traffic rerouted its way. You need to plan your systems so they're elastic under load: response times may rise within acceptable limits, but you won't see outright failures.
You use a DNS failover service to hand each client the appropriate records. There are various issues around caching and preferred A records--for instance, some nameservers or DNS clients will sort the A records and always pick the first one, which can send all your traffic to one datacenter. Typically you hand out different combinations of A records depending on locality, so clients are hitting, say, the two datacenters closest to them.
When a DC fails, you remove the DNS entries pointing to that datacenter's IPs, and lookups start returning only the known-good ones. Clients which already have your multiple A records can detect the failure and fail over immediately. Where the client software doesn't support that, it has to wait for its cached records to expire before it picks up the new ones.
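To make that concrete, here's a rough sketch of what the records look like (the name, IPs, and TTL are placeholders, not our real setup--a managed GeoDNS service generates per-locality answers rather than serving a static file like this):

    ; www gets one A record per serving DC; pull a record when that DC dies.
    ; Short TTLs keep the window where clients cache a dead IP small.
    $TTL 60
    www.example.com.  60  IN  A  203.0.113.10   ; edge box in DC1
    www.example.com.  60  IN  A  198.51.100.10  ; edge box in DC2

Clients that got both records and handle multiple A records properly retry against the second IP when the first stops answering; everyone else waits out the 60 seconds.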
The datacenters themselves need to contain enough of your infrastructure to function autonomously, but they also need to share state. Cassandra, Oracle, Riak MDC... there are lots of options out there. We were on MySQL at the time, and maintained a slave in the secondary DC which could be promoted in the event that the primary DC was, say, nuked from orbit. This system was not partition-tolerant; if the MySQL link between datacenters failed, one DC would become functionally read-only. We proxied DB traffic back and forth over SSH tunnels managed by upstart init jobs, which was shockingly reliable. We actually started off using MySQL's SSL support, but as it turns out MySQL will segfault if it gets more than, say, 8 SSL connections in a short timeframe. So we tunneled everything--Redis, MySQL, stats--over SSH.
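The tunnel jobs were roughly this shape (hostnames, ports, and the tunnel user here are invented for the example; the real jobs had more to them):

    # /etc/init/dc2-mysql-tunnel.conf
    description "SSH tunnel to the DC2 MySQL box"

    start on (local-filesystems and net-device-up IFACE!=lo)
    stop on runlevel [016]

    # If the link drops, upstart restarts the tunnel for us.
    respawn
    respawn limit 10 60

    # -N: no remote command, just port forwarding. ExitOnForwardFailure makes
    # ssh exit (and get respawned) instead of sitting there with a dead
    # forward; ServerAliveInterval notices a silently dropped link.
    exec /usr/bin/ssh -N -o ExitOnForwardFailure=yes -o ServerAliveInterval=15 -L 3307:127.0.0.1:3306 tunnel@db1.dc2.example.com

Replication and anything else that needed the other DC's MySQL then connects to 127.0.0.1:3307 locally instead of reaching across the WAN directly.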
The rest of the infrastructure had little shared state, so we ran the typical web stack: two identical boxes running nginx (static content) -> haproxy (load balancing) -> Rails and Ramaze apps spread across various boxes. Each nginx forwarded to both haproxies, and both haproxies forwarded to all the app servers, so you could lose either machine in a given DC and service would keep running. We used heartbeat to manage a shared virtual IP between the two forwarding boxes; you'd drop TCP connections on failover, but the switch itself generally took tens of milliseconds--however long it took to ifup the interface and gratuitous-arp the rack's L3 switch.
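The haproxy side was nothing exotic--something along these lines, with made-up addresses, ports, and health-check path:

    # haproxy.cfg (sketch)
    defaults
        mode http
        timeout connect 5s
        timeout client  30s
        timeout server  30s

    frontend www
        bind 10.0.0.5:8080          # nginx on both boxes forwards here
        default_backend app_servers

    backend app_servers
        balance roundrobin
        option httpchk GET /up      # made-up health check path
        server app1 10.0.0.11:8000 check
        server app2 10.0.0.12:8000 check
        server app3 10.0.0.13:8000 check

The heartbeat-managed virtual IP just decides which of the two boxes currently owns the public address; the haproxy config is identical on both.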
We ran memcache independently in both DCs--since user sessions almost never switched between DCs, it was OK for us to just have two distinct pools. Queues were split up as well. Some services weren't critical enough to split across DCs, so we just accepted that if the primary DC died they'd be down for a few hours until we could deploy another copy in the backup DC--non-critical things like statistics, garbage collection, etc. Automated deployment made that a lot less painful.
I wouldn't recommend doing this at an early stage--dual environments, especially on different hardware, take a lot of testing to get right. You have to worry about doing everything twice--two DNS zones, two Redis clusters, etc. You also have to worry about asymmetries if you're doing master-slave replication. All of this comes with operational and development overhead: your app needs to be aware that it might be running in a partitioned state, that writes might take much longer than reads if you're doing master->slave across DCs, and so on. I'm a strong believer in planning for that stage of your growth--but you always have to strike a balance between the ultimate reliable configuration and getting other things done.
I have been using MySQL's master-master scheme for a long while for failover, though my databases are relatively simple. The master-master thing is nice because one server uses even values on auto-increment fields and the other uses odd values--so there's no chance of collisions if all of your tables are designed with an auto-increment id field.
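Concretely, it's just two standard MySQL variables set differently on each box (everything else here is generic):

    # my.cnf on server A
    [mysqld]
    auto_increment_increment = 2   # step by 2
    auto_increment_offset    = 1   # generates 1, 3, 5, ...

    # my.cnf on server B
    [mysqld]
    auto_increment_increment = 2
    auto_increment_offset    = 2   # generates 2, 4, 6, ...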
That's a great way to handle it. In our case we had an, er, extensive legacy schema to preserve. Moral of the story: plan to scale early. You don't have to actually build that scaling infrastructure... but keep its requirements in mind.
I looked into round robin dns recently (for performance reasons, not for reliability), but decided not to go there just yet.
I'm curious - I keep hearing that some clients don't support this. Which clients are those, in particular? Is it a real issue, or a rather esoteric case?
Honestly I never found a browser that didn't support multiple A record failover--the holdouts were mostly older versions of IE. You do need to be aware that many nameservers will reorder A records, either by integer sort or by the delta to their own IP, which can make your traffic pattern uneven. There are various managed DNS products to handle that, and you can build it yourself with enough time.
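If you want to see what a particular resolver is actually handing back (example.com is a placeholder), something like this, repeated against a few different resolvers, shows whether the ordering is stable or getting sorted:

    # system resolver
    dig +short www.example.com A
    # a specific resolver, e.g. your ISP's or a public one
    dig +short www.example.com A @8.8.8.8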
I agree. That was really interesting. I might not have understood half of it, but enough to get the general gist and it's really cool to read about advanced solutions like this.
And indeed you might want to consider reposting that explanation on your or your company's blog--just for showing off ;-)
I have no idea what you're talking about, but I find it very interesting. I think it would be well received if you wrote a blog post expanding on this experience.