The hosting wars are just starting. There are so many features no one is offering right now. E.g., I want to host my app and data in my own data center, but have automatic failover to a cloud provider.
A few issues/questions with that:
1) Local traffic: if this were an enterprise app hosted on-premises, with cloud backup, but with some remote users, you'd need to be careful with network configuration. Behind firewalls, internal users also often reach local apps on different IPs than outside users do. This may not apply, and it certainly can be resolved, but it would require the provider to have some visibility into your network configuration.
2) Failover: To do this really easily as a service, you'd ideally run 100% of traffic through a third party service. Some of it would get directed into EC2/cloud provider of choice, and some into your colo datacenter. If you're willing to use something like CloudFlare for DDoS/etc. prevention now, this would be the same compromise.
3) Database: dealing with all the replication issues. If it's static content, this is trivial. Otherwise, you have a consistency problem for databases, etc., and most cloud providers (especially Amazon) want to lock you into weird proprietary data solutions which don't replicate cleanly to anything outside their cloud.
The best system today is probably to host at a facility with great transport to Amazon Direct Connect nodes (e.g. SV1/SV5 in Silicon Valley -- I'm setting this up now), so you've got fast cheap ways to keep your databases replicated between AWS and the free world/your own colo.
For inbound traffic, for $20/mo, I'd just do Cloudflare; at the higher end, you'd have a lot of choices to make. (I haven't tried the higher-end Cloudflare offerings yet; they do seem to address most of the shortcomings, and are still pretty cheap.)
I really hate how confident tech people are about shit they've never done and/or just read about on blogs. I understand that apparent confidence is a currency, but it's not the only one.
I don't think you need to be 100% precise on every hn comment.
You can do DNS-based failover (short TTL), manually or automatically, and it's the easiest to set up using entirely your own infrastructure. This works great if you can tolerate a variable-length outage for customers, or where you're not using it to deal with outages but rather just to migrate load -- say provider A suddenly gets expensive: you can migrate away, but don't need to hard-kill provider A, at least until any DNS cache has expired.
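As a concrete illustration (not a drop-in solution), here's a minimal health-check-and-flip sketch. It assumes dnspython, an authoritative nameserver that accepts RFC 2136 dynamic updates, and made-up zone/key/addresses:

    # Minimal DNS failover sketch: if the primary stops answering, repoint
    # "www" at the cloud IP via an RFC 2136 dynamic update. All names, keys,
    # and addresses below are placeholders.
    import socket
    import dns.query
    import dns.tsigkeyring
    import dns.update

    PRIMARY    = "198.51.100.10"   # own-colo address
    FAILOVER   = "203.0.113.10"    # cloud instance
    NAMESERVER = "198.51.100.53"   # authoritative server allowing dynamic updates

    def healthy(ip, port=443, timeout=3):
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return True
        except OSError:
            return False

    def point_www_at(ip):
        keyring = dns.tsigkeyring.from_text({"failover-key.": "c2VjcmV0Cg=="})
        update = dns.update.Update("example.com", keyring=keyring)
        update.replace("www", 60, "A", ip)   # short TTL keeps the switch fast
        dns.query.tcp(update, NAMESERVER)

    if not healthy(PRIMARY):
        point_www_at(FAILOVER)

In practice you'd run that from cron or your monitoring system, and you still eat whatever stale-cache window the TTL (and misbehaving resolvers) gives you.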
You can do IP-based failover (various techniques -- anycast, which doesn't really work for most apps; making your own announcements of the same netblock; IP address failover below the BGP level/internal to a network; ARP stealing on a subnet, which isn't useful across providers but is good for HA; etc.).
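For the last of those (ARP-level VIP takeover inside one subnet), a rough sketch of what the standby box does at takeover time, assuming scapy, root, and placeholder VIP/interface values -- the box still has to actually add the VIP to its interface separately:

    # Gratuitous ARP sketch: after taking over the shared virtual IP, broadcast
    # "VIP is-at <my MAC>" so switches and neighbors update their ARP caches.
    from scapy.all import ARP, Ether, get_if_hwaddr, sendp

    VIP = "192.0.2.50"    # placeholder shared virtual IP
    IFACE = "eth0"        # placeholder interface

    def claim_vip(vip, iface):
        mac = get_if_hwaddr(iface)
        garp = Ether(src=mac, dst="ff:ff:ff:ff:ff:ff") / ARP(
            op=2,                     # ARP reply ("is-at")
            hwsrc=mac, psrc=vip,
            hwdst="ff:ff:ff:ff:ff:ff", pdst=vip,
        )
        sendp(garp, iface=iface, count=3, verbose=False)

    claim_vip(VIP, IFACE)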
You can use a smart proxy in front of your app (an F5 "Global Load Balancer", something you've developed yourself, nginx with minimal state, or an inexpensive service like Cloudflare or their 1000x more expensive Prolexic competition).
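If you go the roll-your-own route, the core of such a proxy is small; a toy sketch (backend addresses made up, GET-only, no header passthrough) just to show the failover logic:

    # Toy failover reverse proxy: try backends in order, serve the first one
    # that answers, return 502 if none do. Addresses are placeholders.
    import urllib.error
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    BACKENDS = ["http://10.0.0.10:8080", "http://203.0.113.20:8080"]  # colo, cloud

    class FailoverProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            for backend in BACKENDS:
                try:
                    with urllib.request.urlopen(backend + self.path, timeout=2) as resp:
                        body = resp.read()
                    self.send_response(resp.status)
                    self.send_header("Content-Length", str(len(body)))
                    self.end_headers()
                    self.wfile.write(body)
                    return
                except (urllib.error.URLError, OSError):
                    continue  # backend down; try the next one
            self.send_error(502, "All backends unreachable")

    HTTPServer(("0.0.0.0", 8000), FailoverProxy).serve_forever()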
You can do the best thing for non-web apps, a smart client, which knows to go down a list of servers (randomly?) and find the closest or best one. More intelligence in the client = more better.
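A minimal version of that smart client, assuming plain TCP reachability as the health signal and placeholder hostnames:

    # Smart-client sketch: shuffle the server list, probe each, and pick the
    # one with the lowest connect time. Hostnames are placeholders.
    import random
    import socket
    import time

    SERVERS = ["app1.example.com", "app2.example.com", "app3.example.com"]

    def pick_server(servers, port=443, timeout=2.0):
        candidates = list(servers)
        random.shuffle(candidates)            # spreads load across healthy servers
        best, best_rtt = None, None
        for host in candidates:
            start = time.monotonic()
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    rtt = time.monotonic() - start
            except OSError:
                continue                      # unreachable; skip it
            if best_rtt is None or rtt < best_rtt:
                best, best_rtt = host, rtt
        return best                           # None means everything looks down

    print(pick_server(SERVERS))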
I've set up all of these except anycast (which I'd actually love to do sometime, but RIPE jacked my /24) and Prolexic (because I don't want to spend $30-100k/mo). Which is best really depends, but IMO at least having a plan (even if it takes a week) to switch hosting providers is worthwhile for everyone.
His first example is round robin DNS. Sorry, but the terms DNS failover and round robin are often used interchangeably when you're dealing with business continuity.
Yes, certainly there are other options that increase the complexity, but why not start there? While the impact on users with shitty ISPs or behind proxies is unfortunate, it's relatively easy to implement and low-cost.
Going beyond that increases the complexity and cost exponentially and is certainly not easy.
Yeah, strictly speaking RR is just returning a set of records (>1, ideally) to select from each time. Being able to remove entries based on outages (which really needs a short TTL) is an optimization.
Unfortunately some stupid resolvers cache a single answer set for a long time, but for some applications you're willing to accept that 1/x of attempts fail during an outage (just hit reload, or come back in a bit), since this costs ~nothing to implement.
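That client-side retry is basically just walking the whole answer set; a rough sketch (dnspython, placeholder hostname):

    # Round-robin retry sketch: resolve the full A record set and try each
    # address until one accepts a connection. Hostname is a placeholder.
    import socket
    import dns.resolver

    def connect_round_robin(name, port=443, timeout=3):
        answers = dns.resolver.resolve(name, "A")   # the whole RR set, not one record
        for rdata in answers:
            try:
                return socket.create_connection((rdata.address, port), timeout=timeout)
            except OSError:
                continue                            # dead entry; fall through to the next
        raise OSError("no reachable A record for " + name)

    sock = connect_round_robin("www.example.com")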
The basic concept works great for NS, MX, and other protocols where they're designed to retry.