I don't understand what the issue is. The client wants you to plan for disaster, and they aren't math oriented, so asking for 100% probability sounds reasonable. The engineer, as engineers are prone to do, remembered his first day of prob&stat 101, without considering that the client might not.
When they say this, they aren't thinking about nuclear winter, they are thinking about Fred dumping his coffee on the office server, a disk crashing, or an ISP going down.
Furthermore, you can accomplish this. With geographically distinct, independent, self-monitoring servers, you will have essentially no downtime. With 3 servers each operating at an independent[1] three-nines reliability, with good failover modes, your expected downtime is under a second per year [2]. Even if that downtime happens all at once, you are still within a reasonable SLA for web connections, and therefore the downtime practically does not exist.
The client still has to deal with doomsday scenarios, but Godzilla excluded, he will have a service that is "always" up.
[1] A server in LA is reasonably independent from the server in Boston, but yes, I understand that there is some intersection involving nuclear war, Chinese hackers crashing the power grid, etc. I don't think your client will be upset by this.
[2] DNS failover may add a few seconds. You are still in a scenario where the client has to retry a request once a year, which is, again, within a reasonable SLA, and not typically considered in the same vein as "downtime". With an application that automatically reroutes to an available node on failure, this can be unnoticeable.
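A quick back-of-the-napkin check of that number, as a sketch that assumes the three failures really are independent (the same assumption footnote [1] hedges):
per_server_unavailability = 0.001             # three nines per server
p_all_down = per_server_unavailability ** 3   # 1e-9, assuming independent failures
seconds_per_year = 365.25 * 24 * 3600
print(p_all_down * seconds_per_year)          # ~0.03 seconds of expected downtime per year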
Most clients don't understand the R2D2 talk. They understand money, features, bugs, and downtime. So I've always explained it like so:
Uptime beyond 95% costs lots and lots of money. Orders of magnitude more money. It requires redundant equipment, engineering all of the automatic failovers at every layer, lots of monitoring, and 24/7 technical staff to watch everything like a hawk. Not ... in ... your ... budget.
... or you could rest in the comfort of knowing that services like Twitter have achieved mammoth success despite long and embarrassing outages. I thought you might see it that way. Good choice.
First of all, uptime of 95% means 18 and a quarter entire days of downtime per year. That's horrendous. I wouldn't host my dog's website on a server with that kind of SLA - and I don't even have a dog.
Secondly, although Twitter got away with large helpings of downtime, that doesn't mean that every business type can. Twitter is not (or at least was not, for most of its existence) business-critical to anyone. If Twitter goes down, oh well. Shucks.
If you're running recently-featured-on-HN Stripe, however, where thousands or more other businesses depend on you keeping your servers up to make their money, I'd say even 10 minutes of downtime is unacceptable.
Finally, this doesn't have to cost a lot. Just find a host that offers the SLA you're looking for, and have a reasonably fast failover to another similar host somewhere else.
Then there's the definition of "uptime": host SLAs only cover network uptime and environment uptime, but clients consider "uptime" to mean application-level uptime, which includes downtime for server maintenance, steady-state backups, backup restores, deploying new releases, etc. ... anything other than the service being 100% fully functional = downtime in their minds.
Also, on costs: "reasonably fast failover to another similar host" implies live redundant equipment at another host, which doubles the hosting costs. That's a big pill to swallow, so big that most orgs would rather suffer the downtime when they see the real cost of full redundancy.
> First of all, uptime of 95% means 18 and a quarter entire days of downtime per year. That's horrendous.
It may be acceptable to some clients depending on what other provisions are part of the SLA (though probably not with as little as 95%).
I've seen a 98% SLA which was applied both annually (about 7 and a third days) and daily (2% of a day being about half an hour), with significant remuneration if the daily SLA was not kept as well as the annual one. If I remember rightly, maintenance windows counted against the SLA except in certain circumstances (specified in the contract).
Of course for many applications this would still be completely unacceptable, but for others it might be fine depending on the costs and the comeback if the SLA is broken.
I'm getting better than 95% up time on my home network. If you told me that and I was your client, I'd be going elsewhere.
Over 99% costs lots of money, yes. How much is dependent on how close to 100 you are looking to get, but that's the client's decision. 99% though is a perfectly acceptable standard.
Now, you are right that you get diminishing returns as you add more nines, but 95% is still in the area where there are a lot of cheap things you can do to increase uptime: RAID, a UPS, and even a low-end but business-grade connection should get you to around two nines (a rough calculation follows at the end of this comment).
I have a SLA of 99.5% (over a month) on a low-end setup[1] and it's fairly rare that I don't meet it, even including planned downtime and network outages due to DoS or mistakes of my upstream.
[1]I use RAID and mostly supermicro server grade hardware with ecc ram, but there is no failover across servers; I'm in a data center, though it's a low-end data center with a low-end bandwidth provider.
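As promised above, a rough series-availability sketch of why that combination lands around two nines. The component figures are illustrative guesses, not measurements:
connection = 0.995   # hypothetical business-grade uplink
power      = 0.9995  # hypothetical mains + UPS
server     = 0.9995  # hypothetical box with RAID and ECC
availability = connection * power * server   # series components multiply
print(round(availability, 4))                # ~0.994, i.e. roughly two nines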
Maybe it would be good to write the customer an email explaining stuff like the tiers below (rough per-year downtime figures for each tier follow the list):
99%=Well run server
99.9%=Multiple backups, will cover most hardware failures
99.99%=Top grade commercial
99.999%=What companies like Google or Yahoo can achieve
99.9999%=Hopefully the US strategic defense systems are this reliable
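For reference, here's the simple arithmetic converting each of those percentages into an annual downtime budget:
minutes_per_year = 365.25 * 24 * 60
for uptime in (0.95, 0.99, 0.999, 0.9999, 0.99999, 0.999999):
    allowed = (1 - uptime) * minutes_per_year
    print(f"{uptime:.4%} uptime allows about {allowed:,.1f} minutes of downtime per year")
# 99% is roughly 3.7 days a year; 99.999% is roughly 5.3 minutes.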
Also, your assumption is that the servers are entirely independent. That's a reasonable assumption in terms of fires, blackouts, and floods, but not for software problems. You really can't assume that unless the servers are all running entirely different software stacks on different operating systems.
> 99.999%=What companies like Google or Yahoo can achieve
Even worse. Five nines is five minutes of downtime per year. The core Google search experience blew five nines for half a decade with just one outage -- the one where they marked the entire Internet as a malware site, which took somewhere like 40 minutes to address.
This kind of thing makes me dismiss talk of nines as fetishism or sales-speak. You can say your system is going to have five nines of uptime at the application level. You're probably lying.
P.S. Pricing-wise, a client who wants > 99.5% either wants to pay mid-six figures (and up up up) or they want something which is deeply irrational for you to offer.
If we're talking about the agreement, SLAs that are better than 99.9% are quite common, and available even on low end products. The problem is that the SLA payout is usually "we refund you for the time you were down, if you ask for the refund" - heck, with that payout, I'd be happy to give you a 100% SLA on any product I sell. (of course, I'm not going to advertise as such; the sort of people who buy from me would find that disingenuous.)
That said, I think you are about right with 99.5% being about the best you can expect while spending a reasonable amount of money. (Especially for a static site, I think a few more tenths of a point are possible for less money than you think, but the cost curve goes parabolic sometime after 99.5%.)
99.999% is the standard for landline telephones, which I think is a pretty decent analogy here. Unless the customer has some crummy VoIP solution and doesn't trust their phones anymore :P.
I believe it's what they design for at the central switch nodes. The buildings would have entire rooms if not floors dedicated to nothing but 48 volt wet-cell batteries.
Of course the "last mile" infrastructure is not so reliable. That said, in over 20 years I've had POTS service, I can't ever remember not having dial tone when I lifted the handset.
But "three 9 reliability" is still not the same thing as 100%. The contractor has a right to be concerned about the 100% figure making it's way into a contract.
Um, things don't just "make their way" into contracts. Yes, they suddenly appear in drafts, but finals? Sorry, no. Finals require approvals and signatures from the people who are going to be on the hook. (The e-Bay lawyers who approved the Skype purchase may beg to differ, but they're hardly unbiased.)
The draft stage is where you take snorkel's exactly-right advice and declare that the difference between 99.999% and 100% is about a bajillion dollars, give or take. Go a step further, and sell that 0.001% hard by pointing out just how much they'd be prepared for ("multiple earthquakes plus a giant robot attack, all at once!"). They'll start rethinking fast - guaranteed.
And that's the essence of diplomacy; letting the other guy have it your way. If he thinks it's his idea, even better.
1 - .001^3 = .999999999, which works out to well under a second of expected downtime per year, which the client will never notice even with good monitoring tools, and therefore will never invoke the contract.
You're assuming independence to a level that does not exist. Consider a Y2K-style bug in the OS that could take down all servers for an extended period of time. Or someone could write a virus that uses a zero-day exploit, etc.
I am making no such assumption. See [1] in my original post; I already talked about the intersection. Feel free to add Y3K to the list of nuclear war, Chinese hackers, etc. The intersection is incredibly small, and not something that I am going to include in my back-of-the-napkin calculation.
For enough time and money dumped into code auditing and hiring smart people, no. Not that most companies should do that, but it is possible if you want to pay for it. Most companies (rightly) prioritize innovation, scalability and profit margins over absolute reliability.
How many of those top sites actually prioritized reliability? Is it even justifiable for their business models? I bet you can find a lot better reliability engineering in bank, credit, and stock systems. For example, when was the last time the Visa credit network crashed (as a whole, not localized outages)? Nasdaq?
Aiming at 100% uptime opens up all sorts of scope issues. Consider that 'failover' does technically cause a small amount of downtime as you restart the session. If you acknowledge that, it disqualifies most of the current model of fault tolerance from helping you achieve 100% uptime.
It's no different from any other system, really. Try designing a car that can drive 100% of the time, or a power grid that's up 100% of the time.
100% uptime is not an operational requirement – it's contractual. A client that demands 100% uptime isn't being unreasonable; they're looking for a contract remedy (most likely a termination right) if/when the site goes down.
1. "Uptime" is defined in many, many ways. In the OP's article, it's the definition of uptime that seems unreasonable. Normally, the demarc points for the network segments and equipment being measured for uptime are entirely within the provider's control. In the OP article's update, the client clarified that 100% uptime only applies when hosting is cut over to the provider's site – something they are (theoretically) capable of controlling.
2. Remedies for failing the uptime requirement are different for nearly every agreement. Often SLA credits are the exclusive remedy. Sometimes the customer has a termination right (either express, or through the termination for cause provision). The remedy is probably more important than the uptime percentage.
You'd be surprised how many big name web apps offer 100% uptime as a matter of contract, knowing that it's a near-impossible operational goal. It's a matter of taking on the risk of your customer leaving you or claiming SLA credits, or whatever remedies you agree upon.
EDIT: I've represented a whole lot of customers of web services over the years (IAAL). The big lesson in this area for me is: (1) customers rarely invoke SLA credits, preferring instead to "work it out" at the relationship level, and (2) most provider off-the-shelf SLAs are so full of holes and tricky thresholds that they are effectively useless. On this last point, beware the 100% or five nines or other unreasonably high uptime commitment. When you get into the details of the SLA (the demarcs, the qualifications for obtaining credits, the remedies for failure), you will almost always find that there is no realistic remedy at all.
N.B. Meant to edit my comment above, not self-respond.
Laypersons often misunderstand that there's a difference between promising 100.00000% and delivering 100.00000%.
The client needs to understand what their contractual remedy is when the promise falls short, and the method that will be used for evaluating the difference between the promise and the delivery.
It's always helpful when clients give unambiguous signs of unreasonable insanity upfront instead of hiding it until you're halfway through the project. It makes running away as far and fast as humanly possible so much easier.
There also was this comment on the SO thread: I would personally RUN from this client as fast as possible. I suspect this won't be the last crazy idea they may have (from a technology standpoint).
Why run though? They probably just don't understand what 100% means and it just takes you explaining it to them. Or simply state that you cannot meet that requirement and see if they still want you to bid on the project.
You've just quoted the reason why: it won't be the last crazy idea they may have.
That's pretty much an absolute certainty. Even if you can convince them with reasonable arguments to accept a few points less uptime, you're going to be having the same kind of discussion many times afterwards on different subjects.
You have to be really, really sure you want and need this kind of client. Most of the time (around 100%...) they are more trouble than they are worth.
All the posters are stuck on the fact that 100% availability is impossible. But why not instead try to learn from others who offer 100% availability, like Rackspace and SoftLayer? These (legitimate) providers know 100% availability is not possible, but they guarantee it anyway. How can they get away with this? Easy, they have a contractual SLA that indicates what their clients are entitled to when their network fails for any period of time. Further, neither is a low-cost provider, which allows them to engineer their systems to reduce incidences when clients will invoke the SLA.
Note that this doesn't mean that Rackspace is shady because they promise 100% knowing they can't deliver it. After all, they put their money where their mouth is! They have an incentive to actually achieve 100% uptime. I'm sure there are other applications where the target is 100% (not 5 nines) availability, especially in finance, medicine, and militaries.
My recommendation would be to take your engineering hat off, replace it with a business hat, and provide them with a series of price quotes for various uptime SLAs. And then make sure you're pricing high enough that when something goes down for any period of time that you can make good on your obligations under the SLA without losing too much sleep. Then let the client choose the SLA that matches their business needs and budget.
Another option not mentioned in the thread is to accept it, and pay any of the fines associated with not meeting it. This happens all the time in public tenders and contracts, where the fines are calculated into the business risk. It does mean that the organization issuing the contract needs to set the fines high enough to make that approach unfeasible.
I've definitely seen hosting services do this. Sure, there's a 100% SLA, but if you actually read it, it says you get back the pro-rated monthly fee for the time it was down. So, in other words, you don't have to pay for it when it isn't working. Not much of an SLA.
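To see how little a credit like that is worth, a quick hypothetical (the plan price and outage length below are made up):
monthly_fee    = 100.00   # hypothetical plan price in dollars
hours_down     = 4        # hypothetical outage
hours_in_month = 730
credit = monthly_fee * hours_down / hours_in_month
print(f"${credit:.2f}")   # about $0.55 back for four hours of downtime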
Public transport does this too: in order to provide a robust, perfect implementation of a schedule, you need extra buses/trains/streetcars (expensive capital goods) and extra man-hours (generally the most expensive part of the operation). Rather than invest too much money in making sure the schedule can always be met, it's cheaper to pay fines when there are delays.
I've actually had quite good experiences claiming ticket credits or total refunds for some long-distance UK train journeys. I can't find the exact terms, and they potentially vary per-operator, but it's around 50% refund for up to 30 minutes delay, and a full ticket cost refund if it's >60 minutes. A couple of times, I've had a trip delayed by 50-59 minutes. I suspect this isn't a coincidence.
Their craziness doesn't matter. Usually crazy customers aren't rich. So if you build to their craziness, you'll lose the customer.
You need to build an appropriate infrastructure that will win the bid, figure out what you can achieve (99.9%/99.99% uptime), and build in enough overhead to cover your SLA penalties. Or negotiate a monitoring methodology that is in your favor (i.e. exclude planned maintenance windows, use a monitoring threshold/interval that allows you to address issues before triggering contract "downtime", exclude external provider issues, etc.).
On a personal level, I think individual people who've become wealthy have a more reasonable outlook when it comes to stuff like this, but this doesn't apply to rich _companies_.
The least reasonable clients I've ever had were employees of large companies, and more specifically those who'd been recently empowered with the responsibility of the project I was working on and lacked the perspective required to realize that crazy doesn't help anyone.
It's not their money, but it's their decision, and that's a petri dish for crazy.
First we're going to have to get the governments of the world together to agree to remove all nuclear weapons. Second would be getting that asteroid tracking and deflection system working. Quantum physics does unfortunately predict that the earth might flick out of existence with some small probability, but by distributing the website across the universe we can reduce this probability arbitrarily (and numbers approaching p=0.9999.. are the same as p=1). The client is going to need to budget for this.
So 100% uptime is really difficult to achieve hardware-wise. Software-wise, you'll have to prove that there are no bugs in the system that might bring it down. That is much, much harder.
You can have 100 servers in 100 different countries and have the client automatically change to another server if the one they are connected to goes down. But if there is a software bug that crashes all your clients on start-up, or worse, crashes all your servers (think of what happened to Skype not long ago), you're still down.
Also, never underestimate bugs in hardware (the Pentium 1). You'll need multiple locations, multiple hardware, multiple operating systems, multiple compilers, multiple versions of the software... Standardizing on any one of these components may bring down your entire system!
Look at F5 Networks' Global Traffic Manager. It's really just a fancy DNS server. You set your TTL (time to live) down to just a few seconds and it monitors your main and standby sites. If one of the sites goes down, it changes your A records to point to the new site. It can even do load balancing across sites based on response time or number of connections.
They are expensive, but this is how large companies like Yahoo keep close to 100% uptime.
Explain to the customer that even with a hot site, the failover can take a few minutes. Also, some ISPs don't honor TTL and cache DNS queries for longer than they should. The Internet isn't perfect, and usually each extra 9 you add is around 4x more cost.
This is the kind of service I'd love to attack as a side project; it's fascinating. Though I'm sure someone out there reading this has something like this service, but affordable for startups?
There is a lot of room in this market for competition from open source projects. Really, the concept is so simple that it could be done with a shell script for simple failover:
(rough sketch; the IP and zone file paths are placeholders for whatever your setup uses)
MAIN_SITE_IP="198.51.100.10"   # placeholder address of the main site
if ! curl -sf --max-time 10 "http://$MAIN_SITE_IP/" > /dev/null; then
    # main site unreachable: swap in the alternate zone file and reload BIND
    cp /etc/bind/zones/db.failover /etc/bind/zones/db.example.com
    service named reload
fi
You get the idea... F5 Networks is really just a fancy DNS server running on a BSD-based OS on an x86 appliance. Zeus, which another commenter mentioned, has an AWS version and will let you run it on your own hardware if you like.
I'd love to see some open source competition for this space, or even low price competition.
>I'd love to see some open source competition for this space, or even low price competition.
Yes, that's exactly what I meant. Running our own failover system is not just expensive, but time-consuming - just another thing to do when you are trying to scale and time is already short within a small team.
Several DNS hosting services do this at varying costs. DurableDNS (which I founded but have sold) does it at a low cost. It's fairly trivial softwarewise as long as you have the redundant hardware, DCs, etc.
These claims are common for all the big CDNs and ISPs, but they're always accompanied by half a page of fine print that rids them of any liability when an outage happens and limits compensation to a microscopic penalty (usually a fraction of the monthly fee).
You can negotiate steeper penalties with funky multipliers, but they make you pay through the nose for such an arrangement (for obvious reasons).
Was it a promise of future reliability or a statement about historical reliability over some time frame? Because everything has been working 100% reliably until it fails...
Seems reasonable, actually. "100%" is obviously not going to be achievable, but "external users should be ok if our office network fails" is not necessarily a bad requirement. There are lots of things that may make this client happy: a VPN to an "internal" server in an external data center, synchronous replication, etc.
Look at it from a business, rather than engineering perspective. Forget the achievability of the 100% target for a moment -- what target can you realistically achieve? Then, what does the contract say the remedy is for breach? As long as the remedy is not huge and -- this is very important -- is clearly quantifiable, it may be just fine to enter into such an agreement.
You need the remedy to be clearly quantifiable (X dollars per Y minutes, for example) because otherwise you create an opportunity for dispute when the inevitable occurs and you breach. Resolving such a dispute could very well cost more than the remedy itself, even in the worst case.
From an ethical standpoint, I would only enter into such an agreement with an understanding that "while we agree that it makes sense for you to request that target, we think realistically that we'll be closer to 99.9% (or whatever you truly believe)". Entering into an agreement with a 100% uptime clause is different from setting an expectation that uptime will actually be 100%.
Helping your customers understand what they actually want to buy is part of selling, surely? Things are made trickier by PHBs in the client company claiming that everything is mission critical and that they can never ever have any downtime ever for any reason. Educating these people about, for example, just how flaky email and dns can be is important for your sanity.
Looks like the client is asking for off-site failover, not really 100% uptime, and the OP doesn't know how to achieve it over a WAN. Especially if this is a real enterprise customer, what they want is Disaster Recovery (DR).
This is a solved problem, albeit not a commonly known solution. Any of F5, Radware, and other expensive boxes can do this. This can also be done through DNS or with HA-Proxy etc.
Offer them a 100% up-time guarantee for a year if they also promise to avoid being sick for the entire next year. If they can't avoid succumbing to a virus, why would they expect your service to avoid it (or any other sort of bug)?
I haven't read any cases on this, but I imagine most courts would truncate/round down the achieved performance number. If the contract says 100%, that wouldn't allow for any downtime whatsoever.
It might be different for lower percentages - getting 89.8% performance where 90% is called for could be a de minimis breach and not actually count as breaking the contract. Definitely curious as to whether anyone has more to add on this.
They are interpreted the same way other commercial agreements are. They're also generally very lengthy and specify what they mean (i.e. what counts, what doesn't). They also set out what the consequences of breach are (maybe you want $, maybe you want something else, etc.).
There's no magic to writing "SLA" or something else. What you put in the contract is what you'll be held to...
100% uptime is unrealistic for big companies because, at scale, it costs a lot. For example, replication is expensive, with transmission, storage, and maintenance costs.
When the Amazon incident happened, I did an analysis and found that the cost roughly triples if data is stored in an external data centre, and is almost 6x if hosting overseas, even with the same company.
I then understood why companies like Reddit do not aim for the highest uptime possible beyond the data centre. How much the customer (or client) is willing to pay determines the uptime target (I think Reddit's target, for example, is at least 90%).
Let's say I wanted 100% reliable music listening. To do this, I buy a million of the original 30GB Zune media players, create a perfect failover system, so that if the sound from one of those stops for whatever reason (hardware, software, cosmic rays, etc), it'll switch to another one. I even move these Zunes all across the world, with AC provided, and multiple network links linking all of them, with satellite link backups between them.
Then December 31, 2008 rolls around, and a tiny firmware bug knocks out all of them simultaneously for 24 hours. Oops.
Would a heavyweight client with nothing but static data and no network at all reach 100% uptime, from a contractual point of view? Even a wrist watch does not exactly guarantee 100% uptime.
I can't post on serverfault, since the question's been locked, so I'll put useful things to consider here:
* 100% SLA doesn't always mean 'It has to be up all the time'. Depending on the customer or the supplier, it can mean 'We'll aim to have it up all the time, but if it's not, we'll pay you compensation according to a predefined scale'. Clearly, in this case, you need to define quite firmly what 'up' and 'down' mean, how you measure them, how you time them, and how you decide what compensation to pay.
* DNS failover or load balancing is often nearly good enough. It won't get you instantaneous failover, since you'll need to have a finite (albeit small) TTL, and some client stub resolver libraries cache stuff anyway in violation of the TTL. But it's an easy step on the way (a rough timing sketch follows this list).
* If you want true 100% uptime, ultimately, you need a single IP (or range of IPs) which will be permanently reachable. That pretty much means the IPs need to come from one AS number - in other words, one ISP or one company.
* You can choose an ISP or company which has multiple internet connections, peers with a lot of people in multiple locations, and has a well-designed network such that you feel confident they won't go offline. Amazon may be a good example, but they've had several recent high profile failures!
* You could do it yourself - in which case, you'd need to become an ISP, get your own AS number, and set up peering arrangements with multiple suppliers in multiple locations. This can be very costly, and you still have to run a network and servers yourself in a reliable way.
* You might be able to find a supplier who peers in multiple locations, and anycasts their protected IPs within their AS. That way, the same IP comes from multiple locations and should be reliable. Akamai might do something similar to this, I think.
* Ultimately, however you do it, you'll have a very difficult time making it impossible for it to fail. You're into the game of making it exponentially less and less likely that it'll fail, but you can't eliminate all risk. At the end of the day, your contract with your customer needs to define what happens if you should fail to reach 100% uptime. Is it breach of contract? Or do you need to pay a penalty fee? In either case, however you host it, you ideally need to make sure your suppliers compensation to you if they have a failure will cover the losses you incur.
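On the DNS failover timing mentioned above, a rough worst-case estimate (the interval and TTL values are just plausible examples, not recommendations):
health_check_interval = 30   # seconds between probes of the primary
failed_checks_needed  = 2    # consecutive failures before flipping the record
dns_ttl               = 60   # seconds resolvers are told to cache the old A record
worst_case = health_check_interval * failed_checks_needed + dns_ttl
print(f"~{worst_case} seconds before most clients reach the standby site")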
Many clients ask for the following without knowing better:
- 100% uptime
- Zero defects
- Zero scope changes
- Zero perceptible latency in all cases
It is up to the professional specialist to educate the client in terms they understand. Only after explaining in terms they understand can you call them unreasonable. If the client understands and is still unreasonable, the true professional has the obligation to walk away.
Various DNS caching architectures deployed by ISPs will sometimes strip additional addresses, or choose one for you themselves. That means if you're unlucky, that selected address will be down. Even if they do handle DNS correctly, hitting a dead server and failing over to the next one (possibly also dead) can take some time and may be interpreted as service not working.
Unless you control the whole path to the customer and their client application, DNS is not a solution for either HA or LB. (Found out after trying to come up with something close to 100% for a VoIP network - LB with 2 DNS entries on the public internet will give you a split close to 70-30.)
Another question is how is the downtime interpreted? If you're halfway through some transaction / flow and the site you're talking to goes down, should your flow (shopping basket for example) be available on the failover site?
You can do this, but I suspect they might not like the cost quote of $10 trillion per year, and the 10 years of lead time it will take to build a worldwide network of secure underground bunkers with their own air, water, food, and energy supplies, plus a hardened shadow internet that duplicates the function of the existing internet.
(Embarrassed) Yes, of course it is. And that's coming from someone who presumes to know operator precedence and despises programmers who use extra parens for no good reason.
This is not a technical issue, this is a contractual issue. You want to sign a contract that pays reasonable amounts for reasonable (far less than 100%) reliability for every minute of downtime.
Then you have the impetus to minimize (but not eliminate) those minutes to a reasonable level.
And then the client gives you a cheque that says "infinity" on it. They have fulfilled their end of the deal, now you must. What, your bank won't accept that cheque? Not our problem, get back to work.
I see several possible approaches, if you really want to have that client.
The easiest would be to just talk to them, try to find out what that "100%" is actually REALLY all about, and make them understand that from a technical point of view, 100% will add a lot of things to the project budget. A "100%" demand in a smaller project for a typical small-to-medium business will likely mean something different than "100%" in a project for the NYSE. So, talk to the customer, find out what it is actually all about, and then plan and quote according to their actual needs. This makes it more of a requirements-engineering type of problem, not necessarily a hacker problem.
Or you just say "yes, of course" and tell them how super reliable the system is and then let the guys in legal work it out in the fine print and cover your ass.. but don't expect much happiness and continued business from that client then once they find out what's going on.
But, in a more honest approach, maybe this is actually all they really want and need? Maybe it actually is enough for them to have someone to blame and pay some penalties for violating SLAs. Again, you need to find that out.
Not a typical hacker-hacker-problem but surely an issue a hacker would typically encounter, even on a daily basis, and should learn to deal with.