I'm thinking AWS needs to implement two new Availability Zones: AZ-ChaosMonkey and AZ-ChaosApe. That would give them a dedicated playground for breaking things, where they can start to observe how this complex system reacts to simple failures and gaps in assumptions.
Sure. Presumably Amazon has a test lab that replicates multiple zones :) Perhaps your point is that Amazon should make this test lab public so people can contribute to the QA effort?
IIRC many of these datacenter failures start with a utility company power outage followed by a failure of the secondary power systems (I'm thinking of some past failures at SoftLayer and other providers). I wonder if it is prohibitively expensive to do a real life system test on a big data center (or prohibitively expensive once the data center is online). For example, how often do they turn off one of the mains (unexpectedly) to see what happens with the backup system?
It's hard to believe that they wouldn't test the mains by doing just that.
I visited the NY LaGuardia TRACON recently - built perhaps 40 or 50 years ago? - and saw the generator & battery room, where they turn off the utility power every few months just to see whether things are working. So it's not exactly a new idea or an idea that other life-&-death-mission-critical operations don't dare use.
Monthly generator testing is, and should be, standard for any data center. Same with the UPSes: monthly testing to make sure they can handle the load long enough for the generators to kick in. Throwing the switch on the mains is probably not happening anywhere on a regular basis, though. There may be "routine" events (some sort of electrical infrastructure upgrade) that cause the data center to be put onto generator power, but throwing the mains just to test is a very risky endeavor, and one that a data center provider with very high power availability guarantees and expensive penalties is not likely to undertake.
Why would throwing the switch be risky? It's supposed to be HA. If it doesn't work, that's a bug, and you fix it! Just like backups are not backups until they have been restored (we verify this by making our data warehouse depend on the backup) and hot standbys aren't standbys until switched in (we do this to databases regularly.)
Netflix apparently has a chaos generator that randomly kills machines as a standard process. If you're supposed to deal with failure, make sure you're dealing with failure regularly!
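For anyone wondering what a "randomly kill machines" process could look like, here is a minimal sketch of the idea. It is not Netflix's implementation, and every name in it (`run_chaos_round`, `kill_probability`, `min_survivors`, the `fleet` dict) is made up for illustration. The fleet is just a dict in memory; a real tool would call whatever API actually terminates instances.

```python
import random


def run_chaos_round(fleet, kill_probability, rng, min_survivors=1):
    """Randomly mark healthy instances as down and return the names that were killed.

    fleet maps a hypothetical instance name to True (healthy) or False (down).
    At least min_survivors healthy instances are always left running.
    """
    healthy = [name for name, up in fleet.items() if up]
    # Each healthy instance is independently selected with the given probability.
    victims = [name for name in healthy if rng.random() < kill_probability]
    # Cap the number of kills so the service never drops below min_survivors.
    max_kills = max(0, len(healthy) - min_survivors)
    victims = victims[:max_kills]
    for name in victims:
        fleet[name] = False  # stand-in for a real "terminate instance" call
    return victims


if __name__ == "__main__":
    fleet = {f"web-{i}": True for i in range(5)}
    killed = run_chaos_round(fleet, kill_probability=0.3, rng=random.Random())
    print("killed this round:", killed)
```

The `min_survivors` cap is one possible safety valve. The point of running something like this on a schedule is that recovery paths get exercised all the time rather than only during a real outage.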
> Netflix apparently has a chaos generator that randomly kills machines as a standard process.
This sounds pretty neat, but a quick Google didn't turn up any information about it besides this post. Do you know of anywhere to get more information on what they're doing? It sounds like a sensible idea, although I can only imagine trying to implement it would be ... challenging, for most companies/organizations.
> I wonder if it is prohibitively expensive to do a real life system test on a big data center
It's probably prohibitively dangerous. Backup power systems don't have many-nines of reliability; generators which are reliable enough for the once-a-decade event when a car crash knocks out your utility power aren't anywhere near the reliability needed to run your datacentre for an hour every month as a test.
It claims that the generator should be run for 30 minutes every month, loaded to at least one third of the rated capacity. So testing every month is exactly what you want to do.
Also don't forget to check the fuel tanks. With the rise in fuel prices the past couple of years, theft of diesel from backup generators has become more common.
On the other hand, just as with database backups, making them is only half of the story. You have to test restores/recovery. Does your plan actually work? What have you overlooked? What edge cases do you need to accommodate?
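To make the "test your restores" point concrete, here is a rough sketch of an automated check, assuming the backup is a single file and that "restore" is a plain copy. The helper names `sha256_of` and `backup_is_restorable` are made up for this example; a real database restore would replace the copy step with the actual restore procedure.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def backup_is_restorable(original, backup):
    """Restore the backup into a scratch directory and compare it with the original."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / "restored"
        # Stand-in for a real restore; here it is just a file copy.
        shutil.copyfile(backup, restored)
        return sha256_of(restored) == sha256_of(original)
```

A check like this only proves that the bytes come back intact. It says nothing about how long a real restore takes or whether the restored system actually starts up, which is where the overlooked edge cases tend to be.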
Many data centers will test backup power generation regularly just for this reason. It's not unheard of at all and the risk of a problem at a planned time is worth the confidence in knowing that the system is more likely to work when needed at an unexpected time.
In general, engines (and fuel) don't store well. They are full of seals and fluids that need to be exercised periodically to function correctly. As someone else posted, generator manufacturers recommend running them once a month to keep them in good working order. The same is true of a car: leave it parked in one place for too long, and you are going to have trouble starting or driving it.
I just got my 2011 15" - opened it up to put a new SSD in it and found that the heatsink/fan combo off to the back corner was missing all the screws that would have secured it. I bet this doesn't help. I have yet to run into any issues but haven't done anything too stressful with it.
I don't think it does - it is just the SuSE venture that is being sold to VMware. I would imagine that the copyrights are held by the "rest of Novell".