larrycatinspace's comments

larrycatinspace · on Aug 13, 2011

I'm thinking AWS needs to implement the Availability Zones: AZ-ChaosMonkey and AZ-ChaosApe. That would give them a dedicated playground for breaking things, where they can start to observe how this complex system reacts to simple failures and gaps in assumptions.

harshaw · on Aug 13, 2011

Sure. Presumably Amazon has a test lab that replicates multiple zones :) Perhaps your point is that Amazon should make this test lab public so people can contribute to the QA effort?

IIRC many of these datacenter failures start with a utility company power outage followed by a failure of the secondary power systems (I'm thinking of some past failures at SoftLayer and other providers). I wonder if it is prohibitively expensive to do a real life system test on a big data center (or prohibitively expensive once the data center is online). For example, how often do they turn off one of the mains (unexpectedly) to see what happens with the backup system?

gwern · on Aug 13, 2011

It's hard to believe that they wouldn't test the mains by doing just that.

I visited the NY LaGuardia TRACON recently - built perhaps 40 or 50 years ago? - and saw the generator & battery room, where they turn off the utility power every few months just to see whether things are working. So it's not exactly a new idea or an idea that other life-&-death-mission-critical operations don't dare use.

byoung2 · on Aug 13, 2011

> IIRC many of these datacenter failures start with a utility company power outage followed by a failure of the secondary power systems

That happened at Rackspace a few years back: http://techcrunch.com/2009/06/30/what-went-down-at-rackspace...

I have an account with GoGrid, and they do regular testing of their backup generators. I'm not sure if they throw the switch on the mains, though.

saetaes · on Aug 13, 2011

Monthly generator testing is, and should be, standard for any data center. Same with the UPSes - monthly testing to make sure they can handle the load long enough for the generators to kick in. Throwing the switch on the mains is probably not happening anywhere on a regular basis, though. There may be "routine" events (some sort of electrical infrastructure upgrade) that cause the data center to be put onto generator power, but throwing the mains just to test is a very risky endeavor, and one that a data center provider with very high power availability guarantees and expensive penalties is not likely to undertake.

jwatte · on Aug 13, 2011

Why would throwing the switch be risky? It's supposed to be HA. If it doesn't work, that's a bug, and you fix it! Just like backups are not backups until they have been restored (we verify this by making our data warehouse depend on the backup) and hot standbys aren't standbys until switched in (we do this to databases regularly.) Netflix apparently has a chaos generator that randomly kills machines as a standard process. If you're supposed to deal with failure, make sure you're dealing with failure regularly!
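
For illustration, the "randomly kill machines as a standard process" idea can be sketched in a few lines. This is only a minimal sketch, not Netflix's actual tool: the function name `kill_random_instance`, the instance names, and the `terminate` callback are all hypothetical stand-ins for whatever a real fleet API would provide.

```python
import random


def kill_random_instance(instances, terminate, rng=None):
    """Pick one instance at random, terminate it, and return its id.

    `instances` is a list of instance ids and `terminate` is a callable
    taking one id. Both are hypothetical placeholders, not a real cloud API.
    Returns None when there is nothing to kill.
    """
    if not instances:
        return None
    rng = rng or random.Random()
    victim = rng.choice(instances)
    terminate(victim)
    return victim


# Usage with an in-memory stand-in for a fleet of machines.
fleet = ["web-1", "web-2", "web-3", "db-1"]
killed = []  # records what the fake terminate callback was asked to kill
victim = kill_random_instance(fleet, killed.append, rng=random.Random(0))
survivors = [name for name in fleet if name != victim]
print(f"killed {victim}; survivors: {survivors}")
```

Passing the random source in as a parameter keeps a run reproducible, which helps when a failure needs to be replayed afterwards.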

Kadin · on Aug 14, 2011

> Netflix apparently has a chaos generator that randomly kills machines as a standard process.

This sounds pretty neat, but a quick Google didn't turn up any information about it besides this post. Do you know of anywhere to get more information on what they're doing? It sounds like a sensible idea, although I can only imagine trying to implement it would be ... challenging, for most companies/organizations.

dreww · on Aug 14, 2011

Check out item no. 3 on this list, which is AWS lessons-learned: http://techblog.netflix.com/2010/12/5-lessons-weve-learned-u...

see also: http://techblog.netflix.com/2011/04/lessons-netflix-learned-...

and: http://techblog.netflix.com/2011/07/netflix-simian-army.html for the other simian themed services they've developed for care and feeding of their AWS stuff.

cperciva · on Aug 13, 2011

> I wonder if it is prohibitively expensive to do a real life system test on a big data center

It's probably prohibitively dangerous. Backup power systems don't have many-nines of reliability; generators which are reliable enough for the once-a-decade event when a car crash knocks out your utility power aren't anywhere near the reliability needed to run your datacentre for an hour every month as a test.

emaste · on Aug 13, 2011

Actually, if you don't test-run your generator regularly it's very unlikely to work when you do need it.

Here's a doc from Cummins, a generator manufacturer: http://www.cumminspower.com/www/literature/technicalpapers/P...

It claims that the generator should be run for 30 minutes every month, loaded to at least one third of the rated capacity. So testing every month is exactly what you want to do.
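
As a rough sketch of that rule of thumb (30 minutes a month, at least one third of rated capacity), a test log entry could be checked like this. The function name and the kW figures are made up for illustration; only the two thresholds come from the comment above.

```python
MIN_MINUTES = 30  # minimum run time per monthly test, per the comment above


def meets_exercise_guideline(minutes_run, load_kw, rated_kw):
    """Return True if a test run lasted long enough under enough load.

    The load check is written as load_kw * 3 >= rated_kw so that
    "at least one third of rated capacity" avoids floating-point rounding.
    """
    if rated_kw <= 0:
        raise ValueError("rated_kw must be positive")
    return minutes_run >= MIN_MINUTES and load_kw * 3 >= rated_kw


# Hypothetical 1500 kW generator:
print(meets_exercise_guideline(30, 500, 1500))  # long enough, exactly 1/3 load
print(meets_exercise_guideline(30, 400, 1500))  # long enough, load too light
print(meets_exercise_guideline(20, 800, 1500))  # enough load, run too short
```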

rdl · on Aug 13, 2011

Right, but the thing you don't test is the transfer switch/sync gear.

Powering up the generator and dumping the output as heat weekly is pretty standard practice.

ams6110 · on Aug 13, 2011

Also don't forget to check the fuel tanks. With the rise in fuel prices the past couple of years, theft of diesel from backup generators has become more common.

ams6110 · on Aug 13, 2011

On the other hand, just as with database backups, making them is only half of the story. You have to test restores/recovery. Does your plan actually work? What have you overlooked? What edge cases do you need to accommodate?

Many data centers will test backup power generation regularly just for this reason. It's not unheard of at all and the risk of a problem at a planned time is worth the confidence in knowing that the system is more likely to work when needed at an unexpected time.
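
The "test restores, not just backups" point can be illustrated with a small sketch: copy a file to a backup location, restore it somewhere else, and compare checksums. The helper names and paths here are hypothetical, and a real restore test would of course exercise the actual backup tooling rather than a plain file copy.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def backup_and_verify(source, backup_dir):
    """Back up `source`, restore it to a scratch directory, compare hashes."""
    backup_path = Path(backup_dir) / Path(source).name
    shutil.copy2(source, backup_path)  # the "backup" step
    with tempfile.TemporaryDirectory() as restore_dir:
        restored = Path(restore_dir) / backup_path.name
        shutil.copy2(backup_path, restored)  # the "restore" step
        return sha256_of(restored) == sha256_of(source)


# Usage against throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as work:
    src = Path(work) / "data.txt"
    src.write_text("important records\n")
    backups = Path(work) / "backups"
    backups.mkdir()
    ok = backup_and_verify(src, backups)
    print("restore verified" if ok else "restore FAILED")
```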

Game_Ender · on Aug 13, 2011

In general, active engines (and fuel) don't store well. They are full of lots of seals and fluids that need to be exercised periodically to function correctly. As someone else posted, generator manufacturers recommend running them once a month to keep them in good working order. The same is true of a car: leave it parked in one place for too long, and you are going to have trouble starting or driving it.

spartango · on Aug 13, 2011

This is in the pipe.

larrycatinspace · on March 21, 2011

I just got my 2011 15" - opened it up to put a new SSD in it and found that the heatsink/fan combo off to the back corner was missing all the screws that would have secured it. I bet this doesn't help. I have yet to run into any issues but haven't done anything too stressful with it.

Anyone else with issues going to pop theirs open?

larrycatinspace · on Sept 17, 2010

Does anyone know if this includes any of the Unix source or copyrights that Novell had as well?

munchhausen · on Sept 17, 2010

I don't think it does - it is just the SuSE venture that is being sold to VMware. I would imagine that the copyrights are held by the "rest of Novell".

nailer · on Sept 17, 2010

Would they keep the Unix-like part of the business together?