
Systems engineers, software engineers, architects, whatever. We're all in the same gang.

My point is that the problem in this case is likely the system's design, not one engineer's typing abilities.




This comes down to operational philosophy, in the end. The point you're dancing around is whether the system should permit grave actions that make no sense at design time.

By the rules, every single system on a commercial aircraft has a circuit breaker. Pilots make the "what if X catches on fire?" case, which is actually pretty compelling. However, that also means there are several switches overhead that will ostensibly crash the airplane if pulled. Pilots lobby very strongly for the aircraft not to fight them in any way because they are the only ones with the data, in the moment, now. They have final command over the aircraft in every way.

I use this to point out that as you're designing systems for operations people -- something we're increasingly doing ourselves as devops/SRE takes hold -- you might think you can anticipate every scenario and design suitable safeguards into the system. However, sometimes, when Halley's Comet refracts some moonlight into swamp gas and takes your fleet down, you as an operator have to do some really crazy shit. It's in that moment, when all hell has broken loose, I'm at the helm, and based on the data available to me I have made a decision to shoot the system in the head: if the system fights me and prolongs an outage because we argued about whether we'd ever need to reboot a fleet all at once, I'm replacing the system as the first item in my postmortem. If you make me walk row to row flipping PDUs, we're going to have words.

That's just my philosophy. Give the operators the knives and let them cut themselves, trusting that you've hired smart people and understanding mistakes will happen. Your philosophy may vary. By all means, ask me to confirm. Ask me for a physical key, even. But if you ever prevent me from doing what I know must be done, you are in my way. I have yet to meet a system that is smarter than an operator when the shit hits the fan (especially when the shit hits the fan).
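To make that concrete, here's a minimal sketch of the "ask me to confirm, but never stop me" idea. Everything in it (the reboot_fleet tool, the host names, the typed-back confirmation) is my own illustration, not anything from this incident:

    # Sketch of a "confirm, don't prevent" gate around a destructive fleet action.
    # The tool spells out the blast radius and demands a deliberate, typed-back
    # confirmation, but it never refuses an operator who insists.
    import sys

    def confirm_destructive(action: str, blast_radius: int) -> bool:
        """Make the operator type the action name back before proceeding."""
        print(f"About to run '{action}' against {blast_radius} hosts.")
        typed = input(f"Type '{action}' to proceed, anything else to abort: ")
        return typed.strip() == action

    def reboot_fleet(hosts: list[str]) -> None:
        if not confirm_destructive("reboot-fleet", len(hosts)):
            print("Aborted by operator.")
            sys.exit(1)
        for host in hosts:
            print(f"rebooting {host}")  # stand-in for the real reboot (ssh, IPMI, API)

    if __name__ == "__main__":
        reboot_fleet(["web01", "web02", "db01"])

The physical-key version is the same shape: add friction and make the scope unmistakable, but leave the final decision with the operator.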

There's probably a broader term for an operational philosophy like this.


...and the operations version of that is that all normal operations are performed under restricted permissions that cannot "do anything", while the full "do anything" permissions are only broken out during a major crisis.

Such an approach would have prevented this incident, where "normal" operations were being performed and all of the servers were accidentally rebooted at once.
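Rough sketch of what that split looks like in tooling (the role names and the single-host rule are hypothetical, just to illustrate): day-to-day credentials can only act on one host at a time, and anything fleet-wide requires explicitly assuming a break-glass role first.

    # Hypothetical sketch of "normal ops can't do everything" permissions.
    # Everyday sessions may act on a single host; fleet-wide actions require
    # explicitly assuming a break-glass role during a declared incident.
    from dataclasses import dataclass

    @dataclass
    class Session:
        user: str
        break_glass: bool = False  # only set during a declared major incident

    def reboot(session: Session, hosts: list[str]) -> None:
        if len(hosts) > 1 and not session.break_glass:
            raise PermissionError(
                f"{session.user} tried to reboot {len(hosts)} hosts without break-glass")
        for host in hosts:
            print(f"rebooting {host}")  # stand-in for the real reboot call

    ops = Session(user="oncall")
    reboot(ops, ["web01"])                    # normal single-host operation: allowed
    try:
        reboot(ops, ["web01", "web02"])       # fleet-wide without break-glass: refused
    except PermissionError as e:
        print("refused:", e)
    reboot(Session("oncall", break_glass=True), ["web01", "web02"])  # incident mode

A fat-fingered wildcard under the everyday role then fails loudly instead of rebooting the world, while the break-glass path preserves the "operator has final say" property from upthread.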


I tend to agree with you, with the caveat that you can't have this philosophy and sell your customers 99.999% uptime[0].

[0] http://www.joyent.com/products/compute-service/features/linu...


I disagree wholeheartedly. Your operational philosophy complements your SLA goals; it doesn't dictate them.


I can't figure out how your "understanding mistakes will happen" philosophy is compatible with 99.999% uptime.

I'm of the opinion that 99.999% for an individual instance isn't particularly achievable in a commodity hosting environment. That kind of uptime doesn't leave much room for the mistakes that you and I both anticipate.

I do think that 99.999% is doable for the system as a whole, if it's properly distributed across multiple geographically dispersed datacenters.
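Back-of-envelope numbers (mine, not Joyent's) for why: five nines leaves roughly five minutes of downtime a year for a single instance, while independent replicas multiply their unavailability together, so the whole system can hit five nines out of far less reliable parts.

    # Back-of-envelope availability arithmetic; the 99.9% per-instance figure
    # is an assumption for illustration, not a measured or quoted number.
    minutes_per_year = 365.25 * 24 * 60

    for nines in ("99.9", "99.99", "99.999"):
        budget = (1 - float(nines) / 100) * minutes_per_year
        print(f"{nines}% uptime allows {budget:.1f} minutes of downtime per year")
    # 99.999% works out to about 5.3 minutes a year: no room for mistakes.

    instance = 0.999  # assumed per-instance availability
    for n in (1, 2, 3):
        system = 1 - (1 - instance) ** n  # failures assumed independent
        print(f"{n} replica(s): {system:.6%} whole-system availability")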

I think Joyent has gone wrong in promoting individual instance reliability.


Always ask how numbers like that are computed.


They're not. That's a statement of what customers have enjoyed up until now. The actual SLA simply states what refund you get for each hour of downtime.



