Joyent us-east-1 rebooted due to operator error

bcantrill · on May 27, 2014

It should go without saying that we're mortified by this. While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter. As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are/will be making to both the software and to operational procedures to assure that this doesn't happen in the future (and that the recovery is smoother for failure modes of similar scope).

mixologic · on May 27, 2014

I feel bad for the person who made the mistake. Even though its obviously a systemic problem, and highly unlikely to be an act of negligence, Im sure he/she doesnt feel too hot right now.

jsmthrowaway · on May 27, 2014

It's operations. You fuck up, you suck it up, you fix it, then (and this is the important part) you prevent it from ever happening again. Feeling like shit for bringing something down is a good way to give yourself depression, given how often you will screw the pooch with root. In the same vein, anybody who says they'd fire the operator without any qualification on that remark should be given a wide berth.

People tend to forget that "fixing it" isn't just technical, it involves process, too. Every new hire that whines about change control and downtime windows would be the first to suggest them, were they troubleshooting the outage that demonstrated the need.

myrandomcomment · on May 27, 2014

Back in the day we used to say there are 2 types of network engineers, those that have dropped a backbone and those that will drop a backbone.

stephengillie · on May 28, 2014

Your management is failing your newer engineers, if this is still the case.

macavity23 · on May 28, 2014

Nonsense. Someone has to be operating at the sharp end of the enable prompt, and sooner or later it'll be 0330 and that person will type Ethernet0 when they meant Ethernet1, whatever management you have in place.

When that happens, you do just what Joyent did here: you send out an embarrassed email to customers, everyone else in the ops team gets a few cheap laughs at the miscreant's expense, you have a meeting about it, discuss lessons learned, and you move on.

Everyone screws up. Everything goes down once in a while. This is why you build in redundancy at every level.

LnxPrgr3 · on May 28, 2014

I've seen generally brilliant people be bit by bad process. The worst example was an important hard drive being wiped thanks to a lack of labeling, obviously taking a production server down with it.

Other things that have caused outages: lack of power capacity planning, unplugging an unrelated test server from the network (go go gadget BGP), cascading backup power failure, building maintenance taking down AC units, expensive equipment caching ARP replies indefinitely… the list goes on.

I had my own fun fuckup too. I learned SQL on PostgreSQL, and had to fix a problem with logged data in a MySQL database. Not trusting myself, I typed "BEGIN;" to enter a transaction, ran my update, and queried the table to check my results. I noticed my update did more than I expected, so I entered "ROLLBACK;" only to learn that MyISAM tables don't actually implement transactions.

Thankfully, in this case it turned out to be possible to undo the damage, but talk about a heart-stopping moment!

Shit happens. You deal with it, then do what you can to keep it from happening again. I've learned to respect early morning change windows as a way to limit damage caused by mistakes.

socceroos · on May 27, 2014

My thoughts exactly. Poor fella.

I've seen worse though. A newish officer spilled his morning coffee into the circuitry of a device worth over 10 zeros. Immediately short circuited.

themodelplumber · on May 27, 2014

Wow. Did this gold-plated B-2 bomber still fly after the coffee incident?

jsmthrowaway · on May 27, 2014

You know you could build five space shuttles with ten zeros, right? Are we talking dollars or Yen?

dsberkholz · on May 28, 2014

Must be counting the two after the decimal point. =)

stephengillie · on May 28, 2014

Why are you using a throwaway account? Ohh, I just saw the "dollars or Yen" remark. TIL we use throwaway accounts for the times we feel like being assholes, so the non-elites can't track it back to our physical neuroprocessors.

jsmthrowaway · on May 28, 2014

I'm not sure why I'm dignifying this Reddit drivel with a response, but my karma and account age should be your hint that you're barking up the wrong tree.

stephengillie · on May 28, 2014

"check my previous responses and my credit score for how you should treat me" ohh what an old-man response. It's too bad the Imgur "downvote everything they ever posted" script doesn't work here on HN, now isn't it?

My account's karma and history exceed your account's on this site, and even worse, this individual comment bears more value than yours! Ooh burn!

disillusioned · on May 28, 2014

Haha, this is a pretty spectacular amount of cognitive dissonance you're demonstrating here.

Let's post-mortem this lunacy:

1) You misinterpret jsmthrowaway's initial comment as vaguely racist (or something), notice the word "throwaway" in his user name and get really excited that you can stand on your high horse and call him out for hiding behind the shield of internet anonymity when he wants to be a (you think) racist idiot. Even though his comment is a legitimate currency conversion remark. See: https://www.google.com/search?q=1+usd+in+yen

2) He explains that clearly you're mistaken (which, really, seriously, what a kneejerk response from someone who just wanted to show how clever they were, calling out an "asshole") and further explains his account isn't even a "throwaway" in the traditional, trolly sense of the word, citing his account age and the fact that he regularly actively posts to the account.

3) You, caught perhaps in a moment of clarity, though I think I give you too much credit here, realize you were too eager to pounce on the "asshole" for his "Yen" remark, and perhaps you misread it. Your latent erection fading, you counter by explaining that your history and karma are even _more_ impressive, somehow completely avoiding taking responsibility for a completely nonsensical leap in logic and accusation of wrongdoing, while doubling down on your cognitive dissonance.

4) The Aristocrats!

Gigablah · on May 28, 2014

It seems he has a bit of misplaced sensitivity when it comes to the Japanese: https://news.ycombinator.com/item?id=7555232

stephengillie · on May 28, 2014

I was wildly drunk when posting this. I didn't remember doing so until I checked HN just now.

chris_wot · on May 29, 2014

Uh, you just called him out for using a throwaway account, now age and experience means nothing? Go away, troll!

pdeva1 · on May 27, 2014

if something worth over 10 zeroes can be destroyed with a coffee spill, i would say it had it coming

mpyne · on May 28, 2014

As a one-time newish officer who used to be in charge of things with many zeroes, I'd be inclined to agree.

malchow · on May 27, 2014

How many non-zero integers were in the price of the device, though? My computer is also worth more than 10 zeros.

stephengillie · on May 28, 2014

Please keep in mind that "price" means "how many dollars other humans are willing to trade for it right now"; not necessarily any concrete evaluation of the device's functionality compared to a human competitor or human operator...

stephengillie · on May 28, 2014

DOWNVOTE FOR TRUTH

linker3000 · on May 27, 2014

...then there was the new server room that was built with one of the 'big red buttons' conveniently placed behind the pull cord for the lights.

Why, yes...a couple of times...before a perspex arch was less-than-hastily fixed over the button..

xeroxmalf · on May 27, 2014

Seems like not allowing food or drink near a device worth over 10 zeroes would be a no-brainer, but hindsight is tricky like that.

tedsanders · on May 27, 2014

That sounds so awful. I can't imagine living the rest of my life knowing that I had been a net negative in the world. All of my life's earnings would just be a partial restitution of that one second of destruction.

JetSpiegel · on May 27, 2014

If you have an EXTREMELY reductive point of view, that equates revenue with human worth.

thund · on May 27, 2014

supposedly 0.0000000001 billions of $

shit happens, design for the worst.

stephengillie · on May 28, 2014

"If you reach for a star, and come up with a handful of mud..."

ddorian43 · on May 27, 2014

really? what can cost that much ?

defen · on May 27, 2014

The USS Gerald Ford cost 12.8 billion to construct + 4.7 billion in R&D ... I think we would have heard if it had been destroyed by a cup of coffee.

nsxwolf · on May 27, 2014

... And be utterly destroyed by a single cup of coffee?

kisielk · on May 27, 2014

A quantum computer.

akerl_ · on May 27, 2014

As a request: It looks like each time the status page is updated, the old UPDATE: <words> is removed. For the future, it would be great if the older updates were preserved so that people looking back could understand the chain of events, rather than just seeing the first / last pieces.

lukasm · on May 27, 2014

Mandatory DevOps Borat

"To make error is human. To propagate error to all server in automatic way is #devops"

and my fav "Law of Murphy for devops: if thing can able go wrong, is mean is already wrong but you not have Nagios alert of it yet."

alrs · on May 27, 2014

Joyent's messaging about "we're cloud, but with perfect uptime" was always broken.

It's mildly gross that the current messaging sounds like they're throwing a sysadmin under the bus. If fat fingers can down a data center, that's an engineering problem.

I care about an object store that never loses data and an API that always has an answer for me, even if it's saying things that I don't want to hear.

99.999 sounds stuck-in-the-90s.

evan_ · on May 27, 2014

> sounds like they're throwing a sysadmin under the bus

at least they didn't name the operator in question...

elijahwright · on May 27, 2014

Our internal culture is such that everyone on the team would rather be blamed for something than accuse someone else of doing it. That's shitty, and not something you do to someone. You fix the problem and then you move on.

If it makes you happy, blame me - I don't mind.

sokoloff · on May 27, 2014

At my $DAYJOB, we are always careful to figure out exactly what happened, including by whom. It's not to assign personal blame, but I believe it's critical that everyone agrees on the facts (who, what, when, where, and [if possible] why).

Response and conversation is always focused on "how do we prevent this in the future?", not on punishing whoever was involved in the past.

IOW, I agree with I believe is your intent, but differ on the implementation. Blameless transparency is the term we use (and we probably stole that from somewhere else).

It's a very powerful signal to the whole team when you first see individuals "admitting" to exactly what they did, how it caused or contributed to the outage, and to hear them thanked for their contribution of understanding in the post-mortem.

Senior leadership (including myself, who originally instituted the entire process a decade ago) is very clear that we want to know the facts and that in seeking and using those facts, we're only focused on the future, no matter how boneheaded the individual actions appear with the benefit of hindsight and knowledge that they'd lead (in)directly to an outage. I run operations and also participate in the promotion discussions for all technologists, and in 11 years, I've never heard a negative shadow cast onto a sysadmin/sysengineer from their actions during or leading to a production outage. And we've (collectively) made our fair share of mistakes over the years. That doesn't stop good employees from feeling bad about it, but that's a personal feeling they have, not from the fear of it being a professional black mark.

M2Ys4U · on May 28, 2014

I think there's a difference in how you approach this with an internal-facing view and an external one.

Internally, You're right. But externally the company fucked up, not the individual.

sokoloff · on May 28, 2014

100% agree, and it is my oversight to not draw that distinction more clearly. We have the luxury (so far) of only reporting internally.

hack_edu · on May 27, 2014

BTW, this is the right way to do it. :)

niels_olson · on May 27, 2014

"elijahwright" shall henceforth be used in place of "scapegoat"

elijahwright · on May 27, 2014

Awesome! It's what I've always wanted!!!

arakkisu · on May 27, 2014

it was that way at tech, no reason for it to change now

elijahwright · on May 27, 2014

Now I have to figure out who you are. :-)

knodi · on May 27, 2014

Sure blame on the engineers. You give power, people use it badly blame the engineer for giving too much power. You don't give enough power sysadmins/users bitch and yell why don't we have enough power, we're not children.

Its always the engineer fault. :(

alrs · on May 27, 2014

Systems engineers, software engineers, architects, whatever. We're all in the same gang.

My point is that the problem in this case is likely the system's design, not one engineer's typing abilities.

jsmthrowaway · on May 27, 2014

This comes down to operational philosophy, in the end. The point you're dancing around is whether the system should permit grave actions that don't make any sense when you're designing the system.

By the rules, every single system on a commercial aircraft has a circuit breaker. Pilots make the "what if X catches on fire?" case, which is actually pretty compelling. However, that also means there are several switches overhead that will ostensibly crash the airplane if pulled. Pilots lobby very strongly for the aircraft not to fight them in any way because they are the only ones with the data, in the moment, now. They have final command over the aircraft in every way.

I use this to point out that as you're designing systems for operations people -- something we're increasingly doing ourselves as devops/SRE takes hold -- you might think you can anticipate every scenario and design suitable safeguards into the system. However, sometimes, when Halley's Comet refracts some moonlight into swamp gas and takes your fleet down, you as an operator have to do some really crazy shit. It's in that moment, when all hell has broken loose, I'm at the helm, and based on the data available to me I have made a decision to shoot the system in the head: if the system fights me and prolongs an outage because we argued about whether we'd ever need to reboot a fleet all at once, I'm replacing the system as the first item in my postmortem. If you make me walk row to row flipping PDUs, we're going to have words.

That's just my philosophy. Give the operators the knives and let them cut themselves, trusting that you've hired smart people and understanding mistakes will happen. Your philosophy may vary. By all means, ask me to confirm. Ask me for a physical key, even. But if you ever prevent me from doing what I know must be done, you are in my way. I have yet to meet a system that is smarter than an operator when the shit hits the fan (especially when the shit hits the fan).

There's probably a broader term for operational philosophy like this.

mcherm · on May 28, 2014

...and the operations version of that is that all normal operations are performed under restricted permissions that cannot "do anything", while the full "do anything" permissions are only broken out during a major crisis.

Such an approach would have prevented this incident where "normal" operations were being performed and accidentally ALL the servers were rebooted at once.

alrs · on May 27, 2014

I tend to agree with you, with the caveat that you can't have this philosophy and sell your customers 99.999% uptime[0].

[0] http://www.joyent.com/products/compute-service/features/linu...

jsmthrowaway · on May 27, 2014

I disagree wholeheartedly. Your operational philosophy complements your SLA goals, it doesn't force them.

alrs · on May 28, 2014

I can't figure out how your comment that "understanding mistakes will happen" is compatible with 99.999% uptime.

I'm of the opinion that 99.999% for an individual instance isn't particularly achievable in a commodity hosting environment. That kind of uptime doesn't leave much room for the mistakes that you and I both anticipate.

I do think that 99.999% is doable for a properly distributed whole-system across multiple geographically-dispersed datacenters.

I think Joyent has gone wrong in promoting individual instance reliability.

elijahwright · on May 28, 2014

Always ask how numbers like that are computed.

mvc · on May 28, 2014

They're not. That's a statement of what customers have enjoyed up until now. The actual SLA simply states what refund you get for each hour of downtime.

CHY872 · on May 27, 2014

It's a combined fault. Clearly the operator made a mistake, but the system shouldn't have let such a calamitous operation take place without at least three levels of "Are you sure" (or something smarter like "Confirm how many servers would you like to reboot:") before it lets you take down thousands of servers.

berns · on May 27, 2014

Joyent's marketing is not the most transparent. They haven't updated AWS prices in their pricing page since AWS lowered their prices two months ago.

4ad · on May 28, 2014

What?

Joyent doesn't use AWS.

Diederich · on May 27, 2014

The 'devops' automation I made at my last company (and am building at my current company) had monitoring fully integrated into the system automation.

That is, 'write' style automation changes (as opposed to most 'remediation' style changes) would only proceed, on a box by box basis, if the affected cluster didn't have any critical alerts coming in.

So, if I issued a parallel, rolling 'shutdown the system' command to all boxes, it would only take down a portion of all of the boxes before automatically aborting because of critical monitoring alerts.

Parallel was calculated based on historical but manually approved load levels for each cluster, compared to current load levels. So parallel runs faster if there's very low load on a cluster, or very slowly if there's a high load on a cluster.

One way or another, most automation should automatically stop 'doing things' if there's critical alerts coming in. Or, put another way, most automation should not be able to move forward unless it can verify that it has current alert data, and that none of that data indicates critical problems.

jameshart · on May 27, 2014

DevOps means being able to take out an entire datacenter with a single keysstroke...

stephengillie · on May 27, 2014

As a Devops, I can't justify building any automated way to down or restart all of my systems at once. We've only had to do that to resolve router reconvergence storms when changing out (relatively) major infrastructure pieces, such as our Juniper router.

michaelt · on May 27, 2014

You don't intentionally build an automated way to take down all your servers at once.

You build a way to automatically perform some mundane standard procedure, like propagating a new firewall rule to all your systems at once. Then you accidentally propagate a rule that blocks all inbound ports. Huh, when I tested locally I didn't notice that.

Or you build a way to automatically delete timestamped log files more than a month old. And when it runs in production, it also deletes critical libraries which have the build timestamp in their filename. Ah, the test server was running a nightly build instead of a release so the files were named differently.

Or you build a way to automatically deploy the post-heartbleed replacement certificates to all your TLS servers, and only after you do that you find you didn't deploy the replacement corporate CA certificate to all the clients. Hmm, the test environment has a different CA arrangement, so testers don't get the private keys of prod certificates.

Or you build a way to retain timestamped snapshots of all your files, every five minutes, so you can roll back anything - then find that huge log file that constantly changes gets snapshotted every time, and everything is hanging because of lack of disk space. Oh, production does get a lot more traffic to log, now I think about it.

Or you do any of a hundred other things that seem like simple, low risk operations until you realise they aren't.

codexon · on May 27, 2014

Once I typed

  rm -rf logs_ *

instead of

  rm -rf logs_*

linker3000 · on May 27, 2014

Our less-than-savvy Financial Director took it upon himself to restore from tape the bought ledger files to a live system after a slight mishap. Unfortunately, the bought ledger files all started with a 'b' and he managed to restore them to the root of the -nix system instead of the right place, so he mv'd b* to the right location.

All was well until a scheduled maintenance restart a few weeks later and we (eventually) discovered that /boot and /bin were AWOL.

Edit: He had access to the root account to maintain the accounts app (not my call)

knodi · on May 27, 2014

I have nightmares about such things.

akerl_ · on May 27, 2014

Unfortunately, the same tools that allow someone to automate management of systems can easily become catastrophic.

As one of the other commenters noted, a ~20 character salt command will do this. I doubt Joyent built a Big Red Button to take down a datacenter, I expect this will be the case of somebody missing an asterisk or omitting a crucial flag while trying to do their normal work.

tommu · on May 27, 2014

Sorry - are you telling us you had to reboot all nodes because you swapped a router out? Sounds like you need a network engineer.

tommu · on May 27, 2014

And I'm being downvoted for that? Seriously? In 13 years of networking I have never once had to reload machine to help with OSPF or BGP convergence. Good networking architecture and planning should mitigate anything other than a couple of minute outage. No routing change should ever require a reload of a server or end node.

cpayne · on May 27, 2014

I believe you were down voted not for what you said, but the way you have said it.

I've been down voted several times for (what I see) as relatively minor remarks. The HN readers are a sensitive bunch...

stephengillie · on May 28, 2014

Those who are still posting on HN are orders of magnitude more sensitive than those who post on Imgur. The communities are similar-size, yet Imguraffes are much, much more accepting of my comments. What merits a handful of upvotes there brings a downvote or two on this site.

stephengillie · on May 28, 2014

You're assuming my management has been paying for good networking architecture for the past dozen years.

tommu · on May 28, 2014

I believe it. Networking is seen as a commodity now. It's transparent until it fails. There's a whole lot of technical debt lurking out there. I personally have seen the dark shadow of spanning tree suck the light out of DevOp engineers eyes.

akurilin · on May 27, 2014

DevOps Borat is going to have a field day today.

shiftpgdn · on May 27, 2014

Let this be a lesson to linux admins. Re-alias shutdown -r now into something else on production servers. I once took down access to about 6000 servers because I ran the script to decommission servers on our jump box when I got the SSH windows confused.

shagie · on May 27, 2014

At one point, I worked in a computer lab that was mostly Ultrix machines. The shutdown grace period was specified in minutes ( http://www.polarhome.com/service/man/generic.php?qf=shutdown... )

Then we got a hp-ux machine in the lab. For some reason, the grace period on that system was in seconds ( http://www.polarhome.com/service/man/generic.php?qf=shutdown... )

System dax shutting down in 5 seconds.

_hnwo · on May 27, 2014

Might I suggest molly-guard: https://packages.debian.org/unstable/admin/molly-guard

imrehg · on May 28, 2014

Cheers for this! Would have saved me so much grief before. Now going around and installing it on the servers I manage (fortunately nothing mission critical, but many remote).

larrys · on May 27, 2014

"when I got the SSH windows confused"

I've come close to that as well.

This reminds me of the paradox of being competent vs. a beginner.

It also has parallels in a few thing outside computing.

Beginners make different mistakes because they don't know enough to go quickly.

Once you are experienced you fly, similar to the way you drive in a trance without thinking some times.

With power tools I've seen this as well. You tend to take more chances the more experience you have (or even in my case getting cut with an exacto knife). Someone using a saw for the first time is going to go slowly and follow the directions (of course there are other types of safety mistakes they could make for sure..)

While a newbie might do rm -fr directory * instead of rm -fr directory* an experienced user could do that as well [1] simply by going to fast and not thinking "hey I'm doing something dangerous let me slow down and check before I auto hit return".

[1] I typically do

for i in something* do echo $i done

Then if I like what I see I will up arrow and insert "rm -fr $i" after the echo. Or maybe a read x to pause in between.

(Note: I'm not a sysadmin but I've done over many years sysadmin tasks because it is kind of relaxing in a way..)

cordite · on May 27, 2014

I once put `shutdown -h now` (halt) instead of `shutdown -r now` (reboot)

Once I realized what had happened on the production server I ended up calling OVH (and they were helpful but not immediately acting).

It's not a good feeling.

smtddr · on May 27, 2014

This happened to me once; I don't know if this works on all linux distros but if you quickly follow a halt/shutdown with a "sudo init 6"(reboot) before your ssh-session gets SIGTERMed/KILLed, the box comes back up. This at least worked on some Ubuntu version a few years back.

Give it a try on some system that's not critically important :)

cordite · on May 27, 2014

Yeah, but the problem is when you honestly didn't realize calling a halting shutdown until the server doesn't come back 5 minutes later and then you review the terminal

icebraining · on May 27, 2014

I tend to use /sbin/reboot instead, it amounts to the same (calls shutdown), but it's harder to get it mixed up.

stavrus · on May 27, 2014

A similar case happened with the Eve Online cluster (~50,000 concurrent users) a couple of years ago. A programmer, who for some reason had access to the live cluster, confused his local development instance with that of the live cluster and issued a shutdown. Luckily they were able to avert the incoming disaster in time (it was a timed shutdown), but jokes are still made about the mistake.

http://oldforums.eveonline.com/?a=topic&threadID=1232785

dharbin · on May 27, 2014

salt '*' system.reboot

quickdry21 · on May 27, 2014

> hubot restart all on prod

oh shit i meant stag fuckfuckfuckfuck

qbrass · on May 27, 2014

So have hubot second guess any changes to production unless you specifically told it you were messing with prod beforehand. Have it wait a few seconds before doing something important and listen for sounds of regret.

>hubot restart all on prod

hubot: > say "Hubot isn't responsible for hosing production because I actually meant staging"

>Hubot isn't responsible for hosing production because I actually meant staging

hubot: okay, don't say I didn't warn you.

>oh shit i meant stag fuckfuckfuckfuck

hubot: I hadn't started yet, but I'm doing it anyway just to teach you a lesson.

angersock · on May 27, 2014

Why in the name of all that is holy do you have Hubot getting access to your production boxen?

Why does that seem like a good idea, ever?

akoumjian · on May 27, 2014

My thought, exactly. Time to setup some good ACL :-) http://docs.saltstack.com/en/latest/ref/clientacl.html

jordanthoms · on May 27, 2014

Looks like the janitor needed somewhere to plug in the vacuum cleaner again...

hack_edu · on May 27, 2014

Not even just the plug. I've had outages from bits flipped simply by the static electricity generated when vacuuming near servers.

linker3000 · on May 28, 2014

5W walkie talkies in a big sports complex with the RF getting into the keyboard controllers and acting like a maniac was punching the keyboard - would eventually hang the servers.

Fix: Replace cheapened-keyboards-with-mylar-film-(not)-screening with older models that had a full metal cage around the keyboard assembly.

socceroos · on May 27, 2014

You have carpets in your server room??

saganus · on May 27, 2014

I assume bash.org?

wiml · on May 28, 2014

It's a truly ancient anecdote; it probably predates the Internet.

The first example in RISKS is in 1994: http://catless.ncl.ac.uk/Risks/15.59.html#subj3.1 but the canonical version of the story is in a Cape Town hospital in 1996: http://web.archive.org/web/20040624065333/http://www.legends...

gknoy · on May 27, 2014

He might be referring to The daily WTF (worse than failure):

http://thedailywtf.com/Articles/I-Didnt-Do-Anything.aspx

Unintentional Mishap while Contractor Unplugs X to fix/maintain Y is a relatively common theme on their list of horror stories.

edit: I think he might actually have meant this one: http://thedailywtf.com/Articles/I-Told-You-So.aspx

saganus · on May 28, 2014

And I get 2 downvotes for this? really? downvoters care to explain why, just for asking if it was a reference from bash? Wow... Edit: Thanks to the other 2 posters who provided alternative sources. You learn by asking, no? or at least some of us do..

devinegan · on May 27, 2014

Joyent has been having some serious issues over the past month or two. I am not sure if it is growing pains, bad luck or what is happening, but we had already lost faith and trust in their Cloud prior to today. This is the nail in the coffin from our perspective. Moving on...

rincebrain · on May 27, 2014

Howso?

devinegan · on May 28, 2014

Thanks for asking rather than just down-voting. I wanted others to know that this isn't isolated. We have been having issues with their service for a few months now. They never know when there is a problem with hardware, for instance. Joyent support will gladly tell you everything is fine. After you insist, and insist they will actually have someone look at the underlying infrastructure. Eventually they will acknowledge the problem and fix it (maybe). I believe the monitoring and reporting for their team is flawed or incomplete which leads to more downtime of affected systems. Just one observation, but we have had three incidents over the past month and a half. Two within a week of each other.

bcantrill · on May 28, 2014

I'm sorry to hear about your experience; we pride ourselves on being able to root-cause problems regardless of where they might be in the stack, but it sounds like your problem didn't get properly escalated. If you want to reach out to me privately (my HN login at acm.org), we can try to figure out what happened here -- with my apologies again for the subpar experience.

shanselman · on May 27, 2014

"What's this button do?"

SEJeff · on May 27, 2014

As I've always said, "You can never protect a system from a stupid person with root".

You can limit carnage and mitigate this type of thing, but you can't fully protect against sysadmins doing dumb things (unless you just hire great sysadmins)

akerl_ · on May 27, 2014

If you think that "hire great sysadmins" prevents somebody from fatfingering, you must be hiring from some more evolved species. Nobody is immune to mistakes; preventing this kind of issue is something the infrastructure and procedures should do.

llamataboot · on May 27, 2014

I don't think "just hiring great sysadmins" is possible. People have off-days or are tired or sick, new people get on-boarded, even great people make mistakes, etc.

protomyth · on May 27, 2014

...or accidentally switch which of the 25 term sessions they had open

I tend to color my production terms in a red background / yellow font scheme. It tends to inspire the tired brain to understand you are in production.

incision · on May 27, 2014

Not only do you consider mistakes the province of stupid people doing dumb things, but you're crediting yourself with a proverb about it and suggesting that you posses the ability to sniff these people out from the 'great' ones.

Get a grip, you're recursively full of yourself.

SEJeff · on May 28, 2014

Wow HN seriously!? I never once pretended that I'm able to hire people who don't make mistakes, only that you can't protect systems from administrators who mess up.

Get a grip people.

wmf · on May 27, 2014

So don't give anyone root on an entire data center.

akerl_ · on May 27, 2014

Is this like Captain Planet? It's a bit exceptional to divide access servers of similar type between administrators such that individuals have full access to a portion of the fleet. Do they meet up and put their rings together to roll out updates? What if one of them goes on vacation?

lmm · on May 27, 2014

There are keysharing protocols; you can do something like 5 sysadmins have a split of the master key such that any 3 of them can access the master account.

akerl_ · on May 27, 2014

For day-to-day maintenance of systems, that's crippling. If I need 2 cosigns to run "date" across the fleet while I'm troubleshooting an NTP issue, and then 2 cosigns again to run "service ntpd status", and so forth, my coworkers will have lit my desk on fire long before I fix the clocks.

There are definitely use cases for keysharing systems like you describe: if we're talking about getting access to a database with sensitive information, or signing a new cert that all our systems are about to put their full faith in. But for the day-to-day administrative efforts, it's overkill and ends up being counterproductive: after a certain point, Alice and Bob write scripts that let them hotkey signing off on my requests.

rodgerd · on May 27, 2014

I'm not worried about how crippling that sort of scenario is on a day to day basis, because presumably the company doesn't mind paying a fortune for a bunch of people to sit around to hold one anothers' keys.

I worry about those policies when the shit hits the fan and you're trying to fix a production problem hobbled by an inability to do stuff without three fingers on every keystroke.

akerl_ · on May 27, 2014

Agreed. Ideally, whatever system is in use for managing infrastructure provides sanity checks while I'm working, but either gets out of my way or can be sidestepped if need be. I don't want to be crippled by technical red tape when things are on fire.

aiiane · on May 27, 2014

"date" and service status don't typically require root.

richardkmichael · on May 27, 2014

I've not needed this, but it's a nice idea. Do you do this with a combination of sudo/PAM|pubkey auth? I can google, but can you push me off in the right direct? Thanks!

lmm · on May 28, 2014

I've not been directly involved, so your googling may well be as good as mine; on a quick look you might have to do this manually using ssss (and then each person encrypts their piece with gpg --symmetric or the like).

0xbadcafebee · on May 28, 2014

Actually, capabilities makes it trivial to lock down things like shutdown for admin accounts. A script can do the shutdown instead in a more controlled and less error-prone fashion. Same for network device updates. Abstraction.

mikeash · on May 27, 2014

That has its own risks. There might be some catastrophe that need root access on everything to fix, and you can't reach enough people to get it....

vertex-four · on May 27, 2014

Then put the keys to datacenter-wide root somewhere safe (with a manual-ish process to access and use them), but out of the way and with alarms on it (the same alarms that you'd use in the absolute worst situation possible). Make sure anyone using it will be shamed if they don't absolutely have to.

akerl_ · on May 27, 2014

Shame is a terrible tool for ensuring compliance. The people you want to keep will resent the fact that you're using shame as a motivating factor.

jsmthrowaway · on May 27, 2014

If you think keys in a safe is a good idea, ask a Googler about the legend of the Valentine safe. Short version: nobody was able to get into the safe and a locksmith had to come drill it to restore a critical service.

It's also a cautionary tale about testing your DR occasionally.

Gravityloss · on May 27, 2014

You're going to need a bigger crew.