As a DevOps engineer, I can't justify building any automated way to shut down or restart all of my systems at once. We've only had to do that to resolve router reconvergence storms when swapping out (relatively) major infrastructure pieces, such as our Juniper router.
You don't intentionally build an automated way to take down all your servers at once.
You build a way to automatically perform some mundane standard procedure, like propagating a new firewall rule to all your systems at once. Then you accidentally propagate a rule that blocks all inbound ports. Huh, when I tested locally I didn't notice that.
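Something like this, to be concrete (a hypothetical iptables version of the change):

    # intended: tighten the default inbound policy, exceptions to follow
    iptables -P INPUT DROP
    # the companion rule allowing SSH never made it into the change,
    # so every box that receives the policy drops your next connection

Locally it "works" because you're testing on the console or over an already-established session.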
Or you build a way to automatically delete timestamped log files more than a month old. And when it runs in production, it also deletes critical libraries which have the build timestamp in their filename. Ah, the test server was running a nightly build instead of a release so the files were named differently.
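The cleanup job can look as innocent as this (paths and filenames are made up):

    # intended: purge month-old timestamped logs
    find /opt/app -name '*-20[0-9][0-9]*' -mtime +30 -delete
    # on prod, release libraries like libfoo-20140601.so carry the build
    # date in their names, so the same pattern sweeps them up too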
Or you build a way to automatically deploy the post-Heartbleed replacement certificates to all your TLS servers, and only after you do that do you find you didn't deploy the replacement corporate CA certificate to all the clients. Hmm, the test environment has a different CA arrangement, so testers don't get the private keys of prod certificates.
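You see it the moment a client without the new CA talks to a reissued server (hostname and path illustrative):

    # client still trusting only the old corporate CA
    openssl s_client -connect app.internal:443 -CAfile /etc/ssl/corp-ca-old.pem
    # verify error:num=20:unable to get local issuer certificate

and of course the test environment never hits that path, because its CA setup is different.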
Or you build a way to retain timestamped snapshots of all your files, every five minutes, so you can roll back anything - then find that the huge log file that constantly changes gets snapshotted every time, and everything grinds to a halt for lack of disk space. Oh, production does get a lot more traffic to log, now I think about it.
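The gist of that job, guessing at the mechanism (rsync hard-link snapshots, paths hypothetical):

    # crontab: snapshot everything every five minutes
    */5 * * * * rsync -a --link-dest=/snap/latest /data/ /snap/$(date +\%Y\%m\%d\%H\%M)/
    # unchanged files cost almost nothing (hard links), but the multi-GB log
    # that changes every interval gets a brand new copy each run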
Or you do any of a hundred other things that seem like simple, low risk operations until you realise they aren't.
Our less-than-savvy Financial Director took it upon himself to restore the bought ledger files from tape to a live system after a slight mishap. Unfortunately, the bought ledger files all started with a 'b', and he managed to restore them to the root of the *nix system instead of the right place, so he mv'd b* to the right location.
All was well until a scheduled maintenance restart a few weeks later and we (eventually) discovered that /boot and /bin were AWOL.
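For anyone who hasn't pieced it together, the fatal step was roughly this (destination path illustrative):

    cd /
    mv b* /usr/local/ledger/data/
    # the glob silently matches /bin and /boot along with the b-prefixed
    # ledger files, and nothing complains until the next boot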
Edit: He had access to the root account to maintain the accounts app (not my call)
Unfortunately, the same tools that let someone automate the management of systems can just as easily automate a catastrophe.
As one of the other commenters noted, a ~20 character salt command will do this. I doubt Joyent built a Big Red Button to take down a datacenter; I expect this was a case of somebody missing an asterisk or omitting a crucial flag while trying to do their normal work.
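For the curious, roughly the ~20 characters I mean (the narrower intended target is my guess):

    salt '*' system.reboot
    # the intended command was probably something like: salt 'web03*' system.reboot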
And I'm being downvoted for that? Seriously? In 13 years of networking I have never once had to reload a machine to help with OSPF or BGP convergence. Good network architecture and planning should mitigate anything beyond a couple of minutes' outage. No routing change should ever require a reload of a server or end node.
Those who are still posting on HN are orders of magnitude more sensitive than those who post on Imgur. The communities are of similar size, yet Imguraffes are much, much more accepting of my comments. What merits a handful of upvotes there brings a downvote or two on this site.
I believe it. Networking is seen as a commodity now. It's transparent until it fails. There's a whole lot of technical debt lurking out there. I personally have seen the dark shadow of spanning tree suck the light out of DevOps engineers' eyes.