
Early in my career, I'd cobbled together an internal server which effectively did monitoring (RRDTool) against the internal BugZilla(!) server. The end result was speedily cached charts and graphs of bug counts by team, by project, by release, etc.

Over time, the system had grown to be basically _THE_ source of truth for the release management process for a Top-500 website of the era. Go/No-go decisions were based on looking at the trend-lines for bug graphs. Projects were merged to "trunk" only if their P1 counts were low enough.

When I decided to leave I wanted to make sure things were left in a sustainable place, and asked around for who to transition the server to (it was a random beige box in the QA Lab, circa 2005). "It's Debian running MySQL, it has an auto-update script which will bug you if you need to update packages, here's how to restart the server, here's the root password, here's how to update/deploy it, docs are in this wiki."

I'd given four weeks' notice. Week 2.5 rolled around, and even though I'd kept asking about who to transition this pet project to, nobody was swinging by my desk (so to speak) to pick up the responsibility.

So I made a little script: `@hourly if (( $RANDOM % 2 )); then ifdown eth0; else ifup eth0; fi`

Suddenly people started showing up at my desk... "What's going on? Can you fix it? What's happening?" Sorry, nope, can't fix it, the network on it is down... see, I can't ssh into it. By the way, this is exactly what could happen once I've left. Just wait an hour or two and it should come back online. If you send over someone you want to be taking care of this machine, I can walk over to the keyboard with them and show them how to fix it.

Once they committed to sending someone over to transition to, I turned the BOFH routine down to a 25% chance of 15-minute outages... and all was right with the world. Later that week, a really good guy, one of the main end users of the software (sorry, Mike!), came over and explained the relative immaturity of my actions, but I couldn't help pointing out their effectiveness.
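For anyone curious, a rough sketch of what that turned-down version might have looked like. This is not the original script; the hourly cron trigger, the interface name, and the exact probability/sleep mechanics are assumptions based on the description above:

    #!/bin/bash
    # Hypothetical reconstruction, not the original script.
    # Run hourly from root's crontab, e.g.:  @hourly /usr/local/bin/flaky-eth0.sh
    # Roughly a 25% chance of a 15-minute outage on eth0 each hour.
    if (( RANDOM % 4 == 0 )); then
        ifdown eth0    # drop the interface
        sleep 900      # stay offline for ~15 minutes
        ifup eth0      # bring it back up
    fi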

In retrospect, it was an internal "API brownout" before that had a name. The concept of "controlled unreliability" (e.g. Chaos Monkey) as a tool for increasing business resiliency is something to take to heart, even if only unofficially.



So did you eventually transition it to someone? Wouldn't be the first time I've seen someone commit to something and then quietly not follow through.

Or learn to live with random 15m outages because it’s apparently less frustrating than doing something.


Yeah, at minimum I was able to hand off the root password, and after that I turned off the unreliability script.

In retrospect, it's probably frightening how quickly the business could have become accustomed to random 15-minute outages as just a cost of doing business.



