Hacker News
Here’s what happens when Heroku goes down (gigaom.com)
20 points by craigkerstiens on Sept 30, 2011 | 5 comments


One way to reduce outages and bugs would be an incentive plan that pays on-call staff a bonus each time they fix a problem during their shift.

This would have two effects:

- Management would strongly encourage the design and implementation of systems that fail less, since fewer failures mean fewer bonus payouts.

- Employees would be more willing to take part in the on-call process.

Imagine one group of your workforce eagerly waiting to fix an impending failure while another group works just as eagerly to make sure those bonuses never get paid.

And, you could tie it all together with bonuses for everyone when you meet certain performance levels.


And employees would also put bugs in to reap the reward.


I certainly don't disagree that there are opportunities to game the "system," but in startup environments there are lots of people keeping a close eye on development and production environments. Someone gaming it will quickly be exposed.

Just to clarify, I'm not recommending this approach to enterprise corporate environments where layers upon layers of management and developers could easily derail what I am suggesting.

But for the startup, it is worthy of consideration.


  In most cases the pages, which arrive about two or three
  times during a 24-hour on-call period, require the engineer
  to take down the problematic instance and restart it.

I've never held a position that required a pager, but I always assumed that pages to on-call support people were for emergencies only. This seems like they're paging for non-emergency things that need to get looked at; wouldn't an email work just as well?


When an instance is "degraded", they need to start another instance quickly, or else the apps hosted on that instance will be down for good. So it's an emergency.

I think they don't have enough confidence in their detection of "degraded" instances to do it automatically, so it requires human intervention.
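
To illustrate that trade-off, here is a minimal sketch. It is purely hypothetical and not how Heroku actually does it: the InstanceMonitor class, the restart_after threshold, and the "ok" / "page" / "restart" action names are all assumptions made up for the example. The idea is that a single failed health check is too weak a signal to act on, so it pages a human, while several consecutive failures are treated as enough confidence to restart automatically.

```python
from dataclasses import dataclass


@dataclass
class InstanceMonitor:
    """Tracks consecutive failed health checks for one instance (hypothetical)."""

    restart_after: int = 3  # assumed threshold: consecutive failures before auto-restart
    failures: int = 0

    def record(self, healthy: bool) -> str:
        """Return the action to take after one health check result."""
        if healthy:
            self.failures = 0
            return "ok"
        self.failures += 1
        if self.failures >= self.restart_after:
            # Enough consecutive failures: act without waiting for a human.
            self.failures = 0
            return "restart"
        # Ambiguous signal: ask the on-call engineer to look.
        return "page"


monitor = InstanceMonitor(restart_after=3)
actions = [monitor.record(h) for h in (True, False, False, False, True)]
print(actions)  # ['ok', 'page', 'page', 'restart', 'ok']
```

Lowering restart_after means fewer pages but more restarts triggered by a transient blip, which is presumably the kind of false positive that makes a team prefer a human in the loop.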



