
> Importantly, engineers should be on-call for their code - being on-call creates a positive feedback loop and makes it easier to know if their efforts in writing production-ready code are paying off. I’ve heard people complain about the prospect of being on-call, so I’ll just ask this: if you’re not on-call for your code, who is?

OK, that's fine, but in that case we can shut down the production system outside of business hours so that our work-life balance isn't affected. Oh? We can't shut down the production system outside of business hours? So we need to be on call continuously, 24/7, meaning we can't ever be off the grid or unavailable? That sounds like we're expected to give up our personal lives at a moment's notice? Interesting, hmm.




Not 24/7/52, on a rota. And paid. It doesn't mean no work-life balance, it means part of the work is being available when on-call (limited 'life'), and life as normal when not.

It's hardly some horrendous controversial idea, nor unique to software engineering.


Right.

At several places I worked (and others I asked about in job interviews), the general amount companies get away with (and employees find bearable) seems to be <= 5 weeks of on-call per employee per year.

And obviously you're being paid to be on standby, and then paid for your overtime should an incident occur.


Are construction workers on call for the buildings they build? Maybe. But they'll be called like once every 5 years. Because their industry actually has standards.

Devs think they're hot stuff, when in reality we're probably one of the most abused professions out there. (I'm talking about regular devs, not people who were born into wealth/went to good schools etc.)


While they're building, yes there are absolutely people on call for issues on the site out of hours.

At my former employer I was on an on-call rotation; I'm obviously not now that it's a 'former' employer, so the building analogy doesn't really hold up. (And it's not just about leaving the company: my former colleagues now working on something else at the same company aren't on call for the software they wrote but are no longer responsible for.)


>>> And paid

The article did not mention anything about pay or compensation for oncall.


The article also didn't mention whether you are notified via email, SMS, or Slack; that seems like a detail handled by the business.


When it comes to money, that is a rather important 'detail'. Especially given the fact that the most prevalent form of theft is wage theft: https://www.gq.com/story/wage-theft


For the purposes of the article, that's still a business concern. Presumably on-call expectations are part of the compensation agreement (as they are in most industries).


I can't think of a single job/interview/offer where the expectations or the compensation for on call were discussed formally, let alone agreed.

The best I get in job interviews is usually a mention that there is a rota every X period. Then I have to poke at the interviewers, trying to guess what it's actually like without coming across as too negative: "When was the last time you worked on a weekend?" "When was the last time you were woken up in the middle of the night?"


There is always a cost to software in production.

The issue is who pays and when?

You can pay that cost upfront - for example the JPL/NASA SDLC. This will ensure you won't get woken at odd hours, but the massive upfront cost is something most businesses won't pay.

You can sling code without tests and fix it in prod, hoping speed will help you find product market fit.

Pretty much everyone sits somewhere between the two. This article just describes one point on the spectrum that the author feels is best practice - but to be honest the trade-offs vary across this spectrum.

Probably the right way to think of this is "the total cost of making this software NASA level is 10X, and the revenue from such perfectly working software would be 20X (with no loss due to downtime).

As such, if you ask me to not code to NASA standards, I and my team will incur a personal cost of 5X in being woken up, stressful release days, etc.

Therefore you will compensate me with payments of 5-10X."
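
To make that arithmetic concrete, here's a toy sketch (the only inputs are the X multiples above; the 1X "quick build" cost is my own made-up number):

    # Toy model of the trade-off above; all figures are in units of X, not real money.
    nasa_build_cost = 10       # upfront cost of NASA-level reliability
    revenue_if_perfect = 20    # revenue with no losses from downtime
    quick_build_cost = 1       # hypothetical cost of the "sling it and ship" version
    personal_oncall_cost = 5   # 2am pages and stressful releases, borne by the team

    savings = nasa_build_cost - quick_build_cost
    print(f"business saves {savings}X up front; team absorbs {personal_oncall_cost}X")
    # hence the claim: compensation in the 5-10X range comes out of that saving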

This discussion is much easier with a Union involved


So developers who are on-call are paid 5-10x as much as JPL/NASA engineers?


No.

Ok - there is a spectrum of reliability - let's say that NASA produces the most reliable code anywhere, and that it has a very high cost to produce code like that. At the other end of the spectrum is some guy slinging PHP code out without any testing, hoping that it will turn into the next unicorn.

If we asked both ends of the spectrum to write code to solve the same business problem (a pet food delivery app), then the guy slinging PHP will get woken up at 2am regularly because the server is always crashing. The NASA guy will never get woken up, but the app will probably hit the market a year after the first one.

So the business has to choose a trade off - sling code and get lots of 2 am wake up calls or wait and possibly lose market share to a competitor.

Now there was a famous example of a Reddit co-founder who slept next to his laptop and just rebooted the server every two hours until they discovered Python Supervisor. That seems OK - the business (the co-founder) was making the trade-off and exploiting the worker (the same co-founder). The worker was happy to take the job because they were likely to get paid if it all worked out (and it did).
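
For anyone who hasn't used it: Supervisor basically automates that reboot loop. A minimal sketch of the idea (app.py is a hypothetical entry point, and real supervisord does much more - logging, back-off, process groups):

    # Rough sketch of what a process supervisor automates: restart the app
    # whenever it exits, instead of a human doing it at 2am.
    import subprocess
    import time

    while True:
        proc = subprocess.Popen(["python", "app.py"])  # hypothetical entry point
        proc.wait()      # blocks until the process crashes or exits
        time.sleep(5)    # brief back-off before restarting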

The issue comes when the worker on call is not making the business judgement. How much should they demand in payment?

If they have a healthy equity payment in a growth company, that might work just like the above founder. Otherwise the payment needs to come out of the money not spent.

So I guess my argument is that there is a fixed cost to reliable software for the business - it should either pay for highly reliable software up front, or it should pay the saved cash to the code slinger each time the server goes down.

This will change the trade off mathematics.


I actually think it is a good feedback loop, but it needs to be staffed well and the costs need to be well understood before a startup just takes on a global 24/7 product and hopes the devs sort it out.

I did 24/7 support solo and with just 1-2 other devs for years on a global system, and never again will I do DevOps in such a small team with 24/7 on-call requirements. The cost of maintaining features and systems varies so much that providing great enterprise support can be a non-issue or a constant headache that you have little to no control over (e.g. a system that takes a dependency on external data you can't control).

On top of pushing features out constantly, maintaining quality, and automating everything you can, a startup can easily fall into building systems its staff can't maintain without significantly impacting output, as well as the mental health of its devs. I think the problem is that it is hard to see these costs up front: you can build systems these days on cloud providers where most of the time things will come back on their own without intervention, but obviously it depends on what impact being offline for 5 minutes vs 3 hours has on the business.


Isn't that answered in your quote: "I’ve heard people complain about the prospect of being on-call, so I’ll just ask this: if you’re not on-call for your code, who is?"

To me this implies an on-call rotation where you know your expectations. Not "we need to be on call continuously, 24/7, meaning we can't ever be off the grid or unavailable". Many other industries have the idea of being on-call, and they are "expected to give up our personal lives at a moment's notice" when they know they are on-call. (For example, my brother-in-law is a surgery tech; he's had to take off during family outings more than once.)

Also, if this happens often enough that it's a serious problem, this says a lot about the quality of the code you own.


Code quality is a result of many factors, but the single biggest is whether management treats developers as a profit centre or a cost centre. That makes the difference in:

- Developer compensation

- Training and career development

- Staffing properly i.e. not under-staffing

- Giving devs proper slack time between tasks and not over-burdening them with projects

- Letting developers own the stack not just in name only but truly own the technical decisions made in the stack without micro-management, including choice of language, platform, etc.

Without all those factors, it's a red herring to point to the code quality. The code quality is just the final output of all of the above decisions.


If you're writing software or operating infrastructure, you need to be on call. Otherwise you don't have skin in the game. It makes you a better engineer, and at most places increases the quality of your software in two ways. One, you don't want to get that call at 2am, so you think more about reliability, edge cases, writing playbooks, etc. Two, when things do go wrong, you perform a post-mortem and you get the action items into your stream of work. Additionally, you should always track on-call stats and use them as a metric in your team health checks. If people are getting called a lot out of hours, it's time to pull the cord and sort it out.
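
On the stats part, even a dumb tally of out-of-hours pages per week is enough to trigger the "pull the cord" conversation. A minimal sketch, assuming you can export pages from your paging tool as (timestamp, service) pairs and that "out of hours" means weekends or outside 09:00-18:00:

    # Count out-of-hours pages per ISO week from an exported incident log.
    from collections import Counter
    from datetime import datetime

    def out_of_hours(ts: datetime) -> bool:
        # weekends, or outside 09:00-18:00 local time (assumption)
        return ts.weekday() >= 5 or not (9 <= ts.hour < 18)

    def pages_per_week(pages):
        weekly = Counter()
        for ts, service in pages:
            if out_of_hours(ts):
                weekly[ts.isocalendar()[1]] += 1
        return weekly

If that number climbs for a couple of weeks in a row, that's the signal to prioritise the post-mortem action items.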


Is the project manager called too? Or is it just the developer who gets pressured in delivery and then pressured in maintenance, basically at the very low end of the food chain?


Kind of. Whoever is on call should be able to deal with the issue. Your highest level of escalation should be your engineering manager, VP of engineering, or CTO. Your product manager should care about team health checks and how much you are getting paged, then prioritise engineering time to reduce it. It's about collaboration. If it doesn't work out, maybe you could put your product manager on shadow on-call too. I haven't ever had to do this, but it could be fun.


We’ve done that: executive team, product owners, and managers. It was only for teams in a particular area of our product. System stability became a funded project.


This “skin in the game” reasoning is nonsensical. Highly competent and passionate engineers don’t avoid mistakes because they’re bummed out about being woken up in the middle of the night by the consequences of them. That thinking is based on a cartoon version of human motivation; its twin is the idea that offering money and promotion as incentives will lead to better performance.

Your manager may fancy themselves a latter-day Cortés, but you don’t need to play their mind games (most of them based on misunderstood readings of an unsettled science) to create an effective and high-functioning organisation.


But the decision to operate the service or infrastructure 24/7 was a business decision, it wasn't my decision. Why should I be on the hook for a decision I didn't make? And if the business really, really wants 24/7 availability, why should that cut into my personal time outside of work? Why shouldn't the business set up teams in multiple timezones for a follow-the-sun model?

At the end of the day all this 'you need to be on call for your code' is purely a business money-saving ploy. We are an industry full of suckers, I guess, because we fell for the 'plausible-sounding' explanation hook, line, and sinker.


I'm generally for eating your own dogfood, but to play devil's advocate for a moment: if everyone is responsible for maintaining their own features, could it incentivise you to not ship any features at all?


> It makes you a better engineer,

Nah, it makes you a better serf. Are you working at Amazon and getting paid 400k/year? Sure, do whatever. But regular devs making 70k shouldn't put up with this bullshit.


No mature organisation would expect engineers to be on-call continuously, 24/7. There are ways to have a sane, balanced approach to on-call. See the SRE book for one example: https://landing.google.com/sre/sre-book/chapters/being-on-ca...


So what's a reasonable on-call schedule for developers?


> So we need to be on call continuously, 24/7, meaning we can't ever be off the grid or unavailable?

The well-established solution to 24/7 availability is to operate a shift pattern.


Yep, and if you're on-call 24/7 for a week, I'd say it's only fair you do nothing else preemptively, because you need to allow for the possibility of being paged.

No manager or employer would ever buy that shit because it rounds in the direction of less work though.


A shift pattern without anyone on call. 3x 8hr shifts in a 24hr day[1], optionally distributed around the planet so that all shifts are working in their local daytime.

[1] Gene Ray understood this.
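
The scheduling itself is trivial once you have three sites roughly 8 hours apart: each UTC hour maps to whichever site is inside its local working day. A sketch with made-up site offsets (ignoring DST):

    # Follow-the-sun coverage: map each UTC hour to the site whose local
    # 09:00-17:00 working day covers it. Offsets are illustrative only.
    SITES = {"Dublin": 0, "Seattle": -8, "Singapore": 8}  # UTC offsets

    def covering_site(utc_hour):
        for site, offset in SITES.items():
            if 9 <= (utc_hour + offset) % 24 < 17:
                return site
        return None

    print([covering_site(h) for h in range(24)])  # every hour is covered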


Someone is always on call, so it’s just a case of who's on the hook.


Factually wrong. All developers could turn off their phones/pagers and then there's no on-call anymore!

The worst that can happen is that the company is down for a few hours overnight. Issues can be investigated and fixed during office hours.

I'd wager that most companies don't have global customers and don't need 24/7 coverage.


> the worst that can happen is the company is down for a few hours overnight

I think this is a great example of why disagreements arise on HN: different world experiences and base assumptions. For many companies, being unavailable for that window of time would be catastrophic. We had one client that suffered about an hour of downtime (it turned out to be their issue). They put the cost of that hour at 5 million dollars.


$5M/hour can pay for a lot of engineers. So as you say - with such assumptions - you can and should pay for up-front design, QA, and people on-call; otherwise you have only yourself to blame for the loss.
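
Back of the envelope (the $200k fully loaded annual cost per engineer is my assumption, not a figure from this thread):

    # How much on-call staffing one hour of that downtime would have paid for.
    downtime_cost_per_hour = 5_000_000   # figure from the parent comment
    engineer_annual_cost = 200_000       # assumed fully loaded cost per engineer
    print(downtime_cost_per_hour / engineer_annual_cost)  # 25.0 engineer-years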


Yes, and in the long term it's always better/more scalable/more efficient to be disciplined enough to have someone on call who has not designed/written the thing that's running.


Why is that?

The whole "you build it you run it" movement is an attempt to fix dev teams just not giving a fuck about quality of code they put out, especially from a reliability point of view.

Why is the opposite better?


I guess because that forces the documentation to be good enough that someone without the faintest idea of what the software does can operate it.

This approach is probably more scalable, especially in big companies where you can have operations teams on call for a myriad of projects.

I personally believe that this does not guarantee a better service.


It's a complement not an opposite.

It's exactly like properly/cleverly documenting your code/project: not only for others, now or in a few years, but also for yourself later on.

It's having common rules across teams to get more reliability out of the whole company.

You build it, you run it. Fine. Until the point where you can't anymore (because... reasons - it just happens). In any activity you want to sustain, you always have to have backups (in people and in processes), instead of relying on yourself/your team alone.


If your teams don’t give a fuck about the quality of their code, why would they give a fuck about the quality of their production support?


Because they'll be rung at 4am...


And what happens when they (or the most competent among them) leave? You will need to do in a rush what could have been prepared beforehand.

A business as a whole takes that into account as seriously as its disaster recovery processes (not necessarily something you focus on early, but you eventually do).


+1! And the feedback loop on this method is awesome too. Even in the short term you should have different parties for Dev and Ops.



