Hacker News

Thank you for calling out on-call responsibilities in your job listing. Too many job listings today fail to mention that _very significant_ responsibility.

I enjoy working with distributed storage systems, but I don't think I will ever carry a pager for one again. I wish the industry could figure out how to separate designing and building such systems from giving up your nights and weekends to operate them.




Separating design and build from operate is antithetical to Amazon. It isn’t a “figure out” for a lot of companies including Amazon — it’s very intentional and seemingly unlikely to change. They’ve observed that they create a stronger culture of ownership (which then drives getting things fixed faster and more empathy for the customers) through having the builders also be the operators.

Still needs supportive management: there are teams at Amazon that have time to fix everything that paged them at anti-social hours, and there are teams that don’t prioritize anything beyond meeting the SLA on their COE Action Items, so they silently accrue operational debt and page people more often. Tricky balance to be sure.

Even the ‘SRE’ or ‘PE’ approaches you see at Google and Meta don’t obviate the need for development teams to have on-call rotations. At least in “BigTech”, where teams operate services instead of shipping shrink-wrapped software, it’s becoming rare NOT to see some on-call responsibility attached to engineering roles (including management). I suppose it isn’t just on-call: the other big change in BigTech over the last decade was the fairly widespread elimination of QA teams and SDET roles, with those responsibilities merged into the feature/service teams and the SDE role.


There are different schools of thought around this and I certainly understand your perspective. At AWS, carrying a pager at limited times (in our team, 2-3 weeks per quarter as mentioned in the link) is considered an important part of our culture of operating at-scale services. In our team, we try to minimize oncall burden as much as possible by investing in automation and by alarming only if the system really doesn't know what to do. We have a separate planning bucket for burden reduction every quarter.

Another interesting thing to mention is that as an SDE you're not the only one with oncall duties. In our team at least, PMTs are also oncall for about the same amount of time. This creates a good dynamic, as everyone is incentivized to minimize the oncall burden.


Being on call aligns incentives. If what you design and build is someone else's problem to operate, it will operate less well.


Isn't that the idea behind separating out the SRE (site reliability engineer) role from software engineering?


Sort of. Many teams in FAANG put their devs on rotations that aren't full on-call like SRE (and some managers put their devs into full SRE rotations without mentioning there is a bonus). I always check with my future managers that they don't plan to do this.


Haha, well aware, as a current on-call SDE at one of them!



