
Most companies completely missed the point of SRE/PE/DevOps and keep them on separate teams doing sysadmin toil work and oncall thrown over the wall by engineers who are only concerned with feature deadlines. They regress them back to sysadmin duties and get none of the value of a true SRE program.

SRE should always be a subtitle for a SWE and not a separate position, and they should always be embedded with SWEs into one team building either products or infrastructure. The shared ownership and toil reduction only work if you have these two things.

All this said, I think the regression is also due to the fact that real SREs are rare. A solid SWE who also has deep systems domain knowledge, understands how to sift through dashboards and live data, and can root-cause complex performance problems is a master of many domains and is hard to find.




The regression is also due to the fact that a real SRE is expensive. It's cheaper to just get some new grads to react to alarms by following a set runbook of what to do when a given alarm triggers.

VERY few companies operate at Google's scale. For 99.99% of companies it makes sense to investigate single machine issues.


Google SREs also end up investigating single machine issues, fyi.


Yes, but At Scale®

It's a totally different experience when you have the people who technically own the hardware side of the operations taking no responsibility for the well-being of it, and the people who own the software developing elaborate workarounds for bad machines, and the SREs maintaining blacklists of individual nodes.


In my experience it's fun to do that but only worth it when SLOs are on the line (so a significant number of bad machines).



