Most companies completely missed the point of SRE/PE/DevOps and keep them on sep...

snowfield · 2024-03-03T17:46:50 1709488010

The regression is also because a real SRE is expensive. It's cheaper to just hire some new grads to react to alarms by following a set runbook that says what to do when each alarm triggers.

VERY few companies operate at Google's scale. For 99.99% of companies it makes sense to investigate single-machine issues.

bananapub · 2024-03-03T18:39:19 1709491159

Google SREs also end up investigating single machine issues, fyi.

jeffbee · 2024-03-03T18:52:05 1709491925

Yes, but At Scale®

It's a totally different experience when the people who technically own the hardware side of operations take no responsibility for its well-being, the people who own the software develop elaborate workarounds for bad machines, and the SREs maintain blacklists of individual nodes.

dpbriggs · 2024-03-04T02:02:22 1709517742

In my experience it's fun to do that, but it's only worth it when SLOs are on the line (i.e., when a significant number of machines are bad).