Hacker News

I'd be very surprised if disk/machine rot were what it took for such a large organization to start experiencing major problems. Strange user patterns from all this, plus recent features/refactors that are suddenly unmonitored, would hurt something in the stack within a couple of weeks if it were really abandoned. Keeping a few dozen senior engineers/SREs familiar with big chunks of the infra could stretch that to months, though.


Exactly. Things that are abandoned will eventually fail.

Not everything is automated. There's a finite amount of time to create new automation, and that time goes first toward the things that happen hourly or daily. The things that happen monthly still get handled by humans. Not everything is in the runbook either, and what is documented often ends up scattered among many wiki pages and help notes in tools, so it's not easy for a newcomer to find. So those monthly things have to be done by people who remember.

They remember how to clean stuff up when a quota/capacity limit is being approached, and who to call when they need more, and they can expect a response. They remember how to recognize when their service is approaching overload, or about to enter an oscillating state, and they know how to nudge it back toward sanity before the errors start to pile up. None of that happens when whole groups are gone.
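The checks described above are exactly the kind of tribal knowledge that never makes it into the runbook. A minimal sketch of what writing two of them down might look like (function names, thresholds, and the oscillation heuristic are all illustrative assumptions, not taken from any real system):

```python
# Hypothetical sketches of "things people just remember":
# a quota-approaching-limit check and a crude oscillation detector.

def quota_needs_cleanup(used_bytes, quota_bytes, threshold=0.85):
    """Flag a quota that is approaching its limit (threshold is illustrative)."""
    return used_bytes / quota_bytes >= threshold

def looks_oscillating(samples, min_flips=3):
    """Heuristic: a metric whose deltas keep changing sign may be
    entering an oscillating state (e.g. a retry storm). Counts how
    many times consecutive deltas flip direction."""
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    flips = sum(1 for d1, d2 in zip(deltas, deltas[1:]) if d1 * d2 < 0)
    return flips >= min_flips
```

For example, `quota_needs_cleanup(90, 100)` fires while `quota_needs_cleanup(10, 100)` doesn't, and a sawtooth like `[10, 50, 12, 48, 11, 47]` trips `looks_oscillating` while a steady ramp doesn't. The point of the comment stands either way: encoding even these trivial checks takes time nobody budgeted, so they live in people's heads.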

Sooner or later, a service will start to run off the rails in one way or another, and the person who inherited that service will either not recognize it or not find the solution in time. Then it's Fail Whale time.



