Hacker News

I'd be very surprised if disk/machine rot were what it took for such a large organization to start experiencing major problems. Strange user patterns from all this, plus recent features/refactors that are suddenly unmonitored, would hurt something in the stack within a couple of weeks if it were really abandoned. Keeping a few dozen senior engineers/SREs familiar with big chunks of the infra could stretch that to months, though.


Exactly. Things that are abandoned will eventually fail.

Not everything is automated. There's a finite amount of time to create new automation, and that time goes first toward the things that happen hourly or daily. The things that happen monthly still get handled by humans. Not everything is in the runbook either, and what is documented often ends up scattered among many wiki pages and help notes in tools, so it's not easy for a newcomer to find. So those monthly things have to be done by people who remember.

They remember how to clean stuff up when a quota/capacity limit is being approached, and who to call when they need more, and they can expect a response. They remember how to recognize when their service is approaching overload, or about to enter an oscillating state, and they know how to nudge it back toward sanity before the errors start to pile up. None of that happens when whole groups are gone.
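The checks described above are exactly the kind of tribal knowledge that never makes it into the runbook. A minimal sketch of what writing two of them down might look like (function names, thresholds, and the oscillation heuristic are all illustrative assumptions, not taken from any real system):

```python
# Hypothetical sketches of "things people just remember":
# a quota-approaching-limit check and a crude oscillation detector.

def quota_needs_cleanup(used_bytes, quota_bytes, threshold=0.85):
    """Flag a quota that is approaching its limit (threshold is illustrative)."""
    return used_bytes / quota_bytes >= threshold

def looks_oscillating(samples, min_flips=3):
    """Heuristic: a metric whose deltas keep changing sign may be
    entering an oscillating state (e.g. a retry storm). Counts how
    many times consecutive deltas flip direction."""
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    flips = sum(1 for d1, d2 in zip(deltas, deltas[1:]) if d1 * d2 < 0)
    return flips >= min_flips
```

For example, `quota_needs_cleanup(90, 100)` fires while `quota_needs_cleanup(10, 100)` doesn't, and a sawtooth like `[10, 50, 12, 48, 11, 47]` trips `looks_oscillating` while a steady ramp doesn't. The point of the comment stands either way: encoding even these trivial checks takes time nobody budgeted, so they live in people's heads.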

Sooner or later, a service will start to run off the rails in one way or another, and the person who inherited that service will either not recognize it or not find the solution in time. Then it's Fail Whale time.



