I read the part where they said they pored over "hundreds of sentry logs" and immediately was like "no you didn't."
This is not an error that would be difficult to spot in an error aggregator; it would throw some sort of constraint error with a reasonable error message.
> we implemented logic to prevent the node from going down at the top of a minute when possible since — given the nature of cron — that is when it is likely that scripts will need to be scheduled to run
Why not smear the start times of the jobs across the seconds of that minute to avoid any thundering herd problems? How much functionality relies on a script being invoked at exactly the :00 mark? And if the functionality does depend on that exact timing, doesn't that suggest something is fragile and could be redesigned to be more resilient?
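
To be concrete, something like this is what I'm picturing (just a rough sketch; the wrapper name, the delay bound, and the crontab line are mine, not anything from the article):

```python
# Hypothetical jitter wrapper: each cron entry calls this instead of the
# script directly, and it sleeps a random number of seconds within the
# minute before running the real command, so jobs scheduled for the same
# minute don't all fire at :00.
import random
import subprocess
import sys
import time


def run_with_jitter(cmd, max_delay_seconds=59):
    # Spread start times uniformly across the minute (assumed bound).
    time.sleep(random.uniform(0, max_delay_seconds))
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    # Example crontab line (hypothetical paths):
    #   * * * * * /usr/bin/python3 /opt/jitter.py /opt/scripts/real_script.sh
    sys.exit(run_with_jitter(sys.argv[1:]))
```
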
At their scale, staggering script start times over a 60-second window likely wouldn't have much of an impact if they are experiencing a thundering herd, imo. And if it did help, it would be a band-aid and a ticking time bomb until someone actually solves the load problem that staggering start times kicked down the road.
This seems like a tautology because what we think of as a good “experience” is something that lets us get our work done quickly. No one ever said a tool had great devex but was hard to get stuff done with.