
> That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.

> The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.

I'm no FAANG 10x engineer, and I appreciate things can be obvious in hindsight, but I'm somewhat surprised that engineering at the level of Cloudflare does not:

1. Push out files A/B to ensure the old file is not removed.

2. Handle the failure of loading the file (for whatever reason) by automatically reloading the old file instead and logging the error.

This seems like pretty basic SRE stuff.
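To illustrate point 2, here's a minimal sketch of a loader that validates the freshly pushed file and falls back to the last known-good copy instead of crashing. The file names, the size limit, and the validate() check are all invented for the example; this is not Cloudflare's actual code.

    // Sketch only: keep the last known-good feature file alongside the new
    // one, and fall back if the new one fails validation or can't be read.
    package main

    import (
        "fmt"
        "os"
    )

    const maxFeatureFileSize = 16 << 20 // hypothetical size limit

    // validate enforces the same limits the consumer will apply, so an
    // oversized or malformed file is rejected before it replaces the old one.
    func validate(data []byte) error {
        if len(data) > maxFeatureFileSize {
            return fmt.Errorf("feature file too large: %d bytes", len(data))
        }
        // ... parse and sanity-check the contents here ...
        return nil
    }

    // loadFeatureFile tries the freshly pushed file first and falls back to
    // the previous known-good copy, logging the failure instead of crashing.
    func loadFeatureFile(newPath, lastGoodPath string) ([]byte, error) {
        if data, err := os.ReadFile(newPath); err == nil {
            if verr := validate(data); verr == nil {
                // Promote the new file to "last known good" for the next cycle.
                _ = os.WriteFile(lastGoodPath, data, 0o644)
                return data, nil
            } else {
                fmt.Fprintf(os.Stderr, "new feature file rejected: %v; falling back\n", verr)
            }
        }
        return os.ReadFile(lastGoodPath)
    }

    func main() {
        data, err := loadFeatureFile("features.new", "features.last-good")
        if err != nil {
            fmt.Fprintln(os.Stderr, "no usable feature file:", err)
            os.Exit(1)
        }
        fmt.Printf("loaded %d bytes of feature data\n", len(data))
    }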





Yep, a decent canary mechanism should have caught this. There's a trade-off between canarying and rollout speed, though. Since this is a system for fighting bots, I'd expect it to be optimized for the latter.

I'm shocked that an automatic canary rollout wasn't an action item. Pushing anything out globally all at once guarantees this failure will happen again in the future.

Even if you want this data to be very fresh, you can probably afford to do something like the following (rough sketch after the list):

1. Push out data to a single location or some subset of servers.

2. Confirm that the data is loaded.

3. Wait to observe any issues. (Even a minute is probably enough to catch the most severe issues.)

4. Roll out globally.
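A rough sketch of that staged rollout, under the assumption of a one-minute soak window; the push, healthy, and host names are placeholders, not any real Cloudflare API.

    // Sketch of the staged rollout described above.
    package main

    import (
        "fmt"
        "time"
    )

    func push(hosts []string, file string) error {
        // ... distribute the file to the given hosts ...
        fmt.Printf("pushed %s to %d hosts\n", file, len(hosts))
        return nil
    }

    func healthy(hosts []string) bool {
        // ... check that the file loaded and error rates look normal ...
        return true
    }

    func rollout(file string, canary, rest []string) error {
        // 1-2. Push to a small subset and confirm the data is loaded.
        if err := push(canary, file); err != nil {
            return err
        }
        // 3. Soak for a short window; even a minute catches hard crashes.
        time.Sleep(1 * time.Minute)
        if !healthy(canary) {
            return fmt.Errorf("canary unhealthy, aborting rollout of %s", file)
        }
        // 4. Only then push everywhere else.
        return push(rest, file)
    }

    func main() {
        canary := []string{"edge-canary-1"}
        rest := []string{"edge-2", "edge-3"}
        if err := rollout("features.new", canary, rest); err != nil {
            fmt.Println("rollout failed:", err)
        }
    }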


Presumably optimal rollout speed means getting as close as you can to "push it everywhere at once and activate immediately". That's fine if you'd rather risk a short outage than a delayed rollout. What I don't understand is why the nodes don't have any independent verification and rollback mechanism. I might be underestimating the complexity, but it really doesn't sound much more involved than one process launching another process, concluding that it crashed, and restarting it with different parameters.
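A sketch of that "launch, watch for a crash, restart with different parameters" idea; the binary name, flag, and file names are made up for illustration.

    // Sketch only: a tiny supervisor that retries with the old config.
    package main

    import (
        "fmt"
        "os"
        "os/exec"
    )

    // runOnce starts the traffic-routing process with the given feature file
    // and waits for it to exit; a non-nil error means it crashed or exited
    // non-zero.
    func runOnce(featureFile string) error {
        cmd := exec.Command("./bot-management", "--features", featureFile)
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr
        return cmd.Run()
    }

    func main() {
        // Try the newly pushed file first; if the child dies, restart it on
        // the previous known-good file and flag the bad push for follow-up.
        if err := runOnce("features.new"); err != nil {
            fmt.Println("child failed with new feature file:", err, "- reverting")
            if err := runOnce("features.last-good"); err != nil {
                fmt.Println("child failed even with last-good file:", err)
            }
        }
    }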

I think they need to seriously evaluate whether they need this level of rollout speed. Even spending a few minutes on an automated canary gives you a ton of safety.

Even if the servers weren't crashing, it's possible that a bad set of parameters results in far too many false positives, which may as well be a complete failure.
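Which is why a canary check should compare behavioral metrics against a baseline, not just "did the process stay up". A tiny sketch of that idea; the block-rate metric and the 2x threshold are invented for the example.

    // Sketch: flag a push whose block rate jumps far above the fleet-wide
    // baseline, which would suggest mass false positives.
    package main

    import "fmt"

    func canaryAcceptable(canaryBlockRate, baselineBlockRate float64) bool {
        const maxRelativeIncrease = 2.0 // hypothetical: allow up to 2x baseline
        return canaryBlockRate <= baselineBlockRate*maxRelativeIncrease
    }

    func main() {
        fmt.Println(canaryAcceptable(0.03, 0.02)) // true: within tolerance
        fmt.Println(canaryAcceptable(0.60, 0.02)) // false: likely bad parameters
    }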



