> That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.
> The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
I'm no FAANG 10x engineer, and I appreciate things can be obvious in hindsight, but I'm somewhat surprised that engineering at the level of Cloudflare does not:
1. Push out files A/B to ensure the old file is not removed.
2. Handle a failure to load the new file (for whatever reason) by automatically falling back to the old file and logging the error (see the sketch below).
This seems like pretty basic SRE stuff.
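For what it's worth, a minimal sketch of (1) plus (2), assuming a hypothetical A/B pair of files (`features-new.json` / `features-current.json`) and an illustrative size cap; none of these names are Cloudflare's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"os"
)

// Hypothetical size limit, standing in for whatever cap the consumer enforces.
const maxFeatureFileSize = 200 * 1024

// FeatureConfig stands in for whatever the bot-management module consumes.
type FeatureConfig struct {
	raw []byte
}

// loadFeatureFile enforces the size limit and would parse/validate the file.
func loadFeatureFile(path string) (*FeatureConfig, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	if len(data) > maxFeatureFileSize {
		return nil, errors.New("feature file exceeds size limit")
	}
	// ... real parsing/validation would happen here ...
	return &FeatureConfig{raw: data}, nil
}

// loadWithFallback tries the newly pushed file first and, if that fails for
// any reason, logs the error and keeps serving from the previous known-good file.
func loadWithFallback(newPath, oldPath string) (*FeatureConfig, error) {
	cfg, err := loadFeatureFile(newPath)
	if err == nil {
		return cfg, nil
	}
	log.Printf("failed to load %s, falling back to %s: %v", newPath, oldPath, err)
	return loadFeatureFile(oldPath)
}

func main() {
	// A/B layout: the push writes features-new while features-current stays untouched.
	cfg, err := loadWithFallback("features-new.json", "features-current.json")
	if err != nil {
		log.Fatal("both new and previous feature files are unusable: ", err)
	}
	fmt.Printf("loaded %d bytes of feature config\n", len(cfg.raw))
}
```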
Yep, a decent canary mechanism should have caught this. There's a trade-off between canarying and rollout speed, though. Since this is a system for fighting bots, I'd expect it to be optimized for the latter.
Presumably optimal rollout speed means getting as close as you can to “push it everywhere all at once and activate immediately”. That's fine if you'd rather risk a short outage than delay the rollout; what I don't understand is why the nodes don't have any independent verification and rollback mechanism. I might be underestimating the complexity, but it really doesn't sound much more involved than a process launching another process, concluding that it crashed, and restarting it with different parameters.
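Assuming a hypothetical `bot-mgmt-worker` binary and a short probation window (both made up for illustration), that supervisor could be little more than this sketch:

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// runOnce launches the worker with the given feature file and reports whether
// it stayed up past a short probation window.
func runOnce(featurePath string) bool {
	cmd := exec.Command("./bot-mgmt-worker", "--features", featurePath)
	if err := cmd.Start(); err != nil {
		log.Printf("failed to start worker with %s: %v", featurePath, err)
		return false
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		// Exited (or crashed) almost immediately: treat the new config as bad.
		log.Printf("worker exited early with %s: %v", featurePath, err)
		return false
	case <-time.After(30 * time.Second):
		// Survived the probation window; assume the config is usable.
		return true
	}
}

func main() {
	// Try the freshly pushed file first; if the worker crashes right away,
	// restart it with the previous known-good file instead.
	if !runOnce("features-new.json") {
		log.Print("rolling back to previous feature file")
		if !runOnce("features-current.json") {
			log.Fatal("worker also fails with the previous feature file")
		}
	}
	// A real supervisor would keep monitoring the child from here on.
}
```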
I think they need to seriously evaluate whether they need this level of rollout speed. Even spending a few minutes with an automated canary gives you a ton of safety.
Even if the servers weren't crashing, it is possible that a bad set of parameters results in far too many false positives, which may as well be complete failure.
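Both points suggest the same shape of check: a short, automated canary that watches a coarse signal (crashes, or a sudden jump in block rate as a proxy for false positives) before the file goes everywhere. A rough sketch, with the metric source and thresholds entirely made up:

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// canaryStats would come from real telemetry on the canary hosts; here it is
// a hypothetical stand-in returning the fraction of requests flagged as bots.
func canaryStats() float64 {
	// e.g. query metrics for hosts already running the new feature file
	return 0.02
}

// promoteIfHealthy activates the new feature file on a small canary slice,
// watches a coarse health signal for a few minutes, and only then allows the
// global rollout. A jump in block rate is treated like a crash: roll back.
func promoteIfHealthy(baselineBlockRate, tolerance float64) error {
	deadline := time.Now().Add(3 * time.Minute)
	for time.Now().Before(deadline) {
		rate := canaryStats()
		if rate > baselineBlockRate*(1+tolerance) {
			return fmt.Errorf("canary block rate %.3f exceeds baseline %.3f", rate, baselineBlockRate)
		}
		time.Sleep(15 * time.Second)
	}
	return nil
}

func main() {
	if err := promoteIfHealthy(0.02, 0.5); err != nil {
		log.Fatal("aborting rollout, keeping previous feature file: ", err)
	}
	log.Print("canary healthy, continuing global rollout")
}
```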