> That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.
> The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
I'm no FAANG 10x engineer, and I appreciate things can be obvious in hindsight, but I'm somewhat surprised that engineering at the level of Cloudflare does not:
1. Push out files A/B to ensure the old file is not removed.
2. Handle the failure of loading the file (for whatever reason) by automatically falling back to the old file and logging the error.
This seems like pretty basic SRE stuff.
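Just as a sketch of what I mean by (1) and (2), nothing more: the paths, size limit, and "promote on successful load" step are all invented here, but the fallback logic is only a couple of dozen lines.

    use std::fs;

    // Hypothetical paths and size limit, purely for illustration.
    const NEW_PATH: &str = "/etc/bot-mgmt/features.new";
    const OLD_PATH: &str = "/etc/bot-mgmt/features.current";
    const MAX_SIZE: u64 = 4 * 1024 * 1024;

    fn try_load(path: &str) -> Result<Vec<u8>, String> {
        let meta = fs::metadata(path).map_err(|e| e.to_string())?;
        if meta.len() > MAX_SIZE {
            return Err(format!("{path} is {} bytes, over the limit", meta.len()));
        }
        fs::read(path).map_err(|e| e.to_string())
    }

    fn load_features() -> Vec<u8> {
        match try_load(NEW_PATH) {
            Ok(data) => {
                // Only promote the new file to "current" once it has loaded successfully.
                let _ = fs::copy(NEW_PATH, OLD_PATH);
                data
            }
            Err(err) => {
                eprintln!("new feature file rejected ({err}); keeping the previous one");
                try_load(OLD_PATH).expect("no known-good feature file available")
            }
        }
    }

    fn main() {
        println!("loaded {} bytes of feature data", load_features().len());
    }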
Yep, a decent canary mechanism should have caught this. There's a trade off between canarying and rollout speed, though. If this was a system for fighting bots, I'd expect it to be optimized for the latter.
Presumably optimal rollout speed means getting as close as you can to "push it everywhere all at once and activate immediately". That's fine if you'd rather risk a short outage than a delayed rollout; what I don't understand is why the nodes don't have any independent verification and rollback mechanism. I might be underestimating the complexity, but it really doesn't sound much more involved than a process launching another process, concluding that it crashed, and restarting it with different parameters.
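To make that concrete, here's a toy supervisor along those lines. The worker binary name, flag, and file paths are made up; the point is just that "relaunch with the previous file" is not a lot of machinery.

    use std::process::Command;

    // Toy supervisor: binary name, flag, and paths are invented.
    fn main() {
        let candidates = ["/etc/bot-mgmt/features.new", "/etc/bot-mgmt/features.current"];

        for path in candidates {
            // Launch the worker against one feature file and wait for it.
            let status = Command::new("bot-mgmt-worker")
                .arg("--feature-file")
                .arg(path)
                .status();

            match status {
                Ok(s) if s.success() => return, // worker is happy with this file
                Ok(s) => eprintln!("worker exited with {s} using {path}, trying fallback"),
                Err(e) => eprintln!("could not launch worker at all: {e}"),
            }
        }

        eprintln!("all feature files exhausted; raise an alert instead of taking the node down");
    }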
I think they need to seriously evaluate whether they need this level of rollout speed. Even spending a few minutes with an automated canary gives you a ton of safety.
Even if the servers weren't crashing, it's possible that a bad set of parameters results in far too many false positives, which may as well be complete failure.
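Rough shape of what such an automated canary bake could look like, with invented thresholds and a stubbed metrics call (the false-positive guard covers the case above):

    use std::{thread, time::Duration};

    // Thresholds and the metrics source are stubbed/invented; this only shows the
    // shape of an automated canary bake.
    const BAKE: Duration = Duration::from_secs(5 * 60);
    const MAX_CRASH_RATE: f64 = 0.0;
    const MAX_POSITIVE_RATE_DELTA: f64 = 0.05; // guard against "far too many false positives"

    struct CanaryMetrics {
        crash_rate: f64,
        bot_positive_rate_delta: f64,
    }

    fn fetch_canary_metrics() -> CanaryMetrics {
        // Would query the monitoring system for the canary nodes in reality.
        CanaryMetrics { crash_rate: 0.0, bot_positive_rate_delta: 0.01 }
    }

    fn deploy_to_canary_nodes() { /* push the new feature file to a small set of nodes */ }
    fn rollback_canary() { /* restore the previous feature file on those nodes */ }
    fn deploy_everywhere() { /* continue the global rollout */ }

    fn main() {
        deploy_to_canary_nodes();
        thread::sleep(BAKE); // let the canaries serve real traffic for a few minutes

        let m = fetch_canary_metrics();
        if m.crash_rate > MAX_CRASH_RATE || m.bot_positive_rate_delta > MAX_POSITIVE_RATE_DELTA {
            rollback_canary();
            eprintln!("canary failed, global rollout aborted");
        } else {
            deploy_everywhere();
        }
    }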
> Why would anyone expect anyone else to serve video for them for free?
I would expect a freemium service selling encrypted "zero trust" networking to have no idea what traffic is being pushed through my network, which would make enforcement impossible.
Nobody's asking for a free lunch, but the reasonable thing to do would be to simply bandwidth limit freemium accounts across the board, not make exceptions for certain kinds of traffic in what should be a secure network.
Cloudflare does say "video and other large files", so in the end it's about volume, not data type. They probably just want to keep the discretion to decide specific cases arbitrarily rather than define a uniform blanket limit.
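For what it's worth, a uniform, content-agnostic limit is conceptually simple; something like a per-account byte budget (a token bucket), with numbers pulled out of thin air:

    use std::time::Instant;

    // Numbers are arbitrary; a real edge would track this per account/connection in the proxy.
    struct ByteBudget {
        capacity: f64,       // burst size in bytes
        tokens: f64,
        refill_per_sec: f64, // sustained bytes per second for the plan
        last: Instant,
    }

    impl ByteBudget {
        fn new(capacity: f64, refill_per_sec: f64) -> Self {
            Self { capacity, tokens: capacity, refill_per_sec, last: Instant::now() }
        }

        // True if `bytes` may be sent now. The check never looks at what the bytes are.
        fn allow(&mut self, bytes: f64) -> bool {
            let now = Instant::now();
            let refill = now.duration_since(self.last).as_secs_f64() * self.refill_per_sec;
            self.tokens = (self.tokens + refill).min(self.capacity);
            self.last = now;
            if self.tokens >= bytes {
                self.tokens -= bytes;
                true
            } else {
                false
            }
        }
    }

    fn main() {
        // e.g. a free plan: ~10 MB burst, ~1 MB/s sustained
        let mut budget = ByteBudget::new(10e6, 1e6);
        println!("first 8 MB allowed: {}", budget.allow(8e6));
        println!("next 8 MB allowed:  {}", budget.allow(8e6));
    }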
> they know that if other languages that address one of the main rust claims without all the cruft gains popularity they lose the chance of being permanently embdedded in places like the kernel
First of all, I'm really opposed to saying "the kernel". I'm sure you're talking about the Linux kernel, but there are other kernels (BSD, Windows, etc.) that are certainly big enough that Linux shouldn't just be called "the" kernel, and that may also have their own completely separate Rust stories.
Secondly, I think the logic behind this makes no sense, primarily because Rust is at this point 10 years past its first stable release and almost 20 years past its initial release; its adoption into the Linux kernel wasn't exactly rushed. Even if it had been, why would Rust adoption in the Linux kernel exclude adopting another language as well, or switching to one, if it's better? The fact that Rust was accepted alongside C at all disproves the assumption, because clearly that kernel is open to "better" languages.
The _simplest_ explanation for why Rust has succeeded is that it solves actual problems, not that "zealots" are lobbying for it to ensure they "have a job".
Rust is not stable even today! There is no spec, no alternative implementation, no conformance test suite... "Stable" just means "whatever the current compiler compiles"! Existing code may stop compiling any day...
Maybe in 10 years it will become stable, like other "boring" languages (Golang and Java).
This lack of stability is why Linus opposes Rust's integration into the kernel.
In the "other good news department", GCC is adding a Rust frontend to provide the alternative implementation, and I believe Rust guys accepted to write a specification for the language.
I'm waiting for gccrs before I start using the language, actually.
> So now I'm holding firm opinion, that these high-FPS displays are marketing gimmick.
While I agree the jump from 60 -> 140 Hz/fps isn't as noticeable as 30 -> 60, calling everything above 60 a "marketing gimmick" is silly. When my screen or TV falls back to 60 Hz for whatever reason, I notice it immediately; you don't have to do anything other than move your mouse or scroll down a webpage.
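To put rough numbers on it: at 60 Hz each frame is on screen for 1000/60 ≈ 16.7 ms, while at 120-144 Hz it's roughly 7-8 ms, so cursor movement and scrolling update about twice as often. That difference in update interval is exactly what you feel when a display silently drops back to 60 Hz.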
I have no idea about the technical details, but I suspect the comparison you're making isn't that relevant. As I understand it, this is just a project that happens to be based on NetBSD, and given enough work you could probably do the same with FreeBSD.