Yeah... the comment above reads like someone who has read a lot of books on CI deployment but has zero experience actually doing it in a real-world environment. Quick to throw stones, with absolutely no understanding of any of the nuances involved.
There is no nuance needed - this is a giant corporation that sells kernel-layer intermediation at global scale. You'd better be spending billions on bulletproof deployment automation because *waves hands around in the air, pointing at what's happening, just like with SolarWinds*
Bottom line: this was avoidable and negligent.
For the record, I owned global infrastructure as CTO for the USAF Air Operations weapons system - one of the largest multi-classification networked IT systems ever created for the DoD - even more so during a multi-region refactor as an HQE hire into the AF.
So I don’t have any patience for millionaires not putting in the work when it’s critical infrastructure.
People need to do better, and we need accountability for people making bad decisions to save money.
Almost everything that goes wrong in the world is avoidable one way or another. Simply stating "it was avoidable" as an axiom is simplistic to the point of silliness.
Lots of very smart people have been hard at work to prevent airplanes from crashing for many decades now, and planes still crash for all sorts of reasons, usually considered "avoidable" in hindsight.
Nothing is "bulletproof"; this is a meaningless buzzword with no content. The world is too complex for this.
I am not defending or excusing anything. I am saying there is not enough information to make a judgement one way or the other. Right now, we have almost zero technical details.
Call me old-fashioned and boring, but I'd like to have some basic facts about the situation first. After that, I'll decide who does and doesn't deserve a bollocking.
No, CrowdStrike has explicitly stated that the cause was a logic error in the rules file. They have also stated: "This is not related to null bytes contained within Channel File 291 or any other Channel File."
It’s not a matter of excusing or not excusing it. Incidents like this one happen for a reason, though, and the real solution is almost never “just do better.”
Presumably CrowdStrike employs some smart engineers. I think it’s reasonable to assume that those engineers know what CI/CD is, they understand its utility, and they’ve used it in the past, hopefully even at CrowdStrike. Assuming that this is the case, then how does a bug like this make it into production? Why aren’t they doing the things that would have prevented this? If they cut corners, why? It’s not useful or productive to throw around accusations or demands for specific improvements without answering questions like these.
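For concreteness: the standard answer to "what would have prevented this" is a staged (canary) rollout - push an update to a small ring of machines first, watch health telemetry, and only widen the blast radius if the ring stays healthy. Here's a minimal sketch of the idea; every name in it is hypothetical, not anything from CrowdStrike's actual pipeline:

```python
import time

# Hypothetical deployment rings, smallest blast radius first.
RINGS = [("internal-fleet", 0.001), ("customer-canary", 0.01), ("broad", 1.0)]

def ring_is_healthy(ring: str) -> bool:
    """Placeholder for real telemetry checks (crash rates, boot loops,
    agent heartbeats). Always 'healthy' here so the sketch runs end to end."""
    return True

def staged_rollout(update_id: str, soak_seconds: float = 1.0) -> bool:
    """Push an update ring by ring, halting at the first sign of trouble."""
    for name, fraction in RINGS:
        print(f"deploying {update_id} to {name} ({fraction:.1%} of fleet)")
        # deploy_to_ring(update_id, name)  # hypothetical deploy hook
        time.sleep(soak_seconds)  # let telemetry accumulate (hours in reality)
        if not ring_is_healthy(name):
            print(f"{name} unhealthy: halting rollout and rolling back")
            # rollback(update_id, name)  # hypothetical rollback hook
            return False
    return True

if __name__ == "__main__":
    staged_rollout("channel-update-291")
```

Even a crude version of that gate turns a global outage into a bad afternoon for a small canary ring, which is why "did they stage this rollout, and if not, why not?" is the question worth asking.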
Not an excuse - they should be testing for this exact thing - but CrowdStrike (and many similar security tools) have a separation between "signature updates" and "agent/code" updates. My (limited) reading of this situation is that this was an update to their "data", not the application. Now, apparently the dynamic update included operating code, not just something equivalent to a YAML file or whatever, but I can see how different kinds of changes like this go through different pipelines. Of course, that is all the more reason to ensure you have integration coverage.
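To make "integration coverage" concrete: even for pure content updates, the CI step can run every candidate rules file through the same parser the agent uses and block the deploy on any failure. A minimal sketch, assuming a hypothetical JSON rules format (the real channel-file format is proprietary and certainly different):

```python
import json
import sys

# Hypothetical schema for illustration only.
REQUIRED_FIELDS = {"rule_id", "pattern", "action"}

def validate_rules_file(path: str) -> list[str]:
    """Run a candidate rules file through the same checks the agent's
    parser would apply; an empty list means it is safe to ship."""
    problems = []
    with open(path, "rb") as f:
        raw = f.read()
    if b"\x00" in raw:
        problems.append("file contains null bytes")
    try:
        rules = json.loads(raw)
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        return problems + [f"unparseable: {exc}"]
    if not isinstance(rules, list):
        return problems + ["top level must be a list of rules"]
    for i, rule in enumerate(rules):
        if not isinstance(rule, dict):
            problems.append(f"rule {i}: not an object")
            continue
        missing = REQUIRED_FIELDS - rule.keys()
        if missing:
            problems.append(f"rule {i}: missing fields {sorted(missing)}")
    return problems

if __name__ == "__main__":
    issues = validate_rules_file(sys.argv[1])
    for issue in issues:
        print("BLOCKED:", issue)
    sys.exit(1 if issues else 0)  # non-zero exit fails the CI job
```

The point isn't this particular checker; it's that the "data" pipeline deserves the same parse-it-before-you-ship-it gate as the code pipeline, because the agent treats both as input it must survive.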