Yeah... the comment above reads like someone who has read a lot of books on CI deployment but has zero experience actually doing it in a real-world environment. Quick to throw stones, with absolutely no understanding of any of the nuances involved.
There is no nuance needed - this is a giant corporation that sells kernel-layer intermediation at global scale. You'd better be spending billions on bulletproof deployment automation because *waves hands around in the air, pointing at what's happening, just like with SolarWinds*
Bottom line: this was avoidable and negligent.
For the record, I owned global infrastructure as CTO for the USAF Air Operations weapons system - one of the largest multi-classification networked IT systems ever created for the DoD - even more so during a multi-region refactor as an HQE hire into the AF.
So I don’t have any patience for millionaires not putting in the work when it’s critical infrastructure.
People need to do better, and we need accountability for people making bad decisions to save money.
Almost everything that goes wrong in the world is avoidable one way or another. Simply stating "it was avoidable" as an axiom is simplistic to the point of silliness.
Lots of very smart people have been hard at work to prevent airplanes from crashing for many decades now, and planes still crash for all sorts of reasons, usually considered "avoidable" in hindsight.
Nothing is "bulletproof"; this is a meaningless buzzword with no content. The world is too complex for this.
I am not defending or excusing anything. I am saying there is not enough information to make a judgement one way or the other. Right now, we have almost zero technical details.
Call me old-fashioned and boring, but I'd like to have some basic facts about the situation first. After that, I'll decide who does and doesn't deserve a bollocking.
No, CrowdStrike has explicitly stated that the cause was a logic error in the rules file. They have also stated: "This is not related to null bytes contained within Channel File 291 or any other Channel File."
It’s not a matter of excusing or not excusing it. Incidents like this one happen for a reason, though, and the real solution is almost never “just do better.”
Presumably CrowdStrike employs some smart engineers. I think it’s reasonable to assume that those engineers know what CI/CD is, they understand its utility, and they’ve used it in the past, hopefully even at CrowdStrike. Assuming that this is the case, then how does a bug like this make it into production? Why aren’t they doing the things that would have prevented this? If they cut corners, why? It’s not useful or productive to throw around accusations or demands for specific improvements without answering questions like these.
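For concreteness: the standard answer to "what would have prevented this" is a staged (canary) rollout - push an update to a small ring of machines first, watch health telemetry, and only widen the blast radius if the ring stays healthy. Here's a minimal sketch of the idea; every name in it is hypothetical, not anything from CrowdStrike's actual pipeline:

```python
import time

# Hypothetical deployment rings, smallest blast radius first.
RINGS = [("internal-fleet", 0.001), ("customer-canary", 0.01), ("broad", 1.0)]

def ring_is_healthy(ring: str) -> bool:
    """Placeholder for real telemetry checks (crash rates, boot loops,
    agent heartbeats). Always 'healthy' here so the sketch runs end to end."""
    return True

def staged_rollout(update_id: str, soak_seconds: float = 1.0) -> bool:
    """Push an update ring by ring, halting at the first sign of trouble."""
    for name, fraction in RINGS:
        print(f"deploying {update_id} to {name} ({fraction:.1%} of fleet)")
        # deploy_to_ring(update_id, name)  # hypothetical deploy hook
        time.sleep(soak_seconds)  # let telemetry accumulate (hours in reality)
        if not ring_is_healthy(name):
            print(f"{name} unhealthy: halting rollout and rolling back")
            # rollback(update_id, name)  # hypothetical rollback hook
            return False
    return True

if __name__ == "__main__":
    staged_rollout("channel-update-291")
```

Even a crude version of that gate turns a global outage into a bad afternoon for a small canary ring, which is why "did they stage this rollout, and if not, why not?" is the question worth asking.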
Not an excuse - they should be testing for this exact thing - but CrowdStrike (and many similar security tools) have a separation between "signature updates" and "agent/code" updates. My (limited) reading of this situation is that this was an update to their "data", not the application. Now, apparently the dynamic update included operating code, not just something equivalent to a YAML file or whatever, but I can see how different kinds of changes like this go through different pipelines. Of course, that is all the more reason to ensure you have integration coverage.
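To make "integration coverage" concrete: even for pure content updates, the CI step can run every candidate rules file through the same parser the agent uses and block the deploy on any failure. A minimal sketch, assuming a hypothetical JSON rules format (the real channel-file format is proprietary and certainly different):

```python
import json
import sys

# Hypothetical schema for illustration only.
REQUIRED_FIELDS = {"rule_id", "pattern", "action"}

def validate_rules_file(path: str) -> list[str]:
    """Run a candidate rules file through the same checks the agent's
    parser would apply; an empty list means it is safe to ship."""
    problems = []
    with open(path, "rb") as f:
        raw = f.read()
    if b"\x00" in raw:
        problems.append("file contains null bytes")
    try:
        rules = json.loads(raw)
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        return problems + [f"unparseable: {exc}"]
    if not isinstance(rules, list):
        return problems + ["top level must be a list of rules"]
    for i, rule in enumerate(rules):
        if not isinstance(rule, dict):
            problems.append(f"rule {i}: not an object")
            continue
        missing = REQUIRED_FIELDS - rule.keys()
        if missing:
            problems.append(f"rule {i}: missing fields {sorted(missing)}")
    return problems

if __name__ == "__main__":
    issues = validate_rules_file(sys.argv[1])
    for issue in issues:
        print("BLOCKED:", issue)
    sys.exit(1 if issues else 0)  # non-zero exit fails the CI job
```

The point isn't this particular checker; it's that the "data" pipeline deserves the same parse-it-before-you-ship-it gate as the code pipeline, because the agent treats both as input it must survive.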