I worked in the cyber security space for a decent chunk of my career, and the most frustrating part was cyber security engineers thinking their problems were unique and being completely unaware of the lessons software engineering teams have already learned.
Yes, you need to tune your canary deployment groups to be large and diverse enough to give a reliable indicator of deployment failure, while still keeping them small enough that they achieve their purpose of limiting blast radius.
Again, if you follow industry best practices for software deployment, this is already something that should be considered. This is a relatively solved problem -- this is not new.
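To make the "large and diverse enough, but still small" trade-off concrete, here is a rough sketch of stratified canary selection. Everything in it is an assumption for illustration: the `os` and `hardware` host fields, the `pick_canary_group` name, and the 5% default are hypothetical and are not taken from any vendor's actual process.

```python
import math
import random
from collections import defaultdict


def pick_canary_group(hosts, fraction=0.05, seed=0):
    """Pick a canary group that covers every (os, hardware) combination.

    Hypothetical sketch: each stratum contributes about `fraction` of its
    hosts, but never fewer than one, so rare configurations are still
    represented while the overall group stays small.
    """
    rng = random.Random(seed)

    # Group hosts by configuration so the sample is diverse, not just random.
    strata = defaultdict(list)
    for host in hosts:
        strata[(host["os"], host["hardware"])].append(host)

    canary = []
    for members in strata.values():
        k = max(1, math.ceil(len(members) * fraction))
        canary.extend(rng.sample(members, k))
    return canary


# Hypothetical fleet: 100 hosts spread evenly over 4 configurations.
fleet = [
    {"id": i, "os": ("win10", "win11")[i % 2], "hardware": ("a", "b")[(i // 2) % 2]}
    for i in range(100)
]
canary = pick_canary_group(fleet)
```

With this made-up fleet the canary group is 8 hosts out of 100, two from each configuration, so a failure specific to one configuration still has a chance to show up before the wider rollout.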
> I did post a question: what about other Cybersecurity vendors? Do you think they do canary deployment on their AV definitions?
I think that question is being asked right now by every company using Crowdstrike — what vendors are actually doing proper release engineering and how fast can we switch to them so that this never happens to us again?
> if you follow industry best practices for software deployment, this is already something that should be considered. This is a relatively solved problem -- this is not new.
You have to ask the customer if they're okay with that, citing "our software might fail and brick your machine".
I'd like to see any Sales and Marketing folks say that ;)
> I think that question is being asked right now by every company using Crowdstrike — what vendors are actually doing proper release engineering and how fast can we switch to them so that this never happens to us again?
Very valid question, and this BSOD incident might be a turning point that gets customers to pay more for their IT infrastructure.
It's like this: previously, cybersecurity vendors were shy about asking customers to set up canary systems because that's just "one-more-thing-to-do". After the BSOD, customers will smarten up and do it without being asked, to the point where they ask vendors to _support_ that type of deployment (unless they continue to be cheap and lazy).
> You have to ask the customer if they're okay with that, citing "our software might fail and brick your machine".
I think you’re still missing the point of Canary deployments. The question your sales team should ask is “would you like a 5% chance of a bug harming your system, or a 100% chance?”
> It's like this: previously, cybersecurity vendors were shy about asking customers to set up canary systems because that's just "one-more-thing-to-do"
You should be shy, because it is not your customer’s job to set up canary deployments. Crowdstrike owns the software and the deployment process. They should be deploying to a subset of machines, measuring the results, and deciding whether to roll forward or roll back. It is not the customer’s job to implement good release engineering controls for Crowdstrike (although after this debacle you may well see customers try).
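For anyone who has not seen the "deploy to a subset, measure, roll forward or roll back" loop written down, a minimal sketch follows. The `deploy`, `is_healthy`, and `rollback` callbacks, the phase percentages, and the 1% failure threshold are all hypothetical placeholders, not a description of how Crowdstrike or any other vendor ships updates.

```python
def phased_rollout(hosts, deploy, is_healthy, rollback,
                   phase_percents=(1, 5, 25, 100), max_failure_percent=1):
    """Deploy in growing phases and stop at the first unhealthy phase.

    Hypothetical sketch: each phase extends the rollout to a larger
    percentage of `hosts`; if too many hosts in the new batch are
    unhealthy, everything touched so far is rolled back.
    """
    done = []
    for percent in phase_percents:
        # Cumulative number of hosts that should be updated after this phase.
        target = max(1, len(hosts) * percent // 100)
        batch = hosts[len(done):target]
        for host in batch:
            deploy(host)
        done.extend(batch)

        failures = sum(1 for host in batch if not is_healthy(host))
        if batch and failures * 100 > max_failure_percent * len(batch):
            for host in done:
                rollback(host)
            return {"status": "rolled_back", "touched": len(done)}
    return {"status": "complete", "touched": len(done)}


# Hypothetical fleet of 1000 hosts and an update that breaks every host.
fleet = list(range(1000))
bad = phased_rollout(fleet, deploy=lambda h: None,
                     is_healthy=lambda h: False, rollback=lambda h: None)
good = phased_rollout(fleet, deploy=lambda h: None,
                      is_healthy=lambda h: True, rollback=lambda h: None)
```

In this toy run the broken update touches 10 of 1000 hosts before the rollout stops, while the healthy one reaches all 1000. Even if a real rollback cannot be done remotely (a boot loop, for example), stopping after the first phase is what limits the blast radius.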
If you're referring to canary deployment as the vendor's internal deployment, I definitely agree.
What I find hard to accept is people in software suggesting that you roll it out to a few customers first, because shipping virus definitions isn't a cloud deployment where you run an A/B test.
Customers must know what's going on when it comes to virus definitions, and what it implies for them, whether they're part of the rollout group or not.
> If you're referring to canary deployment as the vendor's internal deployment, I definitely agree.
No, I’m talking about external deployment to customers. They clearly also had a massive failure in their internal processes too, since a bug this egregious should never make it to the release stage. But that is not what I am talking about right now.
> What I find hard to accept is people in software suggesting that you roll it out to a few customers first, because shipping virus definitions isn't a cloud deployment where you run an A/B test.
I don’t care what you’re releasing to customers (application binary, configuration change, virus definition, etc.): if it has the chance of doing this much damage, it must be deployed in a controlled, phased way. You cannot one-shot deploy to 100% of machines any change that has the potential to boot-loop a massive number of systems like this. The current process is unacceptable.
> Customers must know what's going on when it comes to virus definitions, and what it implies for them, whether they're part of the rollout group or not.
Who says they don’t have to know? Telling your customers that an update is planned and giving them a time window for their update seems reasonable to me.