So CrowdStrike is deployed as third party software into the critical path of mission critical systems and then left to update itself. It's easy to blame CrowdStrike but that seems too easy on both the orgs that do this and the upstream forces that compel them to do it.
My org which does mission critical healthcare just deployed ZScaler on every computer which is now in the critical path of every computer starting up and then in the critical path of every network connection the computer makes. The risk of ZScaler being a central point of failure is not considered. But - the risk of failing the compliance checkbox it satisfies is paramount.
All over the place I'm seeing checkbox compliance being prioritised above actual real risks from how the compliance is implemented. Orgs are doing this because they are more scared of failing an audit than they are of the consequences of failure of the underlying systems the audits are supposed to be protecting. So we need to hold regulatory bodies accountable as well - when they frame regulation such that organisations are cornered into this they get to be part of the culpability here too.
> The risk of ZScaler being a central point of failure is not considered. But - the risk of failing the compliance checkbox it satisfies is paramount.
You're conflating Risk and Impact, and you're not considering the target of that Risk and that Impact.
Failing an audit:
1. Risk: high (audits happen all the time)
2. Impact to business: minimal (audits are failed all the time and then rectified)
3. Impact to manager: high (manager gets dinged for a failing audit).
Compare with failing an actual threat/intrusion:
1. Risk: low (so few companies get hacked)
2. Impact to business: extremely high
3. Impact to manager: minimal, if audits were all passed.
Now, with that perspective, how do you expect a rational person to behave?
[EDIT: as some replies pointed out, I stupidly wrote "Risk" instead of "Odds" (or "Chance"). Risk is, of course, the expected value, which is probability X impact. My post would make a lot more sense if you mentally replace "Risk" with "probability".]
Moreover, no manager gets dinged for "internet-wide" outages unfortunately, so the compliance department keeps calling the shots. The number of times I've had to explain there's no added security in adding an "antivirus" to our Linux servers, as we already have proper monitoring at the eBPF level, is annoying.
I'd be fired if I caused enough loss in revenue to pay my own salary for a year.
I am responsible for my choices. I'm CTO, I don't doubt that in some cases execs cover for each other, but at least I have anecdotal experience of what it would take for me to be fired- and this is clearly communicated to me.
Hope you get paid a lot! Otherwise you are either in a very young or very stupid job.
I regularly spend multiples of my salary every month on various commitments my company makes, any small mistake could easily mean that its multiples of my salary type of problem within 10 days.
A friend of mine spent half a million on a storage device that we never used. It sat in the IT area for years until we were acquired. Everyone gave him so much shit. Finance asked me about it numerous times (going around my friend the CTO) so they could properly depreciate it. He didn't get dinged by the board at all. It remained an open secret. We were making million dollar decisions once a month, though.
> I regularly spend multiples of my salary every month on various commitments my company makes.
Yeah, same here.
But if I choose a vendor and that vendor fails us so catastrophically as to make us financially insolvent, then it's my job to have run a risk analysis and to have an answer for why.
If it's more cost effective to take an outage, that's fine, if it's not: then why didn't I have a DRP in place, why did we rely so much on one vendor, what's the exposure.
It's a pretty important part of being a serious business person.
Sure, but that's not what I said or you said, and my commentary was about relative measures of your salary to your budget.
If you can't make a mistake of your salary size in your budget then your budget is small or very tight; most corporations fuck up big multiples of their CTO's salary quarterly (but that turns out to be single-digit percentage points of anything useful).
> I'd be fired if I caused enough loss in revenue to pay my own salary for a year.
I'm not so sure.
I know of a major company that had a glitch, multiple times, that caused them to lose ~15 million dollars at least once (a non-prod test hit prod because of a poorly designed tool).
I was told the decision-makers decided not to fix the problem (the risk of losing more money again) because the "money had already been lost."
"no manager gets dinged for "internet-wide" outages"
Kind of like, nobody gets fired for hiring IBM, or using SAP. They are just so big, every manager can say, "look how many people are using them, how was I supposed to know they are crap".
But, it seems like for uptime, someone should be identifiable. If your job is uptime, and there is a worldwide outage, I'd think it would roll downhill onto someone.
> Kind of like, nobody gets fired for hiring IBM, or using SAP. They are just so big, every manager can say, "look how many people are using them, how was I supposed to know they are crap".
I wouldn't necessarily say IBM or SAP are "crap". It's much more likely that orgs buying into IBM or SAP don't do the due diligence on what the true costs are to properly set it up and keep it running, and therefore cut tons of corners.
They basically want to own a Ferrari, and when it comes to maintenance, they want to run regular gas and try to get their local mechanic to slap Ford parts on it because it's too expensive to keep going back to the dealership.
The thing is usually this argument goes something like this:
A: Should prod be running a failover / <insert other safety mechanism>?
B: Yes!
A: This is how much it costs: <number>
B: Errm... Let me check... OK I got an answer, let's document how we'd do it, but we can't afford the overhead of an auto-failover setup.
And so then there will be 2 types of companies, the ones that "do it properly" will have more costs, their margins will be lower, over time they'll be less successful as long as no big incident happens. When a big incident happens though, for most businesses - recent history proves that if everyone was down, nobody really complains. If your customers have 1 vendor down due to this issue, they will complain, but if your customers have 10 vendors down, and are themselves down, they don't complain anymore. And so you get this tragedy of the commons type dynamic where it pays off to do what most people do rather than the right thing.
And the thing is, in practice, doing the thing most people do is probably not a bad yardstick - however disappointing that is. 20 years ago nobody had 2FA and it was acceptable, today most sites do and it's not acceptable anymore not to have it.
Parents may teach this to kids but the kids usually notice their parents don't practice what they preach. So they don't either.
The world is filled with people following everybody else off a cliff. If you're warning people or even just not playing along in a time of great hysteria, people at best ignore your warnings and direct verbal abuse at you. At worst, you can face active persecution for being right when the crowd has gone insane. So most people are cowards who go along to get along.
I think the parent was correct in the use of the word "Risk"; it's different than your definition, which appears to be closer to "likelihood".
Risk is a combination of likelihood and impact. If "risk" were just equivalent to "likelihood" then leaving without an umbrella on a cloudy day would be a "high-risk situation".
A rational person needs to weigh both the likelihood and impact of a threat in order to properly evaluate its risk. In many cases, the impact is high enough that even a low likelihood needs to be addressed.
ZScaler and similar software also has some hidden costs: Performance and all the other fun that comes with a proxy between you and the server you connect to.
> What I'm saying is that the business's interests are not aligned with the people comprising that business.
Yep, that's the point of capitalism.
> In that regard, what "the business" wants is irrelevant.
And yet here we are. Companies get fined left and right for breaching rules but it's ok because it earned them money. There are literal plans made to calculate whether it's profitable to cheat or not. In the current system, what the business wants always wins over individual qualms, unfortunately.
Because the punitive system in most countries doesn't affect individuals. As a manager, you're not going to jail for breaking environmental laws; a different entity (the company) is paying for being caught. So, it's still the rational thing to do to break the environmental laws to make your group's numbers go up and get a promo or bonus.
Almost correct, but you mean 'chance' where you write 'risk':
Risk = Chance × Impact
The chance of failing an audit initially is high (or medium, present at least). The impact is usually low-ish. It means a bunch of people need to fix policy and set out improvement plans in a rush. It won't cost you your certification if the rectification is handled properly.
It's actually possible that both of your examples are awarded the same level of risk, but in practice the latter example will have its chance minimized to make the risk look acceptable.
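To make the Risk = Chance × Impact framing concrete, here's a toy calculation (every number is invented purely for illustration) showing how the expected loss can point one way for the business and the other way for the individual manager:

    # Toy expected-value comparison; every number here is invented for illustration.
    def expected_loss(chance: float, impact: float) -> float:
        """Risk as expected value: chance of the event times its impact."""
        return chance * impact

    # To the business (impact in dollars):
    failed_audit_biz = expected_loss(0.5, 50_000)       # frequent, cheap to rectify -> 25,000
    major_breach_biz = expected_loss(0.02, 10_000_000)  # rare, catastrophic         -> 200,000

    # To the individual manager (impact in arbitrary "career damage" units):
    failed_audit_mgr = expected_loss(0.5, 10)   # -> 5.0   (gets dinged for the audit)
    major_breach_mgr = expected_loss(0.02, 1)   # -> 0.02  (covered: all boxes were ticked)

    print(failed_audit_biz, major_breach_biz)  # the breach dominates for the business
    print(failed_audit_mgr, major_breach_mgr)  # the audit dominates for the manager

The misalignment in the thread above falls straight out of the numbers: the business-level expected loss is dominated by the breach, while the manager-level expected loss is dominated by the audit.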
> Now, with that perspective, how do you expect a rational person to behave?
They'd deploy the software on the critical path. That's exactly GP's point, isn't it? That's why GP explicitly wants us to shift some of the blame from the business to the regulators. GP advocates for different regulatory incentives so that a rational person would then do the right thing instead of the wrong thing.
At the risk of sounding like Chicken Little: the reality is companies are getting popped all the time - you just don’t hear about them very often. The bar for media reporting is constantly being raised to the point where you only hear about the really big ones.
If you read through any of the weekly Risky Biz News posts [1] you’ll often see five or more highly impactful incidents affecting government and industry, and they’re just the reported ones.
I wonder how much that's still true now that ransomware has apparently become viable.
Finding an insecure target, setting up the data hostage situation, and having the victim come to pay is scalable and could work in volume. If getting small money from a range of small targets becomes profitable, small fish will bear similar risks to juicier targets.
But...surely you're also missing another point of consideration:
Single point of failure fails, taking down all your systems for an indeterminate length of time:
1. Risk: moderate (an auto-updating piece of software without adequate checks? yeah, that's gonna fail sooner or later)
2. Impact to business: high
3. Impact to manager: varies (depending on just how easy it is to spin the decision to go with a single point of failure rather than a more robust solution to the compliance mandate)
> 3. Impact to manager: minimal, if audits were all passed.
I don't know about you, but I'll be making sure everyone knows that the manager signed off on the spectacularly stupid idea to push through an update on a Friday without testing.
Of course, disabling those auto updates will have you fail the external security audit and now your security team needs to fight with the rest of the leadership in the company explaining why you're generating needless delays, costs against the "state of the art in security industry" and why your security guys are smarter than the people who have the power to approve or deny your security certification.
I've taken part in some security audits where I work. They're not a joke only because they're a tragic story of incompetence, hubris, and rubberstamping. They 100% focus on checking boxes and cargo-culting, while leaving enormous vulnerabilities wide open.
What I don't understand is why they don't have a canary update process. Server side deployments do this all the time. You would think Windows would offer that to their institutional customers, for all types of updates including (especially) 3rd party.
This isn't a Windows update (which absolutely does let you do blue/green deployments via SUS), but rather a Crowdstrike update, which also lets you stage rollouts, and I expect several administrators are finding out why that is important.
I know about update policies, but afaik those are about the “agent” version. Today’s update doesn’t look like an agent version. The version my box is running was released something like a week ago.
Is there some possibility to stage rollouts of the other stuff it seems to download?
Kind of a big thing most people don't understand about the various forms of "Business Insurance": for the most part, businesses carry whatever insurance the work they do requires them to have. Those requirements are set by laws/regulations applied to those entities and the various entities they want to do business with.
At every small shop I've worked when the topic of Business Insurance came up with one of the owners, the response was extremely negative -- basically summarized as "it's the most you will ever pay for something you won't ever be able to use".
Yep, it’s pretty much a toll on doing business with entities. I’ve no doubt the intention is so your customer can sue you without you winding up, whether it actually works… no idea.
>> It's easy to blame CrowdStrike but that seems too easy on both the orgs that do this and the upstream forces that compel them to do it.
While orgs using auto update should reconsider, the fact that CrowdStrike don't test these updates on a small amount of live traffic (e.g. 1%) is a huge failure on their part. If they released to 1% of customers and waited even 24 hours before rolling out further this seems like it would have been caught and had minimal impact. You have to be pretty arrogant to just roll out updates to millions of customers devices in one fell swoop.
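For illustration, a minimal sketch of what a phased rollout with a bake period could look like. This is not CrowdStrike's actual pipeline; the stage sizes, bake times, deploy call, and telemetry check are all assumptions:

    # Hypothetical phased ("canary") rollout with a bake time between stages.
    # Not CrowdStrike's real pipeline; stages, timings, and checks are invented.
    import time

    ROLLOUT_STAGES = [
        (0.01, 24 * 3600),  # 1% of the fleet, then bake for 24 hours
        (0.10, 12 * 3600),  # 10% of the fleet, then bake for 12 hours
        (1.00, 0),          # finally, everyone
    ]

    def push_update(update_id: str, fraction: float) -> None:
        # Stub: tell the update service which share of hosts should receive the file.
        print(f"pushing {update_id} to {fraction:.0%} of hosts")

    def fleet_is_healthy() -> bool:
        # Stub: query crash/boot-loop telemetry from the hosts already updated.
        return True

    def rollout(update_id: str) -> None:
        for fraction, bake_seconds in ROLLOUT_STAGES:
            push_update(update_id, fraction)
            time.sleep(bake_seconds)
            if not fleet_is_healthy():
                print(f"halting and rolling back {update_id} at {fraction:.0%}")
                return

The point of the early stages is simply that a boot-looping 1% shows up in telemetry long before the other 99% ever receive the file.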
Why even test the updates on a small amount of live customers first? Wouldn't this issue already have surfaced if they tested the update on a handful of their own machines?
You are completely right. BTW, it wasn't a software update; it was a content update, a 'channel file'.
Someone didn't do enough testing. edit: or any testing at all?
It's an automatic update of the product. The "channel vs. binary" semantics don't change anything. If your software's definition files can cause a kernel-mode driver to crash in a boot loop you have bigger problems, but the outcome is the same as if the driver itself had been updated.
Indeed. It's worse, really: it means there was a bug lurking in their product that was waiting for a badly formatted file to surface it.
Given how widespread the problem is it also means they are pushing these files out without basic testing.
edit: It will be very interesting to see how CrowdStrike wriggle out of the obvious conclusion that their company no longer deserves to exist after a f*k up like this.
That's funny, because IIRC McAfee back in the Windows XP days did this exact same thing! They added a system file to the signature registry and caused Windows computers to BSOD on boot.
That's even worse: they should be fuzz testing with bad definitions files to make sure this is safe. Inevitably the definitions updates will be rushed out to address zero-days, and the work should be done ahead of time to make them safe.
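As a sketch of what that ahead-of-time work could look like: feed the parser random and truncated blobs and require that the only acceptable failure mode is a controlled rejection. The parse_channel_file function below is an invented stand-in, not CrowdStrike's actual format or API:

    # Minimal fuzzing sketch. `parse_channel_file` is an invented stand-in parser;
    # the point is the harness around it.
    import os
    import random

    class MalformedDefinitions(Exception):
        pass

    def parse_channel_file(data: bytes) -> dict:
        # Stand-in: reject anything without a plausible header.
        if len(data) < 8 or data[:4] != b"CHNL":
            raise MalformedDefinitions("bad header")
        return {"entries": data[8:]}

    def fuzz(iterations: int = 10_000) -> None:
        for _ in range(iterations):
            blob = os.urandom(random.randint(0, 4096))
            try:
                parse_channel_file(blob)
            except MalformedDefinitions:
                pass  # a controlled rejection is the desired behaviour
            # Any other exception here would be a finding to fix long before
            # the parser ever runs with kernel-level consequences.

    if __name__ == "__main__":
        fuzz()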
Having spent time reverse-engineering Crowdstrike Falcon, a lot of funny things can happen if you feed it bad input.
But I suspect they don't have much motivation to make the sensor resilient to fuzzing, since the thing's a remote shell anyways, so they must think that all inputs are absolutely trusted (i.e. if any malicious packet can reach the sensor, your attackers can just politely ask to run arbitrary commands, so might as well assume the sensor will never see bad data..)
This is something funny to say when the inputs contain malware signatures, which are essentially determined by the malware itself.
I mean, how hard would it be to craft a malware that has the same signature as an important system file? Preferably one that doesn't cause immediate havoc when quarantined, just a BSOD after reboot, so it slips through QA.
Even if the signature is not completely predictable, the bad guys can try as often as they want and there would not even be way to detect these attempts.
> malware signatures, which are essentially determined by the malware itself.
No they're not. The tool vendor decides the signature, they pick something characteristic that the malware has and other things don't, that's the whole point.
> how hard would it be to craft a malware that has the same signature as an important system file?
Completely impossible, unless you mean, like, bribe one of the employees to put the signature of a system file instead of your malware or something.
Sure, but they do it following a certain process. It's not that CrowdStrike employees get paid to be extra creative in their job, so you likely could predict what they choose to include in the signature.
In addition to that, you have no pressure to get it right the first time. You can try as often as you want and analyzing the updated signatures you even get some feedback about your attempts.
Like, «We require that your employees open only links on a whitelist, and social networks cannot be put on this list, and we require a managed antivirus/firewall solution, but we are OK with this solution having a backdoor directly to a 3rd-party organization»?
It is crazy. All these PCI DSS and SOC 2 frameworks look like a comedy if they allow such things.
At a former employer of about 15K employees, two tools come to mind that allowed us to do this on every Windows host on our network[0].
It's an absolute necessity: you can manage Windows updates and a limited set of other updates via things like WSUS. Back when I was at this employer, Adobe Flash and Java plug-in attacks were our largest source of infection. The only way to reliably get those updates installed was to configure everything to run the installer if an old version was detected, and then find some other ways to get it to run.
To do this, we'd often resort to scripts/custom apps just to detect the installation correctly. Too often a machine would be vulnerable but something would keep it from showing up on various tools that limit checks to "Add/Remove Programs" entries or other mechanisms that might let a browser plug-in slip through, so we'd resort various methods all the way down to "inspecting the drive directory-by-directory" to find offending libraries.
We used a similar capability all the way back in the NIMDA days to deploy an in-house removal tool[1]
[0] Symantec Endpoint Protection and System Center Configuration Manager
[1] I worked at a large telecom at that time -- our IPS devices crashed our monitoring tool when the malware that immediately followed NIMDA landed. The result was a coworker and I dissecting/containing it and providing the findings to Trend Micro (our A/V vendor at the time) maybe 30 minutes before the news started breaking and several hours before they had anything that could detect it on their end.
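A rough sketch of that last-resort "inspect the drive directory-by-directory" approach; the file names here are illustrative, not the actual indicators anyone used:

    # Walk the filesystem for library files that inventory tools missed because
    # they only looked at Add/Remove Programs. File names are illustrative only.
    import os

    SUSPECT_NAMES = {"npswf32.dll", "java.dll"}  # e.g. stray Flash / Java plug-in binaries

    def find_suspect_files(root: str = "C:\\"):
        hits = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if name.lower() in SUSPECT_NAMES:
                    hits.append(os.path.join(dirpath, name))
        return hits

    if __name__ == "__main__":
        for path in find_suspect_files():
            print(path)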
Hilariously, my last employer was switching to Crowdstrike a few months ago when my contract ended. We previously used Trellis which did not have any remote control features beyond network isolation and pulling filesystem images. During the Crowdstrike onboarding, they definitely showed us a demo of basically a virtual terminal that you could access from the Falcon portal, kind of like the GCP or AWS web console terminals you can use if SSH isn't working.
As I understand, this only manifests after a reboot and if the 'content update' is tested at all it is probably in a VM that just gets thrown away after the test and is never rebooted.
Also, this makes me think:
How hard would it be to craft a malware that has the same signature as an important system file?
Preferably one that doesn't cause immediate havoc when quarantined, just a BSOD after reboot, so it slips through QA.
I don't believe this is what's happened, but I think it is an interesting threat.
Nope, not after a reboot. Once the "channel update" is loaded into Falcon, the machine will crash with a BSOD and then it will not boot properly until you remove the defective file.
> How hard would it be to craft a malware that has the same signature as an important system file?
Very, otherwise digital signatures wouldn’t be much use. There are no publicly known ways to make an input which hashes to the same value as another known input through the SHA256 hash algorithm any quicker than brute-force trial and error of every possibility.
This is the difficulty that BitCoin mining is based on - the work that all the GPUs were doing, the reason for the massive global energy use people complain about is basically a global brute-force through the SHA256 input space.
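A quick sketch of why "craft a file with the same hash" is infeasible against SHA-256: you would be looking for a second preimage of one specific 256-bit value, so each random attempt succeeds with probability about 2^-256:

    # Second-preimage intuition: matching a specific SHA-256 digest by trial and
    # error has success probability ~2**-256 per attempt.
    import hashlib
    import os

    target = hashlib.sha256(b"contents of some important system file").hexdigest()

    def attempt() -> bool:
        candidate = os.urandom(64)  # pretend this is our crafted "malware"
        return hashlib.sha256(candidate).hexdigest() == target

    # Even at billions of attempts per second, the expected number of tries
    # (~2**256) dwarfs anything physically feasible.
    print(any(attempt() for _ in range(1_000_000)))  # all but certain to print False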
I was talking about malware signatures, which don't necessarily use cryptographic hashes. They are probably more optimized for speed because the engine needs to check a huge number of files as fast as possible.
Cryptographic hashes are not the fastest possible hash, but they are not slow; CPUs have hardware SHA acceleration: https://www.intel.com/content/www/us/en/developer/articles/t... - compared to the likes of a password hash where you want to do a lot of rounds and make checking slow, as a defense against bruteforcing.
That sounds even harder; Windows Authenticode uses SHA1 or SHA256 on partial file bytes, the AV will use its own hash likely on the full file bytes, and you need a malware which matches both - so the AV will think it's legit and Windows will think it's legit.
AFAIK important system files on Windows are (or should be) cryptographically signed by Microsoft. And the presence of such signature is one of the parameters fed to the heuristics engine of the AV software.
> How hard would it be to craft a malware that has the same signature as an important system file?
If you can craft malware that is digitally signed with the same keys as Microsoft's system files, we got way bigger problems.
>How hard would it be to craft a malware that has the same signature as an important system file?
Extremely, if it were easy that means basically all cryptography commonly in use today is broken, the entire Public Key Infrastructure is borderline useless and there's no point in code signing anymore.
Admittedly, I don't know exactly what's in these files. When I hear 'content' I think 'config'. This is going to be very hypothetical, I ask for some patience. Not arguments.
The 'config file' parser is so unsafe that... not only will the thing consuming it break, but it'll take down the environment around it.
Sure, this isn't completely fair. It's working in kernel space so one misstep can be dire. Again, testing.
I think it's a reasonable assumption/request that something try to degrade itself, not the systems around it
edit: When a distinction between 'config' and 'agent' releases is made, it's typically with the understanding that content releases move much faster/flow freely. The releases around the software itself tend to be more controlled, being what is actually executed.
In short, the risk modeling and such doesn't line up. The content updates get certain privileges under certain (apparently mistaken) robustness assumptions. Too much credit, or attention, is given to the Agent!
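A sketch of the "degrade yourself, not the systems around you" idea: validate a new content file, and if it doesn't validate, keep running on the last known-good one rather than letting the failure propagate. The file names, format, and validation here are all invented:

    # Fall back to the last known-good content file instead of crashing on a bad
    # update. Paths, format, and validation are invented for illustration.
    import json
    import shutil

    ACTIVE = "channel_active.json"
    LAST_GOOD = "channel_last_good.json"

    def load_content(path: str) -> dict:
        with open(path) as f:
            content = json.load(f)
        if "signatures" not in content:  # minimal structural validation
            raise ValueError("missing signatures section")
        return content

    def apply_update(new_path: str) -> dict:
        try:
            content = load_content(new_path)
        except Exception:
            # Reject the update and carry on with what already worked.
            return load_content(LAST_GOOD)
        shutil.copy(new_path, LAST_GOOD)  # promote only after it validated
        shutil.copy(new_path, ACTIVE)
        return content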
"All over the place I'm seeing checkbox compliance being prioritised above actual real risks from how the compliance is implemented."
Great statement and one that needs to be seriously considered - would DORA regulation in the EU address this, I wonder? It's a monster piece of tech legislation that SHOULD target this, but WILL it? Someone should use today's disaster and apply it to the regs to see if it's fit for purpose.
Emphatically NO. Involved in (IT) Risk and DORA in a firm that actually does IT risk scenario planning (sort of the opposite of checkbox compliance). DORA is rubber-stamping all the way round. One caveat is that we are way ahead of DORA, so treating DORA as a checkbox exercise might be situational. But I haven't noticed a place where the rubber hits the road regulatory-wise. It's too easy to stay in checkbox compliance if the board doesn't see IT risk as a major concern. I'm happy one of our board members does. We've gone so far as to introduce a person- and paper-based credit line, so we can continue an outgoing cashflow if most of our processes fail (for an insurer).
Well, yeah. If a regulation is broken and not achieving its goal it should be changed. What's the alternative? "Regulation? We tried that once and it didn't work perfectly, so now we let The Market™ sort out safety standards."
Who needs regulation when you can have free Fentanyl with your CrowdStrike subscription! All of your systems will go down, but you won't care, and the chance of accidental overdose is probably less than 10%!
Yes, in many contexts that may well be the correct conclusion. Your comment presumes that regulation here has proven itself useful and not resulted in a single point of failure which potentially reduces overall safety. It’s of course the correct comment from a regulator’s perspective.
For the market to work, wouldn't you need something to hold the corps accountable if they fail to be secure AND to make regular people whole if the corps' failures cause them problems?
Especially for something like technology and infosec which rapidly changes, it’s silly to look to slow moving regulations as a solution, not to mention ignoring history and gambling politicians will do it competently and it won’t have negative side effects like distracting teams from doing real work that’d actually help.
You can make fines and consequences after the fact for blatant security failures as incentives but inventing a new “compliance” checklist of requirements is going to be out of date by the time it’s widely adopted and most companies do the bare minimum bullshit to pass these checklists.
There are so many English-centric assumptions here.
Regulation of liability can be very generic and broad, with open standards that don't need to be updated.
Case in point: most of continental Europe still uses Napoleon's Code civil to prescribe how and when private parties are liable. This is more than 200 years old.
The real issue is that most Americans are stuck with an old English regulatory system, which for fear of overreach was never modernized.
This can be true of security (and every other expense) whether it's regulated or not. Which do you think will result in fewer incidents: the regulated bare minimum, or the unregulated base minimum?
> So we need to hold regulatory bodies accountable as well - when they frame regulation such that organisations are cornered into this they get to be part of the culpability here too.
No, we need to hold Architects accountable, and this is the core of the issue. Creating systems with single, outsourced responsibility, in the critical path.
This is the point of much of the security efforts we see now.
Outsourcing of security functions, and things like login push a lot of liability and legal issues off into someone else's house.
It's hard to be the source of a password leak, or be compromised, when you don't control the passwords. But like any chain, you're only as secure as your weakest link... Snowflake is a great current example of this. Meanwhile the USPS just told us "oops", we had tracking pixels for a bunch of vendors all over our delivery preview tool.
Candidly, most people's stacks look a lot less like software and more like a toolbar-riddled IE5 install circa 2000. I don't think our industry is in a good place.
This is one of the interesting aspects in Ethereum.
If your validator is down, you lose a small amount of stake, but if a large percentage of the total set of validators are down, you all start being heavily penalized.
This incentivizes people running validators not to use the most popular Ethereum client, to avoid using a single compute provider, and, overall, to avoid relying on the popular choice, since doing so can cause them to lose the majority of their stake.
There hasn't been a major Ethereum consensus outage, but when that happens, the impact of being lazy and following the herd will be huge.
How is it lazy and herd-like to _not_ run the latest and greatest? Sounds like Ethereum's design is promoting a robustly diverse ecosystem rather than a monoculture.
> How is it lazy and herd-like to _not_ run the latest and greatest?
I'm not sure what you're asking here. Ethereum incentives don't make you run the latest version of your client's software (unless there's a hardfork you need to support). You can run any version that follows the network consensus rules.
The incentives are there to punish people who use the most common software. For example, let's say there are around 5 consensus clients which are each developed by independent teams. If everyone ran the same client, a bug could take down the entire network. If each of those 5 clients were used to run 20% of the network, then a bug in any one of them wouldn't be a problem for Ethereum users and the network would keep running.
If the network is evenly split across those 5 clients but all of them are running in AWS, then that still leaves AWS as a single point of failure.
The incentives baked into the consensus protocol exist to push people towards using a validator client that isn't used by the majority of other validators. That same logic applies to other things like physical host locations, 3rd party hosting providers, network providers, operating systems, etc... You never want to use the same dependencies as the majority of other validators. If you do and a wide-spread issue happens, you're setting yourself up to lose a lot of money.
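A toy model of that correlation penalty, purely to illustrate the shape of the incentive; Ethereum's real inactivity-leak math is more involved than this, and the numbers below are made up:

    # Toy correlation penalty: being offline costs little on its own, but much
    # more when a large fraction of the network is offline with you.
    # Not Ethereum's actual formula; numbers are illustrative.
    def penalty(stake: float, fraction_offline: float) -> float:
        base = 0.001 * stake                    # small fixed cost for being down
        correlated = stake * fraction_offline   # grows with how many are down with you
        return base + correlated

    print(penalty(32.0, 0.01))  # down with 1% of the network  -> ~0.35
    print(penalty(32.0, 0.40))  # down with 40% of the network -> ~12.83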
It sounds like you're describing the advantages of diversity, with a little game theory thrown in to sweeten the deal. Still not sure how that can be described as lazy, or did I completely mis-read the original phrasing?
I find that in today's world it is no longer about one person being "accountable". There is always an interplay of factors; like others have pointed out, cyber security has a compliance angle. Other times it is a cost factor: redundancy costs money. Then there is the whole revolving door of employees coming and going, so institutional knowledge about why a decision was made is lost with them.
That is hard to do for even a small company. How do you balance all that out for critical infrastructure at a much larger scale?
The problem is that even knowing that this likely to happen many companies would still put CrowdStrike into a critical system for the sake of security compliance / audit. And it's not even prioritization of security over reliability because incentives are to care more about check-boxes in the audit report than about the actual security. Looks like almost no party in this tragic incident had a strong incentive to prevent it so it's likely to happen again.
Can anyone explain how CrowdStrike could possibly fix this now? If affected machines are stuck in an endless BSOD cycle, is it even possible to remotely roll out a fix? My understanding is that the machines will never come to the point where a CS update would be automatically installed. Is the only feasible option the official workaround of manually deleting system files after booting into the recovery environment? How could this possibly be done on scale in organizations with tens of thousands of machines?
There are orgs out there right now with 50,000+ systems in a reboot loop. Each one needs to be manually configured to disable CS via safe mode so that the agent can be updated to the fixed version. Throw BitLocker into the mix, which makes this process even longer, and we're talking about weeks of work to recover all systems.
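For reference, the circulated workaround boiled down to removing the offending channel file on each machine from a recovery environment (after dealing with BitLocker keys first). A rough sketch of just that single step, as an illustration only; defer to CrowdStrike's official guidance, and note the path/pattern here is the one that was circulated at the time:

    # Sketch of the file-deletion step from the circulated workaround: remove the
    # offending channel file so the sensor stops crashing at boot. In practice
    # this was done from safe mode / WinRE per machine; treat this purely as an
    # illustration and defer to CrowdStrike's official guidance.
    import glob
    import os

    PATTERN = r"C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys"

    def remove_bad_channel_files() -> None:
        for path in glob.glob(PATTERN):
            print(f"removing {path}")
            os.remove(path)

    if __name__ == "__main__":
        remove_bad_channel_files()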
CrowdStrike itself will not fix anything. They published a guide on how to work around the problem and that's it. Most likely a lot of sales reps and VPs will be fielding calls all over the weekend explaining to large customers how they managed to screw up and how much of a discount they will offer on the next renewal cycle.
Legally, I think somewhere in their license it says that they're not responsible in any way or form if their software malfunctions in any way.
Like if I kill someone of course I go to jail. But if I get some people together, say we're a company, and then kill 100 people, nobody goes to jail. How does that work? What a huge loophole.
Philips (the company) basically killed people with malfunctioning CPAP machines (which are meant to help against sleep apnea) and no one went to jail. So that's a practical example.
It's already the norm for devs to not be responsible for software malfunctions. They can choose to end their relationship with you, but they can't sue you for damages.
Yep, I've been involved in many vendor contracts at my company and the contracts take weeks to months to finalize because every aspect of the agreement is up for discussion. Even things like SLAs (including how they're calculated), liability limitations, indemnity, and recourse in the event of system failure are all put through the wringer until both sides come to agreeable terms. This is true for big and tiny vendors.
This isn't a Github project with a MIT license. When you do B2B software, there aren't software licenses, there are contractual terms and conditions. The T&Cs outline any number of elements but including SLAs, financial penalties for contractual breaches, etc. Larger customers negotiate these T&Cs line by line. Smaller customers often accept the standard T&Cs.
Penalties, as far as I was involved in vendor discussions, are a part of the negotiation only when the software provider does any work on the client's premises and are liable to that extent.
For software, you don't pay penalties that it might malfunction once in a while, that's what bug-fixes are for and you get offered an SLA for that, but only for response time, not actual bug fixing. Where you do get penalties and maybe even your money back, is when the software is listed as being able to do X,Y,Z and it only does X and Z and the contract says it must do everything it said it does.
Well, probably no?
I've never seen liabilities in dollar value, or rather any significant value. Also, I saw our company's Crowdstrike contract for 10k+ seats; no liabilities there.
Sounds like people in some of these environments will be doing their level best to automate an appropriate fix.
Hopefully they have IPMI and remote booting of some form available for the majority of the affected boxes/VMs, as that could likely fix a large chunk of the problem.
Imagine if North Korea came out with a statement that they did it. It would spawn such an amount of work internally at CS to prove whether it was intentional or a simple mistake.
I work for government organization that is constantly audited and I've seen this play out over and over.
An important aspect I never see mentioned is most Cyber Security personnel don't have the technical experience to truly understand the systems they are assessing, they are, like you said, just pushing to check those compliance boxes.
I say this as someone who is currently in a Cyber Security role, unfortunately, as I'm coming to learn cyber roles suck. But this isn't a jab at those Cyber Security personnel's intelligence. It's literally impossible to understand multiple systems at a deep level, it takes employees working on those systems weeks to months to understand this stuff, and that's with them being in the loop. Cyber is always on the outside looking in, trying like hell to piece it all together.
Sorry for the rant. I just wanted to add on with my personal opinions on the cyber security framework being severely broken because I deal with it on a daily basis.
> It's literally impossible to understand multiple systems at a deep level
No, it's not. It takes above average intelligence, and major investment in actual education (not just "training"), and actual depth of experience, but it's not impossible.
Do you think it comes from a fundamental misconception of how these roles should be structured? My take is that you just can't fundamentally assess technical elements from the outside unless they have been designed that way in the first place (for assessability). For example, I educate my team that they have to structure their git commits in a way that demonstrates their safety for audit / compliance purposes (never ever combine a high-risk change with a low-risk one, for example). That should go all the way up the chain. Failure to produce an auditable output is failure to produce an output that can be deployed.
I know of an important company currently pushing to implement a redundant network data loss prevention solution, while they don't have persistent VPN enabled and multiple known misconfigurations of things that prevent web decryption working properly.
The flip side is: if you don't do auto-updates, and an exploit you would have been protected against had it auto-updated is published and used against you before you've tested/pushed the patch, you are up the creek without a paddle in that situation as well.
To some degree you have to trust the software you are using not to mess things up.
So since I do mission critical healthcare I do run into this concept. But it's not as unresolvable as you portray. Consider for example the HIPAA "break the glass" requirement. It says that whatever else you implement in terms of security you must implement a bypass that can be activated by routinely non-authorised staff to access health information if someone's life is in danger.
Similarly, when I questioned, "why can't users turn off ZScaler in an emergency" we were told that it wouldn't be compliant. But it's completely implementable at a technical level (Zscaler even supports this). You give users a code to use in an emergency and they can activate it and it will be logged and reviewed after use. But the org is too scared of compliance failure to let users do it.
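A sketch of what that "emergency code, logged and reviewed afterwards" flow can look like. The names and storage here are invented; this only illustrates the log-and-review-later pattern, not any particular vendor's implementation:

    # Break-the-glass pattern: never block the emergency action up front, but log
    # every use for after-the-fact review. Names and storage are invented.
    import time

    AUDIT_LOG = []

    def disable_inspection(user: str, duration_s: int) -> None:
        # Stub: in a real deployment this would toggle the proxy/agent policy
        # for this user's device for a limited time window.
        print(f"traffic inspection bypassed for {user} for {duration_s}s")

    def break_glass(user: str, reason: str, duration_s: int = 3600) -> dict:
        grant = {
            "user": user,
            "reason": reason,
            "granted_at": time.time(),
            "expires_at": time.time() + duration_s,
        }
        AUDIT_LOG.append(grant)  # reviewed by security/compliance after the fact
        disable_inspection(user, duration_s)
        return grant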
Well, if the vault says you have COPD, and the devious bank robber is interested in your continued breathing, perhaps we can just review the footage after the fact.
This is one of those cases where you don't disable emergency systems to defend against rogue employees. If people abuse emergency procedures, you let the legal system sort it out.
> It says that whatever else you implement in terms of security you must implement a bypass that can be activated by routinely non-authorised staff to access health information if someone's life is in danger.
Huh.
I can see why this needs to exist, but hadn't thought of it before. Same deal as cryptography and law-enforcement backdoors.
> logged and reviewed after use
I was going to ask how this has protection from mis-use.
Seems good to me… but then I don't, not really, not deeply, not properly, feel medical privacy. To me, violation of that privacy is clearly rude, but how the bar raises from "rude" to "illegal" is a perceptual gap where, although I see the importance to others, I don't really feel it myself.
So it seems good enough to me, but am I right or is this an imagination failure on my part? Is that actually good enough?
I don't think cryptography in general can use that, unfortunately. A simple review process can be too slow for the damage in other cases.
This is an oversimplification. If we are talking about compliance with ISO 27001, you are supposed to do your own risk assessment and implement the necessary controls. The auditor will basically just check that you have done the risk assessment, and that you have implemented the controls you said yourself you need.
I'd say this has nothing to do with regulatory compliance at all. The real truth is that modern organizations are way too attached to cloud solutions. And this runs across all parts of the organization with SaaS and PaaS, whether it's email (imagine Google Workspace having a major issue), AWS, Azure, Okta…
I've had the discussions so many times and the answer is always: the risks don't matter because the future is cloud, even talking about self-hosting anything is naive, and honestly we need to evaluate your competence for even suggesting it.
(Also, the cloud would maybe not be this fragile if it wasn't for lock-in with different vendors. If you read the TOS it says, basically on all cloud services, that you are responsible for the backup – but getting your data out of the service is still a pain in the ass – if possible at all.)
> The real truth is that modern organizations are way too attached to cloud solutions.
I'm confused. This is a security product for your local machine. Not the cloud.
Unless you call software auto-update "the cloud", but that's not what people usually mean. The cloud isn't about downloading files, it's about running programs and storage remotely.
I mean, if CrowdStrike were running entirely on the cloud, it seems like the problem would be vastly easier to catch immediately and fix. Cloud engineers can roll back software versions a lot easier than millions of end users can figure out how to safe boot and follow a bunch of instructions.
Well, there has usually been the option to run a local proxy/cache for your updates so that you can properly test them inside your own organization before rolling them out to all your clients (precisely to avoid this kind of shit show). But doing that requires an internal team running it and actually testing all updates. But modern organizations don't want an IT department; they want to be "cloud first". So they rely on services that promise they can solve everything for them (until they don't).
Cloud is not just about where things are – it's also about the idea that you can outsource every single piece of responsibility to an intangible vendor somewhere on the other side of the globe – or "in the cloud".
> Cloud is not just about where things are – it's about the idea that you can outsource every single piece of responsibility to an intangible vendor somewhere in the cloud.
I've never heard of a definition of cloud like that.
Cloud is entirely about where things are.
Outsourcing responsibility to a vendor is totally orthogonal to the idea of the cloud. You can outsource responsibility in the cloud or not. You can also outsource responsibility on local machines or not.
And outsourcing responsibility has existed since long before the concept of the cloud was invented.
The product affected here is literally called "CrowdStrike Falcon® Cloud Security". Meraki, although they sell routers and switches, markets their products as a "cloud-based network platform". Jamf, although their product runs on endpoint devices, is marketed as "Jamf Cloud MDM". I think it's fair to say that cloud these days does not only mean storing data or running servers in the cloud, but also infrastructure that is in any way MANAGED in the cloud.
So to tie back to what I wrote earlier – none of these services has to have the management part in the cloud. They could just give you a piece of software to run on your own server. That would certainly distribute the risk, since as it stands it only takes someone hacking the vendor to go after all their customers, or, in this case, one faulty update to break every user's experience. And as far as I can see, it seems we are willing to take those risks because we think it's nice having someone else manage the infrastructure (and that was my main point in the first comment).
> My org which does mission critical healthcare just deployed ZScaler on every computer which is now in the critical path of every computer starting up
Hi fellow CVS employee. Are you enjoying your zscaler induced SSO outages every week that torpedo access to email and every internal application? Well now your VMs can bluescreen too. A few more vendor parasites and we'll be completely nonfunctional. Sit tight!
When we think "security" on HN we think about the people who escalate wiggling voltages at just the right time into a hypervisor shell on XBox, but I've had to recognize that my learned bias is not correct in the real world. In the real world, "computer security" is a profession full of hucksters that can't tell post-quantum from heap and whose daily work of telling people repeatedly to not click links in Outlook and filling out checklists made by people exactly like them has essentially no bearing on actual security of any sort.
It's driven by a lot of things. Part of it is driven by rising cyber liability insurance rates, for one. A lot of organizations would rather not pay for CrowdStrike, but the premiums for not having an "EDR/XDR/NGAV" solution can be astoundingly high at-scale.
Fundamentally there's a lot of factors in this ecosystem. It's really wild how incentives that seem unrelated end up with crazy "security" products or practices deployed.
> A lot of organizations would rather not pay for CrowdStrike, but the premiums for not having an "EDR/XDR/NGAV" solution can be astoundingly high at-scale.
Just like a lot of homeowners would rather not pay for ADT, but insurance requires a box-ticking “professionally-monitored fire alarm system.” Nevermind that I can dial 911 as well as the “professional” when I get the same notification as they do.
> In the real world, "computer security" is a profession full of hucksters
Always has been. The information security model is about analogizing digital systems as physical systems, and employing the analogues of those physical controls that date back hundreds of years on those digital systems. At no point, in my relatively long career, have I ever met anyone in Information Security who actually understands at depth anything about how to secure digital systems. I say this as someone who has spent a lot of my career trying to do information security correctly, but from the perspective of operations and software engineering, which is where it must start.
The entire information security model the world works with is tacking on security after the fact, thinking you need to build walls and a vault door to protect the room after the house has already been built, when in fact you need to build the house to be secure from the start, because attacks don't go through doors, attacks are airborne (I recognize the irony of my analogizing digital concepts to physical concepts surrounding security, but I do it for the benefit of any infosec people that may read my comment, so they can understand my point).
Because of this model, we have gone from buying "boxes" to buying "services", but it has never matured away from the box-checking exercise it's been since day one. In fact, many information security people have /no training or education/ in security, it's entirely in regulatory compliance.
I’ve met highly paid “security engineers” who talked about not really being into programming, or being okay with Python but finding everything else too complicated.
It shocks me that such a low level of technical competence is required.
> So CrowdStrike is deployed as third party software into the critical path of mission critical systems and then left to update itself.
TIL that the US government has pressured foreign nations to install a mystery blob in the kernel of machines that run critical software "for compliance".
If this wasn't a providential goof on the part of Crowdstrike -- the entire planet is now aware of this little known fact -- then some helpful soul in Crowdstrike has given us a heads-up.
Don't put all your eggs in one basket: I use multiple anti-virus products so that if one blows up, at least not all computers are affected. Looks like my old wisdom is still new wisdom.
Clarification: I mean that every computer has one anti-virus product, but not every computer has the same anti-virus product. I'm not installing multiple anti-virus products on the same computer.
You use multiple anti-virus products. Let's assume you use 3. Do you have multiple clusters of machines, each running their own AV product, so in case one has this problem the other two are unaffected?
How much overhead are we talking about here? Because if you're just using multiple AV software installed on one machine, 1) holy shit, the performance penalty, 2) you'd still be impacted by this, as CS would have taken it down.
They surely mean that all odd-numbered assets are running CrowdStrike and even-numbered ones are running SentinelOne (or similar; %3, %4, etc.). At least then you only lose half your estate.
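A sketch of that kind of split: assign each asset to one of several EDR products deterministically (here by hashing the hostname), so a bad update from any single vendor only hits a slice of the fleet. The vendor names and hostnames are just examples:

    # Deterministically spread hosts across EDR vendors so one vendor's bad update
    # only affects part of the fleet. Vendor names are examples only.
    import hashlib

    VENDORS = ["crowdstrike", "sentinelone", "defender"]

    def vendor_for(hostname: str) -> str:
        digest = int(hashlib.sha256(hostname.encode()).hexdigest(), 16)
        return VENDORS[digest % len(VENDORS)]

    for host in ["hr-laptop-017", "radiology-ws-03", "build-agent-42"]:
        print(host, "->", vendor_for(host))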
I have never seen a company that uses multiple AV products rolled out to user machines, ever. Sure, when you transition from one product to another, but across the whole company, at the same time? Never... I have also never seen a distribution of something like active directory servers based on antivirus software. I think these stories are purely academic, "why didn't you just..." tall tales.
Mine certainly does: our key Windows-based control systems use Windows Defender, and the corporate crap gets SentinelOne and Zscaler and whatever else has been bought on a whim.
I'd assumed that any essential company would be similar. OK if your purchasing systems for your hospital are down for a couple of days it's a pain. If you can't get x-rays it's a catastrophe.
If half your x-ray machines are down and half are up, then it's a pain, but you can prioritise.
But lots of companies like a single supplier. Ho hum.
Not the person you're replying to, but in any reasonable organization with automated software deployment it should be easy to pool machines into groups, so you can make sure that each department has at least one machine that uses a different anti-virus software.
Bonus, in case you do catch a malware, chances are higher that one of the three products you use will flag it.
So you have multiple AV products and you target those groups. You have those groups isolated on their own networks, right? With all the overhead that comes with strict firewall rules and transmission policies between various services on each one. With redundant services on each network... you've doubled or tripled your network device costs solely to isolate for anti virus software. So if only one thing finds the zero day network based virus, it won't propagate to the other networks that haven't been patched against this zero day thing.
How far down the rabbit hole do we want to go? If you assume many companies are doing this kind of thing, or even a double digit percentage of companies, I have bad news for you.
But, cost of maintenance aside, it wouldn't be that bad to deploy each half of the fleet with a distinct EDR.
This is actually implicitly in place for big companies that support BYOD. If half your fleet is on Windows, another 40% on macOS, and 10% on Linux, you need distinct EDR solutions, and a single issue can't affect your whole fleet at once.
I know a few people who have Zscaler deployed at work. It will routinely kick them off the internet, like multiple times a day. It has gotten to the point where they can sort of tell in advance that it's about to happen.
The theory so far is that it's related to their activities: working in DevOps, they will sometimes generate "suspicious" traffic patterns which then trigger some policy in Zscaler, but they're not actually sure.
ZScaler itself uses port 443 UDP, but blocks QUIC. The last time I checked it didn't support IPv6 so they told customers to disable IPv6. Security software is legacy software out of the box and cuts the performance of computers in half.
> more scared of failing an audit than they are of the consequences of failure of the underlying systems the audits are supposed to be protecting.
Duh, else there would be no need to audit them to force compliance, they'd just do it by themselves. The only reason it needs forcing is that they otherwise aren't motivated enough.
> Good point. But the audit seems useless now. It's supposed to prevent the carelessness from causing... this thing that happened anyway.
> Sure, maybe it prevented even more events like this from happening. But still.
Because the point of the audit is not to prevent hacks; it's to prove that you did your due diligence to not get hacked, so the fact that a hack happened is not your fault.
You can hide under the umbrella of "sometimes hacks happen no matter what you do".
CYA is the reason you do the audit. But the reason for the audit's existence and requirement is definitely so that hacks don't happen. Don't tell me regulatory agencies require things so that companies can hide behind them.
Who is them though? The airport that used this software? You can't put all the blame on the software vendor. It can be a good and useful component when not relied on exclusively for the functioning of the airport. Not relying on a single point of failure should be the responsibility of the business customer who knows the business context and requirements.
You will have each company person pointing at the others. That's why you have contracts in place.
You won't ever have real consequences for executives and real decision makers and stakeholders because the same kind of people make the laws. They are friends, revolving door etc.
There's no responsibility at any level, is the thing. Those people who couldn't fly might get a rebooking and some vouchers sent out to them, but they won't really get made whole. The airport knows they won't really be on the hook, so they don't demand real responsibility from their vendors, and so on.
In the grand scheme of things, being able to fly around the globe at these prices is a pretty good deal, even with these rare events taken into account. It's not like the planes fell out of the sky. If you must must definitely be somewhere at a time, plan to arrive one or two days earlier.
I don't even want to know how many mission critical systems automatically deploy open source software downloaded from github or (effectively random) public repositories.
Unlike Windows, there is at least the option to use curated software distributions such as Debian or RH that won't apply random stuff from upstream repositories.
If I were running an organization that needs these audits, I'd always have fallback procedures in place that would keep everything running even if all computers suddenly stop working, like they did today. General-purpose software is too fragile to be fully relied upon, IMO.
If a general-purpose computer must be used for something mission-critical, it should not have an internet connection and it should definitely not allow an outside organization to remotely push arbitrary kernel-mode code to it. It should probably also boot from a read-only OS image so that it could always be restored to a known-good state by just rebooting.
Organizations don't want to increase risk by listening to an employee with their personal opinion. Orgs want an outside vendor who they can point at and say "it's their fault", and await a solution. Employees going rogue and not following the vendor defined SW updates is a much higher risk than this particular crisis.
Isn't there a way to schedule the updates? With Windows updates, when I used to work at a firm with a critical system running on Windows, we had main and DR servers, and the updates were scheduled to first roll out on the main server and a day later, I think, on the DR, which has saved us at least once in the past from a bad Windows update...
More or less. You can set up update policies and apply those to subsets of your machines. You can disable updates during time blocks, or block them altogether. There's also the option of automatically installing the "n-1" update.
We run auto n-1 at work, but this also happened at the same time on my test machine which runs "auto n". It never happened before, so this looks like something different from the actual installed sensor version, especially since the latest version was released something like a week ago.
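To illustrate the ring idea on the customer side, here is a sketch of how an org might model its update rings internally. The field names and host patterns are invented, and, as the comments above note, sensor-version policies apparently did not gate this particular channel-file content anyway:

    # Sketch of customer-side update rings: a small test ring tracks the latest
    # sensor ("auto n"), everything else lags one version behind ("n-1").
    # Field names are invented; version policies reportedly did not gate this
    # particular channel-file update.
    UPDATE_RINGS = {
        "test":  {"hosts": ["test-vm-01", "test-vm-02"], "sensor_policy": "auto n"},
        "early": {"hosts": ["it-laptop-*"],              "sensor_policy": "auto n-1"},
        "prod":  {"hosts": ["*"],                        "sensor_policy": "auto n-1",
                  "maintenance_window": "Sat 02:00-06:00"},
    }

    def policy_for(ring: str) -> str:
        return UPDATE_RINGS[ring]["sensor_policy"]

    print(policy_for("test"), policy_for("prod"))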
It's a big stretch to call this the regulator's fault when it's a basic lack of testing by Microsoft and/or Crowdstrike. If a car manufacturer made safety belts that broke, you don't blame the regulators.
The root cause is automatic, mindless software update without proper testing - nothing to do with regulators.
That's some very twisted logic. If I expect someone to clean the kitchen as part of the restaurant closing checklist, and they fuck it all up, would I blame the checklist, or the person doing the work?
You blame the person fucking it up. In this case, it's someone who only cares about checking a box. Or someone who pushes broken shit.
If this person simultaneously fucks up millions of kitchens around the world, you do not blame that person. You blame the checklist which encouraged giving a single person global interlocked control over millions of kitchens, without any compartmentalization.
> If this person simultaneously fucks up millions of kitchens around the world, you do not blame that person.
No, you definitely do, even more than before. Let's say for example that the requirement is to disinfect anything that touches food. And the bleach supplier fucks it all up. You blame the bleach supplier. You don't throw out the disinfectant requirement.
Most enterprises will have teams of risk and security people. They will be asking who authorized deployment of an untested update into production. If CrowdStrike deployments cannot be managed, then they will switch to a product which can be managed.
Well, if you fail at compliance, you can be fired and sometimes even sued. If your compliance efforts cause a system-wide outage - nobody's to blame, shit happens. I predict this screwup will end up with zero consequences for anyone who made the decisions that led to it, too. So how else do you expect this system to evolve, given this incentive structure?
> Orgs are doing this because they are more scared of failing an audit than they are of the consequences failure of the underlying systems the audits are supposed to be protecting.
If a failed audit is the big scary monster in their closet, then it sounds like the senior leadership is not intimately familiar with the technology and software in general, and is unable to properly weigh the risks of their decisions.
More and more companies are becoming software companies whether they like it or not. The software is essential to the product. And just like you would never want a non-lawyer running your law firm, you don't want a non-software person running your software company.
Very sharp and to the point, this comment. I would like to add that in large companies the audit will, in my experience, very often examine documents only -- not actual configuration or code.
This is all well deserved for executives who trust MS to run their businesses. If you have the resources, like a bank, it is a crime to put your company in the hands of MS.
It's possible that CrowdStrike heavily incentivises being left to update itself.
Removing the features that would let sysadmins handle updates themselves (even via the installer) would definitely be one way; another could be aggressive focus-stealing nags (similar to Windows' own), which in a server environment can cause major issues, especially when automating processes on Windows (since you need to close the program to update it).
I think it's easy to blame the sysadmins, but I would be remiss if I didn't point out that in the Windows world we have slowly been accepting these automatic-update dark patterns, while alternative (more controlled) mechanisms have been removed over time.
I almost don't recognise the deployment environment today compared to what it was in 2004; and yes, 20 years is a long time, but the total loss of control over what a computer is doing is only going to make issues like this significantly more common.
They say it was caused by a faulty channel file. I don't know what a channel file is, and they claim not to rely on virus signatures, but an antivirus product typically needs the latest signatures all the time and probably polls for them once an hour or so. So I'm not surprised that an antivirus product wants to stay hyper-updated, with updates rolled out immediately to everyone globally.
No, I'm not surprised either. But if you're operating at this kind of scale and with this level of immediate roll-out, what I would expect are:
* A staggered process for the roll-out, so that machines that are updated check in with some metrics that say "this new version is OK" (aka "canary deployment"), and the update is paused/rolled back if not
* Basic smoke testing of the files before they're pushed to any customers
* Validation that the file is OK before accepting an update (via a checksum or whatever, matched against the "this update works" automated test checksums)
* Fuzz tests that broken files don't brick the machine
Literally any of the above would have saved millions and millions of dollars today.
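A minimal sketch of the first three points, assuming a checksum recorded when the update passed automated testing and a per-host health check; every name here is hypothetical, and a real pipeline would of course push to real machines and use real telemetry:

    import hashlib
    import random

    # Checksum recorded when this build passed the automated tests (hypothetical).
    EXPECTED_SHA256 = hashlib.sha256(b"known-good channel file").hexdigest()

    def file_is_valid(payload: bytes) -> bool:
        # Basic smoke/validation step: refuse anything that doesn't match the
        # checksum of the build that actually passed testing.
        return hashlib.sha256(payload).hexdigest() == EXPECTED_SHA256

    def host_healthy_after_update(host: str) -> bool:
        # Stand-in for "the machine checked back in and reports OK".
        return random.random() > 0.01

    def roll_out(payload: bytes, hosts: list[str], batch_size: int = 10) -> None:
        if not file_is_valid(payload):
            print("validation failed: refusing to push to any customer")
            return
        for start in range(0, len(hosts), batch_size):
            batch = hosts[start:start + batch_size]
            if not all(host_healthy_after_update(h) for h in batch):
                print(f"canary batch starting at {batch[0]} unhealthy: halting roll-out")
                return
        print("roll-out completed")

    if __name__ == "__main__":
        roll_out(b"known-good channel file", [f"host-{n}" for n in range(100)])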
In any kind of serious environment the admin should not have any interaction with any system's screen when performing any kind of configuration change. If it can't be applied in a GPO without any interaction it has no business being in a datacenter.
1) There are situations where you will interact with the desktop, if only for debugging. Saying anything else is hopelessly naive. For example: how do you know your program didn't start due to missing DLL dependencies? There is no automated way: you must check the desktop, because Windows itself only shows a popup.
2) What displays on the screen is absolutely material to the functioning of the operating system.
The Windows shell (UI) is intrinsically intertwined with the NT kernel. There have been attempts to create headless systems with it (Windows Core, etc.); however, in those circumstances, if there is a popup, that UI prompt can crash the process because the dependencies needed to show the pop-up aren't there.
If you're running Windows Core and a program crashes when auto-updates are not enabled... well, you're more likely than not to enable updates to avoid the crash; after all, what's the harm.
Otherwise, you will be aware that when a program has a UI (Windows console), the execution speed of the process is linked to the draw rate of the screen, so a faster draw rate or fewer things on screen can actually affect performance.
Those who write Linux programs know this is also true on Linux (writes to STDOUT are blocking); however, you can't put I/O on another thread in the same way on Windows.
Anyway, all this to say: it's clear you've never worked in a serious Windows environment. I've deployed many thousands of bare-metal Windows machines across the world, and of course it was automated, from PXE/BIOS to application serving on the internet, the whole 9 yards. But believing that the UI has no effect on administration is just absurd.
> So we need to hold regulatory bodies accountable as well...
My bank, my insurer, my payment card processor, my accounting auditor and probably others may all insist I have anti-virus and insist that it is up to date. That is why we have to have these systems. However, I used to prefer systems that allowed me to control the update cycle and push it to smaller groups.
> So we need to hold regulatory bodies accountable as well - when they frame regulation such that organisations are cornered into this they get to be part of the culpability here too.
Replacing common-law liability with prescriptive regulation is one of the main drivers of this problem today. Instead of holding people accountable for the actual consequences of their decisions, we increasingly attempt to preempt their decisions, which is the very thing that incentivizes cargo-cult "checkbox compliance".
It motivates people who otherwise have skin in the game and immediate situational awareness to outsource their responsibility to systems of generalized rules, which by definition are incapable of dealing effectively with outliers.
No doubt there will be another piece of software mandated to check up on the compliance software. When that causes a global IT outage, software that checks up on the software that checks up on the compliance software will be mandated.
When Crowdstrike messes up and BSODs thousands of machines, they have a dedicated team of engineers working the problem and can deliver a solution.
When your company gets owned because you didn't check a compliance checkbox, it's on you to fix it (and you may not even currently have the talent to do so).
We see similar risk tradeoffs in cloud computing in general; yes, hosting your stuff on AWS leaves you vulnerable to AWS outages, but it's not like outages don't happen if you run your own iron. You're just going to have to dispatch someone a three hour drive away to the datacenter to fix it when they do.
CrowdStrike has various auto update policies, including not to automatically update to the latest version, but to the latest version -1 or even -2. Customers with those two policies are also impacted.
> Orgs are doing this because they are more scared of failing an audit than they are of the consequences failure of the underlying systems the audits are supposed to be protecting.
I've been someone in one of those audit meetings, defending decisions made and defending things based on the records we keep, and I understand this: it is both a deeply unpleasant and expensive affair to pull people off current projects and place them before auditors for several hours to debate what compliance actually means.
It's even worse. The consultants who run the audits (usually recent business-school grads) work with other consultants who shill the third-party software and the implementation work.
So true! It seems like all of these were invented to create another market for B2B SaaS security, audit, monitoring, etc. companies. Nobody cares about actual security or infrastructure anymore. Everything is just buying a subscription from some random SaaS company, not checking its permissions and grant policies, and ticking boxes because... compliance.
It depends on what your position is. Are you there to actually provide security to your org, or to tick a box in an audit? If both, which is more important? Because failing an audit has real consequences, while having breaches in security has almost none. Just look at the credit score companies.
Regulation or auditors rarely require specific solutions. It's the companies themselves that choose to achieve the goals by applying security like tinctures: "security solutions". The issue is that the tinctures are an approved remedy.
Zscaler is such insane garbage. Legitimately one of the worst pieces of software I have ever used. If your organization is structurally competent, it will never use Zscaler and will just use wireguard or something.
It's VERY easy to blame CrowdStrike and companies like them, as they are the ones LOBBYING for those checkboxes. Both Zscaler and CrowdStrike spent 500K last year on lobbying.
There's a reasonable number of circumstances where cybersecurity standards get imposed on organisations: by insurance, by a customer, or by the government (especially if it is a customer). These standards are usually fairly reasonably written, but they are also necessarily vague and say stuff like "have a risk assessment" and "take industry-standard precautions". This vagueness can create a kind of escalation ratchet: when people tasked with (or responsible for) compliance are risk-averse and/or lazy, they will essentially just try to find as many precautions as they can and throw them all in blanket-style, because it's the easiest and safest way to say that you're in compliance. This is especially true when you can more or less just buy one or two products which promise to tick every possible box. And if something else pops up as a suggestion, they'll throw that in too. Which then becomes the new 'industry standard', and it becomes harder to justify not doing it, and so on.
It's easy to blame CrowdStrike because they're the ones to blame here. They lit a billion system32 folders on fire with an untested product and push out fear-mongering, corny marketing material. Turns out you should be afraid.
> All over the place I'm seeing checkbox compliance being prioritized above actual real risks from how the compliance is implemented.
Because if everyone is doing their job and checks their box, they're not gonna get fired. Might be out of a job because the company goes under, but hey, it was no one's fault, they just did their job.