Hacker News
No More Blue Fridays (brendangregg.com)
481 points by moreati 3 months ago | 270 comments



> Once Microsoft's eBPF support for Windows becomes production-ready, Windows security software can be ported to eBPF as well.

This doesn’t seem grounded in reality. If you follow the link to the “hooks” that Windows eBPF makes available [1], it’s just for incoming packets and socket operations. IOW, MS is expecting you to use the Berkeley Packet Filter for packet filtering. Not for filtering I/O, or object creation/use, or any of the other million places a driver like Crowdstrike’s hooks into the NT kernel.

In addition, they need to be in the kernel in order to monitor all the other 3rd party garbage running in kernel-space. ELAM (early-launch anti-malware) loads anti-malware drivers first so they can monitor everything that other drivers do. I highly doubt this is available to eBPF.

If Microsoft intends eBPF to be used to replace kernel-space anti-malware drivers, they have a long, long way to go.

[1]: https://microsoft.github.io/ebpf-for-windows/ebpf__structs_8...


Yes, we know eBPF on Windows must attach to events equivalent to those on Linux, but given there are already many event sources and consumers in Windows, the work is to make eBPF another consumer -- not to invent instrumentation frameworks from scratch.

Just to use an analogy: Imagine people do their banking on JavaScript websites with Google Chrome, but if they use Microsoft Edge it says "JavaScript isn't supported, please download and run this .EXE". I'm not sure we'd be asking "if" Microsoft would support JavaScript (or eBPF), but "when."


This assumes eBPF becomes the standard. It's not clear Microsoft wants that. They could create something else which integrates with dot net and push for that instead.

Also this problem of too much software running in the kernel in an unbounded manner has long existed. Why should Microsoft suddenly invest in solving it on Windows?


Microsoft has invested in solving this for at least two decades, probably longer. They are just using a different (arguably worse) approach to this than the Unix world.

In Windows 9x anti-malware would just run arbitrary code in the kernel that hooked whatever it wanted. In Windows XP a lot of these things got proper interfaces (like the file system filter drivers to facilitate scanning files before they are accessed, later replaced by minifilters), and the 64 bit edition of XP introduced PatchGuard [1] to prevent drivers from modifying Microsoft's kernel code. Additionally Microsoft is requiring ever more static and dynamic analysis to allow drivers to be signed (and thus easily deployed).

This is a very leaky security barrier. Instead of a hardware-enforced barrier like the kernel-userspace boundary, it's an effort to get software running at the same protection level to behave. PatchGuard is a cat-and-mouse game Microsoft is always losing, and the analysis mostly helps against memory bugs but can't catch everything. But MS has invested a lot of work over the years in attempts to make this path work, so expecting future action isn't unreasonable.

[1] https://en.wikipedia.org/wiki/Kernel_Patch_Protection


This is a weird reading of history. Microsoft has spent tons of effort getting as much code out of the kernel as possible: Windows drivers used to be almost all kernel-mode, now they're nearly all in userspace and you almost never need to write a kernel-mode Windows driver unless you're doing something with deep OS hooks (like CS was, although apparently even that wasn't actually necessary). The safeguards on kernel code are for the tiny sliver of use cases left that need it, it is not Microsoft patching individual holes on the leaky ship.

They haven't yet gone as far as Apple in banning third-party kernel-mode code entirely, but I wouldn't be surprised if it's coming.


A thing I think a lot of people don't include in their premises about Crowdstrike is that they're probably the most significant aftermarket endpoint security product in the world (they are what Norton and McAfee were in 2000), which means they're more than large enough for malware to target their code directly, which creates interesting constraints for where their code can run.

I'm not saying I'd run it (I would not), just that I can see why they have a lot of kernel-resident code.


> I'm not saying I'd run it (I would not), just that I can see why they have a lot of kernel-resident code.

What would you run instead, or is there a different way of thinking about the problem it addresses that obviates the need?


Not the parent, but security through compartmentalization seems like a more robust approach. See: Qubes OS.


Microsoft made the reasonable point that locking 3rd parties out of the kernel might have resulted in legal challenges in the EU [0]. It is an interesting case where everyone is certain in hindsight that they would have been ok with MS blocking access, but it is less obvious that they would have taken that view if MS had pressured a bunch of security products out of the kernel with no obvious prompting.

[0] https://www.theregister.com/2024/07/22/windows_crowdstrike_k...


The 2009 agreement with the EU mentioned in the article seems to be the one about the integration of the Internet Explorer (IE) into MS Windows.[1] But it only applied to IE and the commitment was limited to 5 years.[2]

Or is the article referring to something else?

I see no reason why the EU should object to Microsoft's adoption of eBPF as long as MS Defender simply uses the same API that is available to all competitors.

[1] Here is the original text: https://ec.europa.eu/competition/antitrust/cases/dec_docs/39...

[2] See section 4, paragraph 20.


Apple took the lead on this front. It has closed easy access to the kernel by apps, and made a list of APIs to try and replace the lost functionality. Anyone maintaining a kernel module on macOS is stuck in the past.

Of course, the target area of macOS is much smaller than Windows, but it is absolutely possible to kick all code, malware and parasitic security services alike, from accessing the kernel.

The safest kernel is the one that cannot be touched at runtime.


I don't think Microsoft has a choice with regards to kernel access. Hell, individuals currently use undocumented NT APIs. I can't imagine what happens to backwards compat if kernel access is closed.

Apple's closed ecosystem is entirely different. They'll change architectures on a whim and users will go with the flow (myself included).


But Apple doesn’t have the industrial and commercial uses that Linux and Windows have. Where you can’t suddenly switch out to a new architecture without massive amounts of validation costs.

At my previous job they used to use Macs to control scientific instrumentation that needed a data acquisition card. Eventually most of the newer product lines moved over to Windows but one that was used in a validated FDA regulated environment stayed on the Mac. Over time supporting that got harder and harder: they managed through the PowerPC to Intel transition but eventually the Macs with PCIe slots went away. I think they looked at putting the PCIe card in a Thunderbolt enclosure. But the bigger problem is guaranteeing supply of a specific computer for a reasonable amount of time. Very difficult to do these days with Macs.


> validated FDA regulated environment stayed on the Mac

Given how long it takes to validate in a GxP environment, and the cost, this makes sense.


Sounds like they need a nice Hackintosh for that validated FDA regulation app-OS-HW combo.


Good luck getting that through a regulated company’s Quality Management System or their legal department. Way too much business risk and the last thing you want is a yellow or red flag to an inspector who can stop ship on your product until all the recall and remediation is done.


Satya Nadella recently said that Microsoft will now put security above every other value at Microsoft. He even specifically singled out backwards compatibility.

https://blogs.microsoft.com/blog/2024/05/03/prioritizing-sec...

Microsoft is a big boat. It takes a long time to turn if the captain orders it. But the captain has ordered it. I expect the sacrosanct nature of back compat to eventually die. Windows will never turn into the moving target that macOS is, but I expect a lot of old stuff to just stop working, when security concerns dictate they should be removed.


> The safest kernel is the one that cannot be touched at runtime.

Can you expand what you mean here? Because depending on the application you are running, you will need at least talk with some APIs to get privileged access?


Being allowed to talk to the kernel to get info is different from running with the same privileges (basically being able to read/write any memory).


Yeah, Apple doesn’t allow any user code to run in kernel mode without significant hoops (the kernel is code signed) and tries to provide a user space API (e.g. DriverKit) as an alternative for the missing functionality.

Some things (FUSE) are still annoying though.


> Some things (FUSE) are still annoying though.

That should get much easier in macOS Sequoia with FSKit.

https://developer.apple.com/documentation/fskit/


Microsoft have been driving the work to make eBPF an IETF industry standard.


...just like they did with Kerberos! And just like with Kerberos, they'll define a standard and then refuse to follow it. Instead, they will implement subtle changes in the Windows implementation that make solutions using Windows eBPF incompatible with anything else, making it much more difficult to write software that works with every platform's eBPF (or even just its output).

Everything's gotta be different in Windows land. Otherwise, migrating off of Windows land would be too easy!

In case you were wondering what Microsoft refused to implement with its Kerberos implementation it's the DNS records. Instead of following the standard (they wrote!) they decided that all Windows clients will use AD's Global Catalog to figure out which KDC to talk to (e.g. which one is "local" or closest to the client). Since nothing but Windows uses the Global Catalog they effectively locked out other platforms from being able to integrate with Windows Kerberos implementation as effectively (it'll still work, just extremely inefficiently as the clients won't know which KDC is local so you either have to hard-code them into the krb5.conf on every single device/server/endpoint and hope for the best or DNS-and-pray you don't get a Domain Controller/KDC that's on an ISDN line in some other country).
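For the curious, the non-Windows side of this is visible in a krb5.conf fragment. MIT Kerberos supports both the standards-based path (DNS SRV lookup of records like `_kerberos._tcp.EXAMPLE.COM`) and the hard-coded workaround the comment describes; realm and hostnames below are placeholders:

```ini
# /etc/krb5.conf -- two ways a non-Windows client can locate a KDC

[libdefaults]
    default_realm = EXAMPLE.COM
    # Standards path: resolve _kerberos._tcp.EXAMPLE.COM SRV records.
    # Works only if those records exist; it cannot tell the client
    # which KDC is topologically closest (the AD Global Catalog can).
    dns_lookup_kdc = true

[realms]
    EXAMPLE.COM = {
        # Workaround path: hard-code KDCs on every endpoint
        # and hope the list stays accurate.
        kdc = dc1.example.com
        kdc = dc2.example.com
    }
```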


This to me ascribes way too much mustache-twirling villainy to Microsoft, and fails to account for the fact that engineers surely make many of these implementation-detail decisions. These engineers aren't incentivized to create lock-in. I think it's more likely that sometimes, for a feature to play well with other existing parts of the Windows ecosystem, compromises are made on standards compliance. Microsoft may have shipped those related interfaces before this standard had been hashed out, leaving a choice between breaking everything and not perfectly following the standard.

Note: I’m not a Windows dev so I can’t speak to specifics of anything like your Kerberos example. I just don’t believe MS is full of evil engineers, nor that Satya Nadella visits cubicles to promote lock-in practices.


> These engineers aren’t incentivized to create lock-in.

Ever heard of something called “money”?

> I think it’s more likely that sometimes for a feature to play well with other existing parts of the Windows ecosystem, compromises are made to the standards-compliance.

So you're basically saying that you're too young to remember the “good” old days of Embrace, Extend, Extinguish, right...?


Embrace, extend, ...


This doesn't really seem like their strategy anymore. It's not like Edge directly interprets TypeScript, for example. While they embraced and extended JavaScript, any extinguishing seems to be on the technical merits rather than corporate will.

In the case of security scanners that run in the kernel, we learned this weekend that a market need exists. The mainstream media blamed Crowdstrike's bugs on "Windows". Microsoft would likely like to wash its hands of future events of this class. Linux-like eBPF is a path forward for them that allows people to run the software they want (work-slowers like Crowdstrike) while isolating their reputation from this software.


Third step of the strategy mentioned by GP is enacted when market share allows it, and Edge is far below Chrome atm.


> Why should Microsoft suddenly invest in solving it on Windows?

If they can continue to avoid commercial repercussions for failing to provide a stable and secure system, then society should begin to hold them to account and force them to.

I’m not necessarily advocating for eBPF here, either. If they want to get there through some “proprietary” means, so be it. Apple is doing much the same on their end by locking down kexts and providing APIs for user mode system extensions instead. If MS wants to do this with some kind of .net-based solution (or some other fever dream out of MSR) then cool. The only caveat would seem to be that they are under a number of “consent decree” type agreements that would require that their own extensions be implemented on a level playing field.

So what. Windows Defender shouldn’t be in the kernel any more than CrowdStrike. Add an API. If that means being able to send eBPF type “programs” into kernel space, cool. If that means some user mode APIs, cool.

But lock it down already.


Not necessarily disagreeing with you, but as far as 'avoiding commercial repercussions' goes... Windows' share of the desktop OS market has been declining for almost 20 years at the rate of about 1% per year. And about 70% of the global installed base is still on Windows 10.

They have a long way to fall, but I'm not sure that if I'm a regulator I look at that and say there needs to be some kind of intervention by society apart from what market forces are gradually doing anyway.


Windows development on eBPF is slower than Linux development on eBPF, so it will never be supported. A source code user licensee could develop it faster, but who licenses Windows source and already has great eBPF experience?


Microsoft already has an extensible file system filter capability in place, which is what current AV uses. Does it make sense to add eBPF on top of that and if so, are there any performance downsides, like we see with file system filters?


They've done a technology transition once already from legacy file system filter drivers to the minifilter model. If they see enough benefit to another change, it wouldn't be unprecedented.

Mind you, it looks like after 20-ish years Windows still supports loading legacy filter drivers. Given the considerable work that goes into getting even a simple filesystem minifilter driver working reliably, it's safe to assume that we'd be looking at a similarly protracted transition period.

As to the performance, I don't think the raw infrastructure to support minifilters is the major performance hit. The work the drivers themselves end up doing tends to be the bigger hit in my experience.

Some background for the curious:

https://www.osr.com/nt-insider/2019-issue1/the-state-of-wind...


I hope though that Microsoft will double down on their eBPF support for Windows after this incident.


Keep in mind they don't just allow any old code to execute in the kernel.

They do have rigorous tests (WHQL); it's just that Crowdstrike decided that was too burdensome for their frequent updates and chose to inject code from config files (thus bypassing the control).

The fault here is entirely with Crowdstrike.


Is there any evidence that the config files had arbitrary code in them? The only analysis I'd seen so far indicated a parsing error loading a viral signature database that was routinely updated, but in this case was full of garbage data.


Perhaps not verified, but some smart people do have convincing arguments:

https://youtu.be/wAzEJxOo1ts?si=UNNxAN27VV1E6mcP&t=505


Any article/blog/text-that-can-be-read?


Don't bother. He just repeats a tweet claiming a null+offset dereference, plus the speculation that the null was read from the .sys file.


How rigorous are the tests if faulty data can brick the machine?


Not rigorous enough to have detected this flaw in the kernel sensor, although effectively any bug in this situation (an AV driver) can brick a machine. I imagine WHQL isn't able to find every possible bug in a driver you submit to them, they're not your QA team.


Doubt it. Microsoft is clearly over Windows. They continue to produce it but every release feels like "Ugh, fine, since you are paying me a ton of money."

Internally, Microsoft is running more and more workloads on Linux, and externally, I've had the .NET team tell me more than once that Linux is the preferred environment for .NET. The SQL Server team continues to push hard for Linux compatibility with every release.

EDIT: Windows Desktop gets more love because they clearly see that as important market. I'm talking more Windows Server.


> SQL Server team continues to push hard for Linux compatibility with every release.

It's kinda funny that the DB that was once a fork of Sybase that was ported to Windows is trying to make its way back to Unix.


This claim about SQL Server: Is it due to disk access being slower from NT kernel compared to Linux kernel?


I had read previously from an unverified SQL Server engineer that the thing they wanted most (with Linux support) was proper containerization (from a developer perspective). Apparently containers on Windows just don't cut it (which is why nobody uses them in production). Take it with a grain of salt though.

I don't think they'd ever admit that filesystem performance was an issue (though we all know it is; NTFS is over 30 years old!).


> though we all know it is; NTFS is over 30 years old!

ext2, which is forwards compatible with ext3 and ext4, is slightly older than NTFS


It's my understanding, having done benchmarks on file access on Windows, that NTFS itself is not the problem. It's old, but the revision of the on-disk structure that we use today hails from Windows XP, and it's about on par in terms of feature parity (and backwards compatibility, given that I can still read native NT 3.51 volumes on Windows 11) with ext4.

A lot of the weirdly bad performance comes from all of the machinery that Windows wraps around file access for things like filter drivers. As long as you don't, say, indiscriminately follow every CreateFile() with a CloseHandle() and instead treat handle closure like garbage collection, you can actually eke out pretty good performance.

That all said, yeah, Windows containers are less than great for what I'd argue is one strikingly glaring flaw: Docker container images are built from smss.exe upward. That makes them not immediately portable between ntoskrnl.exe releases.


It's just easier for everyone involved (outside Windows GUI clicker admins) if it runs on Linux. Containerization is easier, configuration is easier and operating system is much more robust.


The operating system can be more robust, depending on admin skill. Let idiots configure and operate your RHEL and you may not get those five nines.

There are costs to it, in the form of architectural baggage and slower iteration, but what windows brings to the table is a deck swept mostly clear of footguns. That can give you a different form of robustness.


There's something very wrong with Windows disk access, you can see it easily by trying to run a Windows desktop with rotating disks.

But SQL Server is in the unique position of being able to optimize Windows for their own needs. So they shouldn't have this kind of problem.


The file system is almost 30 years old.

When NTFS came out it was way better than anything on Linux. Heck even in 2006 NTFS was better.

But Linux keeps getting new file systems while Windows keeps NTFS.


AFAIK, NTFS is a perfectly ok design. But the Windows file system never performed well. This is probably not for architectural reasons.

And no, it didn't perform better at NT4, XP, or Win7 times.


They aren't over windows. They continue to be incredibly interested in and actively developing how much money they can suck from their users. Especially via various forms of ads.

But yeah, kernel features are few and far between.


I believe the term you are looking for is "rent seeking". Other than visual changes, what new functionality does Windows 11 actually have that Windows XP didn't have? (I'm being generous with XP, because actually 95 was already mostly internet ready.) Yet how many times have many of us paid for a Windows license on a new computer or because the old version stopped getting updates?


> Other than visual changes, what new functionality does Windows 11 actually have that Windows XP didn't have?

Off the top of my head, limiting myself to just NT kernel stuff: WSL and Hyper-V, pseudo-terminals, condvars, WDDM, DWM, elevated privilege programs on the same desktop, font driver isolation, and limiting access to win32k for sandboxing.


> what new functionality does Windows 11 actually have that Windows XP didn't have?

Off the top of my head, built-in bluetooth support, an OS-level volume mixer, and more support for a wider variety of class-compliant devices. I'm sure there are a lot more, and if you actually care about the answer, I don't think it would be hard to find.


All of this could've been added to XP, right?


I don't know.

If it could, then XP would just be Windows 11. What's the objection here?


Simple patches/upgrades vs tricking people into thinking you've made a whole new piece of software. Linux, BSD, and Apple roll out OS upgrades with new functionality without charging for the new versions.


That's one perspective I suppose. I have a MacBook on my desk at work solely for testing in Safari. I can no longer use it for that purpose because it won't even let me upgrade the OS. That sounds like a whole new piece of software to me. Windows actually has been substantially re-written. I guess MacOS has also? It seems more honest to me call it a different product.

Not that I strongly care much one way or another.


Longhorn was a significant rewrite, actually. The two big upheavals in windows history were: 2000, which essentially scrapped the 95 lineage in favour of NT; and Vista, which kicked a lot of 3rd-party crap out of the kernel and added a quality gate for drivers.


The Win95 lineage still existed, in the form of Windows ME, alongside 2000. XP is when they got rid of it and unified the two product lines.


> Other than visual changes, what new functionality does Windows 11 actually have that Windows XP didn't have?

Modern crypto ciphersuites that aren't utterly broken? Your best options for symmetric crypto with XP are 3DES (officially retired by NIST as of this year) and RC4 (prohibited in TLS as of RFC 7465).

(And if you think 3DES isn't totally broken by itself, you're right... except for the part where the ciphersuite in question is in CBC mode and is vulnerable to BEAST. Thanks, mandated ciphersuites.)


> Other than visual changes, what new functionality does Windows 11 actually have that Windows XP didn't have?

XP->Vista alone brought a bunch of huge changes that massively improved security (UAC), capability (64 bit desktops), and future-proofing (UEFI) among many many other things.

Some helpful Wikipedia editors have answered this question in excessive detail, so I'm just going to link those for more info. Also I'm going to start with what XP/2003 changed, both because it makes a good comparison and because I'd argue 2000/NT 5.0 is the root of the modern Windows era. Your next sentence after the quote implies you probably won't have a problem with that.

* XP/2003: https://en.wikipedia.org/wiki/Features_new_to_Windows_XP

* 2003R2: https://en.wikipedia.org/wiki/Windows_Server_2003#Windows_Se...

* Vista: https://en.wikipedia.org/wiki/Features_new_to_Windows_Vista

* 2008: https://en.wikipedia.org/wiki/Windows_Server_2008#Features

* 7: https://en.wikipedia.org/wiki/Features_new_to_Windows_7

* 2008R2: https://en.wikipedia.org/wiki/Windows_Server_2008_R2#New_fea...

* 8: https://en.wikipedia.org/wiki/Features_new_to_Windows_8

* 2012: https://en.wikipedia.org/wiki/Windows_Server_2012#Features

* 8.1: https://en.wikipedia.org/wiki/Windows_8.1#New_and_changed_fe...

* 2012R2: https://en.wikipedia.org/wiki/Windows_Server_2012_R2#Feature...

* 10: https://en.wikipedia.org/wiki/Features_new_to_Windows_10

* 2016: https://en.wikipedia.org/wiki/Windows_Server_2016#Features

* 2019: https://en.wikipedia.org/wiki/Windows_Server_2019#Features

* 2022: https://en.wikipedia.org/wiki/Windows_Server_2022#Features

* 11: https://en.wikipedia.org/wiki/Features_new_to_Windows_11

* 2025: https://learn.microsoft.com/en-us/windows-server/get-started...

Obviously some of this will be "fluff" and that's up to your own personal definitions, but to act like there haven't been significant changes in every major revision is just nonsense.


Well that Windows 11 article is laughably short, considering it's a major version. But I appreciate you taking the time to compile all those links.

My point is the vast majority of this stuff is either "fluff" or cosmetic changes or random things that 99% of users don't use OR they are security and bug patches. HN users are not typical, so I'm sure some of the Windows updates are very important for people like us.

Maybe to Microsoft this is a significant rewrite: "The Calculator has been completely rewritten in C# and includes several new features." (Just picked at random.) Ok, but like why? Who cares? What was wrong with the last calculator? Absolutely nothing. Also who even uses Windows calculator instead of Excel or their phone? Was calculator rewritten to justify an FTE somewhere at Microsoft?

I'm not trying to troll, but I am trying to be contrarian. I honestly feel like a majority of desktop users don't really think too hard about their OS. None of the existing OSes should be significantly rewritten unless they are just completely flawed. Like say Apple decides to ditch the microkernel or Linux goes to Rust. Most people need stability and security, not new calculator features or different button shading. I'm singling out Microsoft for being the only one that rent seeks for superfluous changes. Apple is notoriously bad about wasting users time with constant updates for dumb stuff, but at least it's free, except for the cost of time while your computer slowly reboots and updates.



SQL Server has supported Linux since 2017


I hate to dispute with someone like Brendan Gregg, but I'm hoping vendors in this space take a more holistic approach to investigating the complete failure chain. I personally tend to get cautious when there is a proposal that x will solve the problem that occurred on y date, especially 3 days after the failure. It may be true, but if we don't do the analysis we could leave ourselves open to blindspots. There may also be plenty of alternative approaches that should be considered and appropriately discarded.

I think the part I specifically dispute is the claim that the only negative outcome is wasted CPU cycles. That's likely the case for this class of bug, but there are plenty of failure modes where a bad ruleset could badly brick a system and make it hard to recover.

That's not to say eBPF-based security modules aren't the right choice for many vendors, just that we should understand what risks they do and do not avoid, and what part of the failure chain they particularly address.


Just because you have not been aware of the discussions on this topic that have been happening for years, doesn't mean that they haven't been happening. This isn't some new analysis formed 3 days after an incident, this is the generally accepted consensus among many experts who have been working in the space, introducing these new APIs specifically to improve stability, security, etc. of systems.


> I personally tend to get cautious when there is a proposal that x will solve the problem that occurred on y date, especially 3 days after the failure.

Microsoft has been working on eBPF for a few years at least.

https://opensource.microsoft.com/blog/2021/05/10/making-ebpf...

https://lwn.net/Articles/857215/

If you're really concerned, they have discussions and communication channels where you're invited to air your concerns. They're listed on their github:

https://github.com/microsoft/ebpf-for-windows

Who knows, maybe they already have answers to your concerns. If not, they can address them there.


This isn't right. If I need a system to run with a piece of code, then it shouldn't run at all if that piece of code is broken. Ignoring the failure is perverse. Let's say that the driver code ensures that some medical machine has safety locks (safeguards) in place to make sure that piece of equipment won't fry you to a crisp; I'd prefer that the whole thing not run at all rather than blithely operate with the safeguards disabled. It's turtles all the way down.


I think the premise is false? It's up to the eBPF implementor what to do in the case of invalid input; the kernel could choose to perform a controlled shutdown in that case. (I have no idea what e.g. Linux actually does here, but one could imagine worlds where the action it takes on invalid input is configurable.)

Also your statement is sometimes not true, although I certainly sympathise in the mainline case. In some contexts you really do need to keep on trucking. The first example to spring to mind is "the guidance computers on an automated Mars lander"; the round-trip to Earth is simply too long to defer responsibility in that case. If you shut down then you will crash, but if you do your best from a corrupted state then you merely probably crash, which is presumably better.


> I have no idea what e.g. Linux actually does here

If you attempt to load an eBPF program that the verifier rejects, the syscall to load it fails with EINVAL or E2BIG. What your user-space program then does is up to you, of course.


The medical machine software should just refuse to run, with an error message, if a critical driver was not loaded. The OS bricking causes way more trouble: an IT technician now has to fix the machine, where otherwise it would just be a matter of updating the faulty driver... Also, does your car refuse to start if it's missing water for the wiper?


Water for the wiper is a userland feature.

A 3rd party hooking into the kernel is that 3rd party's responsibility. It is like equipping your car with LPG -- THAT hooks into the engine (kernel). And when I had a faulty gas pressure sensor, my car actually halted (BSOD if you will) instead of automatically failing over to gasoline as it is designed to.

You can argue the car had no means to continue execution but the kernel has; however, an invalid kernel state can cause more corruption down the road. Or, as the parent even points out, carry out lethal doses of something.


Initially I was inclined to disagree ("these things should always fail safe") however with more and more stuff being pushed into the kernel it's hard to say that you're wrong or exactly where a line needs to be drawn between "minimally functional system" and "dangerously out of control system".

I think until we discover a technology that forces commercial software vendors to employ functioning QA departments none of this will really solve anything.


I agree that some system components should be treated as critical no matter what, but the software at issue in this case (Falcon Sensor or Antivirus more generally) is precautionary and only best effort anyways. I would wager the vast majority of the orgs affected on Friday would have preferred the marginally increased risk of a malware attack or unauthorized use over a 24 hour period instead of the total IT collapse they experienced. Further, there's no reason the bug HAD to cause a BSOD, it's possible the systems could have kept on trucking but with an undefined state and limitless consequences. At least with eBPF you get to detect a subset of possible errors and make a risk management decision based on the result.


I'm with you. What's critical, and what's not? Is it a big thing, or not a big thing? Is this particular machine more critical than the one over there? Security systems need to be at the lowest level, or else some shifty bastard will find a path around them. If it's at the lowest level, the downside of a failure is catastrophic, as we experienced last Friday. The carnage here is ultimately on CrowdStrike. The testing must have been slapdash at best, and missing at worst. eBPF changes nothing. The question is: should we fail, or carry on? eBPF doesn't help with that decision, it only determines the outcome from a system perspective. Any decision is a value judgement; it might be right or wrong, and its outcome either benign or deadly. Choices!


I like how Unison works for this reason. You call functions by cryptographic hash, so you have some assurance that you're calling the same function you called yesterday.

Updates would require the caller to call different functions which means putting the responsibility in the hands of the caller, where it should be, instead of on whoever has a side channel to tamper with the kernel.

You end up with the work-perfectly-or-not-at-all behavior that you're after, because if the function that goes with the indicated hash is not present, you can't call it, and if it is present you can't call it in any way besides how it was intended.
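A toy sketch of the idea (not Unison itself -- here the hash of a function's compiled bytecode stands in for Unison's real AST hashing, and all names are invented):

```python
import hashlib

# Content-addressed calls: functions are registered under the SHA-256 of
# their bytecode, and callers must name the exact hash. A changed body
# means a changed hash, so a stale caller fails loudly instead of
# silently running different code.
REGISTRY = {}

def register(fn):
    digest = hashlib.sha256(fn.__code__.co_code).hexdigest()
    REGISTRY[digest] = fn
    return digest

def call_by_hash(digest, *args):
    if digest not in REGISTRY:
        # work-perfectly-or-not-at-all: unknown hash is a hard error
        raise LookupError(f"no function with hash {digest[:12]}... loaded")
    return REGISTRY[digest](*args)

def double(x):
    return x * 2

DOUBLE = register(double)
result = call_by_hash(DOUBLE, 21)  # invokes exactly the code we registered
```

Updating `double` would produce a new hash, and callers would have to opt in to it explicitly.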


The system clearly already behaves that way (i.e. ignores failure) - after all, the fix was to simply delete the offending file. If that's an option, then the loader can do that too. It can, and perhaps even is, smarter, such as "fall back onto the previous version".

Furthermore, the reaction to a malformed state need not be "ignore". It could disable restricted user login; or turn off the screen.

If the worry is that this is viable to abuse by malware, well, if the malware can already rewrite the on-disk files for the AV, I wonder whether it's really a good idea to trust the system itself to be able to deal with that. It'd probably be safer to just report that up the security foodchain, and potentially let some external system take measures such as disable or restrict network access. Better yet, such measures don't even require the same capabilities to intervene in the system, merely to observe - which makes the AV system less likely to serve as a malware vector itself or to cause bugs like this.


> Ignoring the failure is perverse.

If the failed system is a security module, I think that's absolutely correct. If the system runs, without the security module, well, that's like forgetting to pack condoms on Shore Leave. You'll likely be bringing something back to the ship with you.

Someone needs to be testing the module, and the enclosing system, to make sure it doesn't cause problems.

I suspect that it got a great deal of automated unit testing, but maybe not so much fuzz and monkey (especially "Chaos Monkey"-style) testing.

It's a fuzzy, monkey-filled world out there...


Interesting analogy, but yes. If the module *is* necessary, well, it's necessary and nothing should work without it. Testing must have been a mess here.


> In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.

eBPF is fantastic, and it can be used for many purposes and improve a lot of things, but this is IMO overselling it. Assuming that BPF itself it free of bugs, it’s still a rather large sprawl of kernel hooks, and those hooks invoke eBPF code, which can call right back into the kernel. Here’s a list:

https://www.man7.org/linux/man-pages/man7/bpf-helpers.7.html

bpf_probe_read_kernel() is particularly heavily used, and it is not safe. It tries fairly hard not to OOPS or crash, but it is definitely not perfect.

The rest of that list contains plenty of things that will easily take down a system, even if they don't actually oops or panic in the process.

And, of course, any tool that detects userspace “malicious behavior” and stops it can start calling everything malicious, and the computer becomes unusable.

Meanwhile, eBPF has no real security model on the userspace side. Actual attachment of an eBPF program goes through the bpf() syscall, not through sensibly permissioned operations on the underlying kernel objects being attached to, and there is nothing whatsoever that confines eBPF to, say, a container that uses it. (See bpf_probe_read_kernel() -- it's fundamentally able to read all kernel memory.)

So, IMO, most of the benefit of eBPF over ordinary kernel C code is that eBPF is kind of like writing code in a safe language with a limited unsafe API surface. It's a huge improvement for this sort of work, but it is not perfect by any means.

> The verifier is rigorous -- the Linux implementation has over 20,000 lines of code

The verifier is absurdly complex. I'd rather see something based on formal methods than 20kLOC of hand-written logic.


How is it possible to panic using bpf_probe_read_kernel ? Can you give an example that works on the current kernel version?


I'm not sure that "panic" is the right word here. bpf_probe_read_kernel boils down to copy_from_kernel_nofault, which checks for an "allowed" address and then does the access. Any page faults turn into error returns instead of OOPSes. x86 disallows user addresses, the vsyscall page, and non canonical addresses.

Doing this from bpf assumes that all "allowed" addresses are side-effect-free and will either succeed or cleanly fault. Off the top of my head, MMIO space (including, oddities like the APIC page on CPUs that still have that) and TDX memory are not in this category.


> eBPF programs cannot crash the entire system because they are safety-checked by a software verifier and are effectively run in a sandbox.

Isn’t one of the purposes of an OS to police software? I get that this has to do with the OS itself, but what does watching the watchers accomplish other than adding a layer which must then be watched?

Why not reduce complexity instead of naively trusting that the new complexity will be better long term?


eBPF isn't "watching the watchers"; it's just a tool that lets other tools access low-level things in the kernel via a very picky sandbox. Think of it like this:

Old way: Load a kernel driver, hook into bazillions of system calls (doing whatever it is you want to do), and pray you don't screw anything up (otherwise you can get a panic, though not necessarily -- Linux is quite robust).

eBPF way: Just ask eBPF to tell you what you want by giving it some eBPF-specific instructions.

There's a rundown on how it works here: https://ebpf.io/what-is-ebpf/


> eBPF isn't "watching the watchers"…

> …via a very picky sandbox…

When the eBPF is a CrowdStrike mechanism, and eBPF is “picky,” it is clearly “watching the watchers.”


Right? I might spend a few minutes seeing if an AI chatbot can explain all the justifications that lead to using something like CrowdStrike in the first place.


This sounds like a cool technology, but this was the really egregious problem:

> There are other ways to reduce risks during software deployment that can be employed as well: canary testing, staged rollouts, and "resilience engineering" in general

You don't need a new technology to implement basic industry-standard quality control


Maybe we should start taking Fridays off to commemorate the event, which probably would have been less bad if more people spent less time with their nose to the grindstone and had more time to stop and think about how it all was shaping up and how they could influence that shape.


> The verifier is rigorous -- the Linux implementation has over 20,000 lines of code -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington). The safety this provides is a key benefit of eBPF, along with heightened security and lower resource usage.

Wow, 20k is not exactly encouraging. Besides the extra attack surface, who can vouch for such a large code base?


I had exactly the same thought. I don’t know if that 20k number was supposed to inspire confidence, but for me it did the opposite. It would have inspired confidence if it was 300 lines of code.

My impression is that the WebAssembly verifier is much simpler.


If the filters are loaded at boot and hook into everything then a bug can still lock down the system to a point where it can't be operated or patched anymore (e.g. because you loaded an empty whitelist). So it could end up replacing a boot loop with another form of DoS.

If Microsoft includes a hardcoded whitelist that covers some essentials needed for recovery, that could make a bug in such a tool easier to fix, but it could still cause effective downtime (system running but unusable) until such a fix is delivered.


The blog post says:

    > eBPF, which is immune to such crashes.
I tried to Google about this, but I cannot find anything definitive. It looks like you can still break things. Can an expert on eBPF please comment on this claim? This is the best that I could find: https://stackoverflow.com/questions/70403212/why-is-ebpf-sai...


eBPF programs cannot crash the kernel, assuming there are no bugs in the eBPF verifier. There have been such bugs in the past but they seem to be getting more and more rare.


Or in other parts of the kernel. It's been the case in multiple occasions that buggy locking (or more generalised, missing 'resource' release) has caused problems for perfectly safe BPF programs. For example, see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033398 and the fix https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...


This is actually exactly the bug I was thinking of, so fair point! (I work at PS now and am aware you worked on debugging it a while back).


This isn't really true. eBPF programs in Linux have access to a large set of helper functions written in plain C. https://lwn.net/Articles/856005/


I don't see how this contradicts what I said. Indeed, there are helpers, but the verifier is supposed to check that the eBPF program isn't calling them with invalid arguments.


I would be very hesitant to say "cannot" in a million-line C code base.


Yes, bugs in Linux are possible, so there might be some eBPF code that crashes the kernel. Just like bugs in Chrome are possible, so there might be some JavaScript that crashes the browser. Still, JavaScript is much safer than native code, because fixing the bugs in one implementation is a tractable problem, whereas fixing the bugs in all user code is not.


"These security agents will then be safe and unable to cause a Windows kernel crash."

Unless of course there is a bug in eBPF itself (https://access.redhat.com/solutions/7068083) @brendangregg and the kernel panics/BSoDs anyway, which you do mention later in the article.


This is true but the kernel gets more scrutiny and has better priorities. Only CrowdStrike audits and hardens the CS kernel driver, so things like proactive improvements are competing in a single Jira board against marketing’s request for new features (want to bet that was all AI until Friday?) whereas the kernel eBPF implementation might be improved by people at other security vendors, distributions like Red Hat or Ubuntu or a major cloud provider (all of whom fund serious security audits and have engineers who care a lot about robustness), or academic researchers.

“Many eyes” is a bit dubious in general but the Linux kernel is pretty much the best case for it being true.


The benefit of fixing that bug is that all eBPF programs benefit, versus every security vendor needing to ensure they write perfect C code.


> In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.

Assuming every security critical system will be on a recent enough kernel to support this...


I think with a LTS distribution you should get very far these days when it comes to implementing such sensors.


On rhel8 variants, you can use the Oracle UEK to get eBPF.

https://blogs.oracle.com/linux/post/oracle-linux-and-bpf

  $ cat /etc/redhat-release /etc/oracle-release /proc/version
  Red Hat Enterprise Linux release 8.10 (Ootpa)
  Oracle Linux Server release 8.10
  Linux version 5.15.0-203.146.5.1.el8uek.x86_64 (mockbuild@host-100-100-224-48) (gcc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9.2.0.1), GNU ld version 2.36.1-4.0.1.el8_6) #2 SMP Thu Feb 8 17:14:39 PST 2024


Considering the number of systems running very obsolete OSes these days: WinNT (4x or 3x), Windows, DOS, or various proprietary Unixen, stale Linux flavours, etc., etc., ... yes, quite.


And assuming there's no bugs in the BPF code...

Oh wait: https://news.ycombinator.com/item?id=41031699


A RHEL kernel.. right. Imho, I'd trust an upstream stable kernel far more for production than a RHEL one, which has dozens of feature backports and an internal kABI to maintain. Granted, RH has a QA team, but it is still impossible to test everything beforehand.


On the upside, non-root users can't insert eBPF code, so it's a privileged operation, unlike on other distros.


Isn’t it tied to CAP_BPF on every distro since the 5.8 kernel?

https://mdaverde.com/posts/cap-bpf/


RHEL 8 is based on 4.18, RHEL 9 on 5.14; I think it still has the same restriction (kernel.unprivileged_bpf_disabled).

I reckon Red Hat may duplicate upstream's behavior by RHEL 10.


Ok. But the good old push code to staging / canary it before mainstream updates was a simpler way of solving the same problem.

CrowdStrike knows the computers they're running on; it is trivial to implement a system where only a few designated computers download and install the update and report metrics before the update controller decides to push it to the next set.
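A minimal sketch of that kind of staged-rollout controller; the ring sizes, `deploy`, and `healthy` callbacks are invented for illustration:

```python
# Push the update to progressively larger rings of hosts, checking a
# health signal after each ring. A bad update halts after the first ring
# instead of reaching the whole fleet.
def staged_rollout(hosts, deploy, healthy, ring_sizes=(1, 10, 100)):
    done = 0
    for size in ring_sizes + (len(hosts),):  # final "ring" is everyone left
        ring = hosts[done:done + size]
        for host in ring:
            deploy(host)
        done += len(ring)
        if not all(healthy(h) for h in ring):
            return f"halted after {done} hosts"  # stop before a global outage
        if done >= len(hosts):
            break
    return f"rolled out to all {done} hosts"
```

With a fleet of thousands, a crashing update would be caught after the first one-host ring rather than taking everything down at once.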


It would mitigate the problem, but not solve it. You can still imagine a condition that only occurs after the update has been rolled out everywhere. Furthermore, such a bug would still be extremely problematic for the concerned customers, even if not all of them were affected. In addition, it would be necessary to react very quickly in the case of zero-day vulnerabilities.


Yes, I am not arguing against having the ability to deal with it quickly - I am saying canary/staging helps you do exactly that. Because, as we saw in the case of Intel CPUs and CrowdStrike, some problems, or the scale of some problems, are best prevented.


(semantic argument warning)

"Mitigation" is dealing with an outage/breakage after it occurs, to reduce the impact or get system healthy again.

You're talking about "prevention" which keeps it from happening at all.

Canarying is a generic approach to prevention, and should not be skipped.

Avoiding the risk entirely (eBPF) would also help prevent outage, but I think we're deluding ourselves to say it "solves" the problem once and for all; systems will still go down due to bad deploys.


with the way they handled the debian crashing a little while ago, frankly they are happy to still go ahead with testing this way. still much better way to handle things than pushing to everybody at the same time.


Why trust somebody else not to mess up? With that in place for Windows and CrowdStrike, billions of dollars would have been saved and many lives not negatively impacted...


The implicit assumption of the article is that eBPF code can't crash a kernel, but the article itself eventually admits that it can and has done, including last month. eBPF is a safer way of providing kernel-extension functionality, for sure, but presenting it as the perfect solution is just asking to have your argument dismissed. eBPF is not perfect. And there's plenty of things it can't do. The very sandbox rules that limit how long its programs may run and what they can do also make it entirely inappropriate for certain tasks. Let's please stop pretending there's a silver bullet.


It's not a silver bullet; however, it is still better to push all the panicable bugs into one community-maintained section (e.g. the eBPF verifier). All vendors have an incentive to help get it right, and this is much better than every vendor shipping their own panicable bugs in their own out-of-tree kernel modules. Additionally, it's not just industry looking at eBPF, but also academia, in terms of formally verifying these critical sections.


"Improves kernel stability" is great. "Prevents kernel crashes" is a plain lie.

> In the future, computers will not crash due to bad software updates, even those updates that involve kernel code.

Come on. Computers will continue to crash in the future, even when using eBPF. I am quite certain.


It's casually claiming to have solved the halting problem, at least within some limited but useful context. That should be impossible, and it turns out, it is.

I expect it can be solved within some limited contexts, but those contexts are not useful, at least not at the level of "generic kernel code".


It solves the halting problem by not being Turing complete. I presume each eBPF program runs in a context with bounded memory, requested up front, for one thing; it also disallows jumps unless the verifier can prove the code still halts.
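As a toy illustration of that restriction (this mimics classic BPF's no-backward-jumps rule, not the real verifier -- modern eBPF, since roughly Linux 5.3, can additionally prove bounds on real loops):

```python
# Tiny invented bytecode where JMP targets are absolute instruction
# indices. Rejecting any backward jump guarantees termination: the
# program counter can only move forward, so execution is bounded by
# the program length. No halting-problem magic required.
def verify(program):
    for pc, (op, arg) in enumerate(program):
        if op == "JMP" and arg <= pc:
            return False  # back-edge: potential infinite loop, reject
    return True

forward  = [("LOAD", 1), ("JMP", 3), ("LOAD", 2), ("RET", None)]  # accepted
backward = [("LOAD", 1), ("JMP", 0)]                              # rejected
```

The cost, as the thread notes elsewhere, is that legitimate loops had to be fully unrolled until the verifier learned to reason about bounded loops.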


eBPF started out as Berkeley Packet Filters. People wanted to be able to set up complex packet filters. Things like 'udp and src host 192.168.0.3 and udp[4:2]=0x0034 and udp[8:2]=0x0000 and udp[12]=0x01 and udp[18:2]=0x0001 and not src port 3956'

So BPF introduced a very limited bytecode, which is complex enough that it can express long filters with lots of and/or/brackets - but which is limited enough it's easy to check the program terminates and is crash-free. It's still quite limited - prior to ~2019, all loops had to be fully unrolled at compile time as the checker didn't support loops.

It turned out that, although limited, this worked pretty well for filtering packets - so later, when people wanted a way to filter all system calls they realised they could extend the battle-tested BPF system.

Nobody is claiming to have solved the halting problem.


Did you read the article? It says computers will not crash in the future due to updates. It literally says that in the very first line of the article.

> In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.

What you are claiming is completely different. A kind of "firewall" for syscalls. But updates to drivers and software must contain code and data. The author is not talking about updates to the firewall between drivers and the kernel, they talk about updating drivers themselves. It literally says "updates that involve kernel code". Will the kernel only consist of eBPF filtering bytecode? How could that possibly work?


"The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at and repair."

- Douglas Adams


I don't do any kernel stuff so I'm out of my element, but doesn't the fact that Crowdstrike & Linux kernel eBPF already caused kernel crashes[1] sort of downplay the rosiness of the state of things?

[1]: https://access.redhat.com/solutions/7068083


This is specifically addressed in the post you are replying to


Can you elaborate? What I see about Linux is that Crowdstrike was in the process of adopting eBPF which is ostensibly immune to kernel panics, but that issue shows their eBPF implementation specifically causing a kernel panic.


Yes, the elaboration is that the same link you posted is included in the article you're supposed to have just read.


I've read it three times now. The only thing they say about it is this:

"This doesn't mean that eBPF has solved nothing, substituting a vendor's bug for its own. Fixing these bugs in eBPF means fixing these bugs for all eBPF vendors, and more quickly improving the security of everyone."

Which is exactly what I'm asking about. If eBPF has some inherent advantage, why did it fail in precisely the same way already?


Let's suppose that eBPF solves this particular problem, eventually, for Windows. Doesn't sidestepping the entire class of CrowdStrike-style fubars require that Microsoft then mandate that no, backward compatibility will not be offered?

Back compat seems to be such a shibboleth in the Windows world, but comes at an incredible price. The reasons cited all seem to boil down to keeping some imagined customers' obscure LOB app running for decades. But that seems like an excuse to me. Surely Microsoft would like to shake out the last diehards running some VB5 app on a patched up PC in a factory. Isn't it more beneficial to everyone to start sunsetting acres of ancient NT code and approaches and streamline the entire attack surface?


Backwards compatibility slows things down in the Windows world but it doesn’t halt improvements. In this case, there are two powerful ratchets:

1. Compliance: everyone affected by this bug has auditors. Once safer alternatives are available, standards like CIS, PCI, etc. will be updated to say you should use the new interface, and every enterprise IT department will have pressure to switch to eBPF tools. We saw this with BitLocker: storage encryption used to be a pain, people resisted it, but over time it became universal because the cost of swimming upstream was too high.

2. Signing. Microsoft can start requiring more proof of need and restrictions for signing drivers. They have to be careful to avoid the appearance of favoritism but after this debacle that’s a LOT easier. I would bet some engineer is working on a draft of mandatory fault handling and testing proof requirements for critical kernel drivers now and I would not be surprised to see it include a timeframe for adopting memory-safe languages.


> Surely Microsoft would like to shake out the last diehards running some VB5 app on a patched up PC in a factory. Isn't it more beneficial to everyone to start sunsetting acres of ancient NT code and approaches and streamline the entire attack surface?

If your code somehow still relies on some buggy behaviour to work, then MS shouldn't do anything to preserve that anymore - apparently they used to, but I'm not so sure nowadays.

However 'ancient NT' code should probably still function just fine since the Win32 API hasn't changed much for a while, and MS don't actively deprecate function calls (unlike Apple who seem to do it a bit on a whim recently). I would put this down to the API being pretty well designed in the first place.


it would be enough if MS offered knobs and switches for admins/devs/vendors to disallow non-static-verified stuff in the kernel


So many problems though! Including commercial monocultures, lack of update consent, blast radius issues, etc etc. There's a commons in our pockets but that is very difficult to regulate for. They will keep putting the gun to your head until you keep choosing the monoculture.


Worrisome indeed that now the world knows how many users are affected by CrowdStrike, so the bad guys just need to poke deeper there.


WebAssembly is a better choice for sandboxing kernel code. It has a full formal specification with a mechanized proof of type safety, many high-performance implementations, broad toolchain support, targetability from many languages, and a capability security model.


Hardly. For starters, wasm doesn't guarantee that a piece of code terminates in bounded time. There are further security guarantees in eBPF, such as that any lock acquired must be released.


The eBPF termination checker is buggy anyway; you cannot rely on it.


You can apply additional static checks to Wasm, e.g. control flow analysis, and reject programs without obvious loop bounds or unbalanced locking operations. Or you could apply dynamic techniques like tracking acquired locks and automatically releasing them, or charging fuel (gas). The latter is quite common for blockchain runtimes based on Wasm.
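A minimal sketch of the fuel idea, with an invented instruction set (real Wasm runtimes such as Wasmtime offer instruction-level fuel metering along these lines):

```python
# Each instruction burns one unit of fuel; the interpreter traps when the
# budget is exhausted. Loops are allowed -- even an infinite one halts in
# bounded time, which is how gas-metered runtimes sidestep the halting
# problem without restricting the instruction set.
class OutOfFuel(Exception):
    pass

def run(program, fuel):
    pc, stack = 0, []
    while pc < len(program):
        if fuel <= 0:
            raise OutOfFuel("budget exhausted")
        fuel -= 1
        op, arg = program[pc]
        if op == "PUSH":
            stack.append(arg); pc += 1
        elif op == "ADD":
            b, a = stack.pop(), stack.pop(); stack.append(a + b); pc += 1
        elif op == "JMP":
            pc = arg  # backward jumps are fine; fuel bounds them
        else:
            raise ValueError(f"unknown op {op!r}")
    return stack

loop = [("PUSH", 1), ("JMP", 0)]  # would spin forever without fuel
```

The trade-off versus ahead-of-time verification is a per-instruction runtime cost instead of a load-time rejection.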


> The worst thing an eBPF program can do is to merely consume more resources than is desirable, such as CPU cycles and memory.

This is obviously not true. It might be the worst it can do, by itself, to the currently running kernel. It's not the worst it can do to the machine or its user(s).

There are infinite harmful things an eBPF program can do. As can programs solely in user-space. There is a specific class of vulnerabilities being mitigated by moving code from kernel to BPF. That does not mean that eBPF programs are in general safe.


Does anyone know how far along the eBPF implementation for Windows actually is? In the sense that it could start feasibly replacing existing kernel drivers.


Even if Microsoft rolls out eBPF and mainstreams it - it will be years before everything is ported over and it still won't address legacy windows versions (which appear to be a good chunk of what was impacted).

It's a move in the right direction but it probably won't fully mitigate issues like this for another 5+ years.


Sure, but 5 years is not that long ago - for example, if they’d started right before the pandemic it’d be almost done by now. The best time to have done that was 5 years ago but the second best time is now.


> an unprecedented example of the inherent dangers of kernel programming

I take issue with that. Kernel programming was not to blame; looking up addresses from a file and accessing those memory locations without any validation is. The same technique would yield the same result at any Ring.


Obviously in userspace it would only crash the running program and not the entire operating system? It's a significant difference.

All of the service interruptions would have been just "computer temporarily not protected by crowdstrike agent". Not the same thing at all.


> Obviously in userspace it would only crash the running program and not the entire operating system? It's a significant difference.

Significant and often far worse. It would leave the machine running unprotected.


> It's a significant difference.

When various apps running the world are crashing, unable to execute because malware protection is failing, there is no difference.


_No_ difference oversells it, IMO -- the fact that the entire OS crashed is what made fixing the bug so arduous, since it required in-person intervention. To be sure, running the code in userspace would still cause unacceptable service interruptions, but the fix could be applied remotely.


At Ring 3 it would crash an app, not the entire OS.

Yes, the kernel is fine and is not to blame. But running basically a rootkit controlled by a third party indeed is to blame.


> At Ring 3 it would crash an app, not the entire OS.

That's still an outage for those key systems.


It is an outage for the monitoring system, not the system that it monitors.


I think a reasonable protocol is to stop using any apps when your cyber protection crashes. Why have that suite at all otherwise?


I agree for some systems. For others, stopping the system has bigger consequences than not having cyber protection for a few hours because of a bug that’ll get fixed. For example, hospitals, or possibly Delta Airlines.


FWIW their configuration files can't be holding addresses; those have been randomised in the kernel for at least a decade


Can someone tell me what's the advantage of eBPF over a user mode driver? The article makes it look like eBPF is a have-your-cake-and-eat-it-too solution, which seems too good to be true. Can you run graphics drivers in eBPF, for example?


AFAIK, an eBPF function can only access memory it was handed as an argument or received as a result from a very limited number of kernel functions. Your function will not load if you don't have boundary checks. Fighting the eBPF validator is a bit like fighting Rust's borrow checker: annoying, at times too conservative, rejecting perfectly correct code, but it will protect you from panics. Loops will only be accepted if the validator can prove they'll end in time; this means it can be a pain to make the validator accept a loop. Also, eBPF is a processor-independent bytecode, so vectorizing code is not possible (unless the bytecode interpreter itself does it).

Given all its restrictions, I doubt something complex like a graphics driver would be possible. But then, I know nothing about graphics driver programming.


> Fighting the ebpf validator is a bit like fighting Rust's borrow checker

I think this undersells how annoying it is. There's a bit of an impedance mismatch. Typically you write code in C and compile it with clang to eBPF bytecode, which is then checked by the kernel's eBPF verifier. But in some cases clang is smart enough to optimize away bounds checks, but the eBPF verifier isn't smart enough to realize the bound checks aren't needed. This requires manual hacking to trick clang into not optimizing things in a way that will confuse the verifier, and sometimes you just can't get the C code to work and need to write things in eBPF bytecode by hand using inline assembly. All of these problems are massively compounded if you need to support several different kernel versions. At least with the Rust borrow checker there is a clearly defined set of rules you can follow.


This is the wiki. I haven't kept up, but this isn't a kernel module.

"eBPF is a technology that can run programs in a privileged context such as the operating system kernel. It is the successor to the Berkeley Packet Filter (BPF, with the "e" originally meaning "extended") filtering mechanism in Linux and is also used in non-networking parts of the Linux kernel as well."

https://en.wikipedia.org/wiki/EBPF


No, you can't run arbitrary general-purpose programs in eBPF, and you cannot run graphics drivers in it. You generally can't run programs with unprovably bounded loops in eBPF, and your program can interact with the kernel only through a small series of explicitly enumerated "helpers" (for any given type of eBPF program, you probably have about 20 of these in total).


> You generally can't run programs with unprovably bounded loops in eBPF

Surely that bars CrowdStrike's check for unprovably bounded vulnerabilities.


1. How does eBPF solve this? It makes it more difficult, sure, but it'll almost always be possible to cause a crash, if you try hard enough.

2. More importantly, the problem is rarely fixable by changing technology, because typically, problems are caused by people and their connections: social/corporate pressures, profit-seeking, mental health being treated as unimportant, et cetera. eBPF can't fix those, and as long as corporations have social structures that penalize thoroughness and caution, and incentivize getting 'the most stuff' done, this will persist as a problem.


> it'll almost always be possible to cause a crash, if you try hard enough.

If you think you know a way to crash the Linux kernel by loading and running an eBPF program, you should report a bug.


I don't buy it... didn't a bug from RedHat + Crowdstrike have a similar panic issue? I understand in that case it was because of RedHat, but still. I don't think this, by itself will change much.


eBPF == extended Berkeley Packet Filter

https://en.wikipedia.org/wiki/Berkeley_Packet_Filter


Thanks! This was not a familiar acronym to me... and after some digging[0] apparently it's no longer an acronym:

"BPF originally stood for Berkeley Packet Filter, but now that eBPF (extended BPF) can do so much more than packet filtering, the acronym no longer makes sense. eBPF is now considered a standalone term that doesn’t stand for anything."

[0] https://ebpf.io/what-is-ebpf/


So eBPF is giving us eBFP (enhanced Blue Friday Protection)?


> If your company is paying for commercial software that includes kernel drivers or kernel modules, you can make eBPF a requirement. It's possible for Linux today, and Windows soon. While some vendors have already proactively adopted eBPF (thank you), others might need a little encouragement from their paying customers.

How about Microsoft's large government and commercial customers make it a requirement that MS does not develop a single new feature for the next two fucking years or however long it takes to go through the entirety of the Windows+Office+Exchange code base and to make sure there are no security issues in there?

We don't need ads in the start menu, we don't need telemetry, we don't need desktop Outlook becoming a rotten slow and useless web app, we don't need AI, we certainly don't need Recall. We need an OS environment that doesn't need a Patch Tuesday where we have to check if the update doesn't break half the canary machines.

And while MS is at that they can also take the goddamn time and rework the entire configuration stack. I swear to god, it drives me nuts. There's stuff that's only accessible via the registry (and there is no comprehensive documentation showing exactly what any key in the registry can do - large parts of that are MS-internal!), there's stuff only accessible via GPO, there's stuff hidden in CPLs dating back to Windows 3.11, and there's stuff in Windows' newest UI/settings framework.


Here's an idea for an interesting hack: a piece of kernel resident code that feeds fake data into eBPF so that an eBPF-based antimalware will see nothing bad as the malware goes about its merry way.

Sandboxes are safe, but are ultimately virtual machines, and virtual machines can be made to live in a world that's not real.


Title reminds me of when microsoft promised no more UAEs back in 92. They just renamed them to GPFs in windows 3.1.


One option to prevent this is to not run corporate spyware. But I guess for some industries this isn't an option.


I don’t understand statements like this. You only need to have some employee install some malware (unintentionally or otherwise); and you have a data breach on your hands.


I agree it's much more scalable to have a vendor install a spyware on all your workstations and have a centralized data breach.


Oh, I wouldn't say that. Viruses are highly scalable.


It is great that we need a linux kernel feature to be ported to Windows so we don’t have blue Fridays


> If your company is paying for commercial software that includes kernel drivers or kernel modules, you can make eBPF a requirement.

Are they saying that device drivers should be written in eBPF?

Or maybe their drivers should expose an eBPF API?

I assume some driver code still needs to reside in the actual kernel.


These tools wouldn't need kernel drivers, only to target the eBPF userspace API: https://www.kernel.org/doc/html/latest/userspace-api/ebpf/in...


Hey Brendan,

> If your company is paying for commercial software that includes kernel drivers or kernel modules, you can make eBPF a requirement.

"Windows soon" may still be at least a year away. Would that be a fair statement? "At least" being the operative phrase here.

Specifically in the context of network security software, for eBPF programs to be portable across Windows/Linux, we would need MSFT to add a lot more hooks and expose internal kernel structs. Hopefully via a common libbpf definition. Otherwise, I fear, having two versions of the same product across two OSes would mean more security and quality issues.

I guess the point I am trying to make is, we would get there, but we are more than a few years away. I would love to see something like Cilium on vanilla Windows for a software-defined company-wide network. We can then start building enterprise network security into it. Baby steps!

---

btw, your talks and blog posts about bpftools are a godsend!


Yep, another fix to all our problems, a new bandwagon to be jumped on by all EDR vendors, until ...

Here I am using the term "EDR". Until this CrowdStrike debacle I'd never heard it.

Only tells how seriously you should take my opinions.


Meta:

> eBPF (no longer an acronym) […]

Any reason why the official acronym was done away with?


Because it used to stand for extended Berkeley Packet Filter and it has since moved far, far beyond just packets. It now hooks into the entire network stack, security, and does observability/tracing for nearly anything and everything in the kernel ("nearly" because some stuff runs when the kernel boots up--before eBPF is loaded--and never again after that).


Because eBPF is no longer just packet filtering? It's now used in loads of hook points unrelated to packets or filtering at all.


Technically it was never an acronym - rather an initialism or abbreviation.


So a couple of questions

1) Is CrowdStrike Falcon using eBPF for their Linux offering?

2) Would the faulty patch update get caught by the eBPF verifier?


> the company behind this outage was already in the process of adopting eBPF, which is immune to such crashes

Oh I'm sure they'll find a way.


This puts an awful lot of stock in the robustness of eBPF.

Which is odd, given there’s been a bunch of kernel privesc bugs using eBPF…


> In the future, computers will not crash due to bad software updates

I'm still waiting on my flying car...


The article mentions Windows and Linux. Does anyone know if there will be eBPF for FreeBSD?


How much extra security does this provide on top of HLK?


In the future, computers will not crash due to bad software updates, even those updates that involve kernel code.

100% BS. Even if they don't "crash" they will "stop functioning as intended" which is just the same. It's absolutely disgusting how this industry is now using this one outage as a talking point to further their totalitarian agenda.

It reminds me of how Google went after adblockers with their new extension model that also promised more "security". It's time we realised what they're really trying to do. In fact, I wonder whether this outage was not accidental after all.


First io_uring, now eBPF. Kind of wild.


Is there a reason for the lack of naming+shaming Crowdstrike in this blogpost? Was it to not give them any more publicity, good or bad?


If you consider kernel programming to be inherently unsafe, then you would consider this to be inevitable, meaning it's not really the specific company's fault. They were just the unlucky ones.


Right, and we wanted to talk about all security solutions and not make this about one company. We also wanted to avoid shaming since they have been seriously working on eBPF adoption, so in that regard they are at the forefront of doing the right thing.


They could have helped their luck by doing some of the common sense things suggested in the article.

For instance, why not find a subset of your customers that are low risk, push it out to them, and see what happens? Or perhaps have your own fleet of example installations to run things on first. None of which depends on any specific technology.


"find a subset of low risk customers" and use them as test subject?

Repeat that a few times to understand the repercussions.

If I were the customers and I found out that I was used as test subject, how would I feel?


> If I were the customers and I found out that I was used as test subject, how would I feel?

In reality, every business has relationships that it values more than others. If I wasn't paying a lot for it, and if I was running something that wasn't critical (like my side project) then why not? You can price according to what level of service you want to provide.


Customers will ask to opt-out.


Customers will pay to opt out.


Canary deployments are already an industry accepted practice and it’s shocking Crowdstrike apparently doesn’t do them.


Which industry? Cybersecurity or Cloud software?


Any industry that wants to reliably deliver software that doesn’t brick systems at scale? I’m confused by your question.

Are you telling me the cybersecurity scene is special and shouldn’t follow best practices for software deployment?


Canary deployment for a subset of Salesforce customers won't see much of a revolt from customers compared to AV definition rollout (not software, but AV definitions) in cybersecurity, where the gap between a 0day and the rollout means you're exposed.

If customers found out that some are getting the rollout faster than others, essentially splitting the group into two, there will be a need for customer opt-in/opt-out.

If everyone is opting-out because of Friday, your Canary deployment becomes meaningless.

Any proof that other Cybersecurity vendors do Canary deployment for their AV definition? :)

PS: not to say that the company should test more internally...


Canary deployment doesn’t necessarily mean massive gaps between deployment waves. You can fast-follow. Sure, there may be scenarios with especially severe vulnerabilities where time is of the essence. I’m out of the loop if this crowdstrike update was such a scenario where best practices for software deployment were worth bypassing.

If this is just how they roll with regular definition updates, then their deployment practices are garbage and this kind of large scale disaster was inevitable.


Let's walk this through: Canary deployment to Windows machines. If those Windows machines got hit with BSOD, they will go offline. How do you determine if they go offline because of Canary or because of regular maintenance by the customer's IT cycle?

You can guess, but you cannot be 100% sure.

What if the targeted canary deployments are Employees desktops that are OFFLINE during the time of rollout?

>I’m out of the loop if this crowdstrike update was such a scenario where best practices for software deployment were worth bypassing.

I did post a question: what about other Cybersecurity vendors? Do you think they do canary deployment on their AV definitions?

Here's more context to understand Cybersecurity: https://radixweb.com/blog/what-is-mean-time-to-detect

Cybersecurity companies participate in annual security evaluations that measure and grade their performance. That grade is an input for organizations selecting vendors, in addition to their own metrics/measurements.

I don't know if MTTD is included in the contract/SLA. If it is, you have some answer as to why certain decisions are made.

It's definitely interesting to see Software developers of HN giving out their 2c for a niche Cybersecurity industry.


> You can guess, but you cannot be 100% sure.

I worked in the cyber security space for a decent chunk of my career, and the most frustrating part was cyber security engineers thinking their problems were unique and being completely unaware of the lessons software engineering teams have already learned.

Yes, you need to tune your canary deployment groups to be large and diverse enough to give a reliable indicator of deployment failure, while still keeping them small enough that they achieve their purpose of limiting blast radius.

Again, if you follow industry best practices for software deployment, this is already something that should be considered. This is a relatively solved problem -- this is not new.

> I did post a question: what about other Cybersecurity vendors? Do you think they do canary deployment on their AV definitions?

I think that question is being asked right now by every company using Crowdstrike — what vendors are actually doing proper release engineering and how fast can we switch to them so that this never happens to us again?


>if you follow industry best practices for software deployment, this is already something that should be considered. This is a relatively solved problem -- this is not new.

You have to ask the customer if they're okay with that, citing "our software might fail and brick your machine".

I'd like to see any Sales and Marketing folks say that ;)

> I think that question is being asked right now by every company using Crowdstrike — what vendors are actually doing proper release engineering and how fast can we switch to them so that this never happens to us again?

Uber valid question and this BSOD incident might be a turning point for customers to pay up more for their IT infrastructure.

It's like: previously Cybersecurity vendors are shy to ask customers to setup Canary systems because that's just "one-more-thing-to-do". After BSOD: customers will smarten up and do it without being asked and to the point where they would ask Vendors to _support_ that type of deployment (unless they continue to be cheap and lazy).


> You have to ask the customer if they're okay with that citing "our software might failed and brick your machine".

I think you’re still missing the point of Canary deployments. The question your sales team should ask is “would you like a 5% chance of a bug harming your system, or a 100% chance?”

> It's like: previously Cybersecurity vendors are shy to ask customers to setup Canary systems because that's just "one-more-thing-to-do"

You should be shy, because it is not your customer's job to set up canary deployments. Crowdstrike owns the software and the deployment process. They should be deploying to a subset of machines, measuring the results, and deciding whether to roll forward or roll back. It is not the customer's job to implement good release engineering controls for Crowdstrike (although after this debacle you may well see customers try).


If you're referring to canary deployment as the vendor's internal deployment, then I definitely agree.

What I find hard to accept is people in software suggesting to roll it out to a few customers first, because a virus definition rollout isn't a cloud deployment doing an A/B test.

Customers must know what's going on when it comes to virus definitions, and the implications of being part of the rollout group or not.


> If you refer Canary deployment as the vendor's internal deployment? I definitely agree.

No, I’m talking about external deployment to customers. They clearly also had a massive failure in their internal processes too, since a bug this egregious should never make it to the release stage. But that is not what I am talking about right now.

> What I find it hard is those in Software that suggested to roll it to a few customers first because this isn't cloud deployment doing A/B test when it comes to Virus Definition.

I don’t care what you’re releasing to customers— application binary, configuration change, virus definition, etc, if it has the chance of doing this much damage it must be deployed in a controlled, phased way. You cannot 100% one-shot deploy any change that has the potential to boot-loop a massive amount of systems like this. This current process is unacceptable.

> Customers must know what's going on when it comes to virus definition and the implication of them whether they're being part of the rollout group or not.

Who says they don’t have to know? Telling your customers that an update is planned and giving them a time window for their update seems reasonable to me.


If it's virus defn, what's the process here?

* 0day is happening

* Cybersecurity vendors preparing virus definition

* Vendors send update => new virus definition is about to go down in 1 hour, get ready.

Folks are asleep, nobody reads it?

Now let's say we do canary: deploy to a few customers first (it's unclear how this would start: should it be opt-in? opt-out?)

Some customers got it, others... who knows, unclear what the processes are here.

Between here and there, 0day exploited customers because AV defn is not there. What now?

I'm not sure how this plays out tbh.


Why even do that? We have virtualization, they could emulate real clients and networks of clients. This particular bug would have been prevented for sure


Yeah I thought maybe the VM thing might not catch the bug for some reason, but it seems like the natural thing to do. Spin up VM, see if there's a crash. I heard the technical reason had something to do with a file being full of nulls, but that sort of thing you should catch.

Honestly, the most generous excuse I can think of is that CS were informed of some sort of vulnerability that would have profound consequences immediately, and that necessitated a YOLO push. But even that doesn't seem too likely.


Agree, Crowdstrike was an unlucky one, but this is more about the issue in general. If I remember correctly, others like sysdig also use their own kernel modules for collection.


I still hold true that testing even improperly would have caught this before it hit worldwide. But I suppose you are right, that doesn’t help the argument being made here.


Wasn't that the job of AI/co-pilot/clippy/DEP? "Would you like me to try and execute a random blank file?"

And of course QA.

I was unaffected, but was fielding calls from customers.

My update Tuesday is the week after, so in-between MS and my updates, I am very suspicious of everything.

I was also unaffected by 22H2, and spent time fielding calls.


I think the article isn't about crowd strike. It's about ebpf.


The second paragraph is 100% about Crowdstrike. It even links to the Wikipedia article:

https://en.m.wikipedia.org/wiki/2024_CrowdStrike_incident


CrowdStrike is mentioned, but the goal of the article is to promote eBPF. CrowdStrike is tangentially related because it draws attention to a platform that Gregg has put a lot into.


eBPF will be an improvement, I’m sure, but does not mean the end of bugs/DoS in software.


"The verifier is rigorous"

But the appeal-to-authority evidence that the article presents is not.

"-- the Linux implementation has over 20,000 lines of code -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington). The safety this provides is a key benefit of eBPF, along with heightened security and lower resource usage."


Sorry, but neither eBPF nor Rust nor formal verification nor ... is going to solve that problem. Repeat after me: there are no technical solutions to social problems. As long as the result of such an outage is basically a "oh, a software problem! shrug", _nothing_ will change.


I wonder if microkernels ever had this kind of bullshit. Had it been a microkernel, would we all be sitting twiddling our thumbs on friday? Hot take: No.


From the article:

> If the verifier finds any unsafe code, the program is rejected and not executed. The verifier is rigorous -- the Linux implementation has over 20,000 lines of code [0] -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington).

[0] links to https://github.com/torvalds/linux/blob/master/kernel/bpf/ver... which has this interesting comment at the top:

    /* bpf_check() is a static code analyzer that walks eBPF program
     * instruction by instruction and updates register/stack state.
     * All paths of conditional branches are analyzed until 'bpf_exit' insn.
     *
     * The first pass is depth-first-search to check that the program is a DAG.
     * It rejects the following programs:
     * - larger than BPF_MAXINSNS insns
     * - if loop is present (detected via back-edge)
    ...
I haven't inspected the code, but I thought that checking for infinite loops would imply solving the halting problem. Where's the catch?


I'm not able to comment on what this code is doing, but as for the theory:

The halting problem is only unsolvable in the general case. You cannot prove that any arbitrary piece of code will stop, but you can prove that specific types of code will stop and reject anything that you're unable to prove. The trivial case is "no jumps"—if your code executes strictly linearly and is itself finite then you know it will terminate. More advanced cases can also be proven, like a loop over a very specific bound, as long as you can place constraints on how the code can be structured.

As an example, take a look at Dafny, which places a lot of restrictions on loops [0], only allowing the subset that it can effectively analyze.

[0] https://ece.uwaterloo.ca/~agurfink/stqam/rise4fun-Dafny/#h25


Adding on (and it's not terribly relevant to eBPF), it's also worth noting that there are trivial programs you can prove DON'T halt.

A trivial example[1]:

    int main() {
        while (true) {}
        int x = foo();
        return x;
    }
This program trivially runs forever[2], and indeed many static code analyzers will point out that everything after the `while (true) {}` line is unreachable.

I feel like the halting problem is incredibly widely misunderstood to be similar to be about "ANY program" when it really talks about "ALL programs".

[1]: In C++, this is undefined behavior technically, but C and most other programming languages define the behavior of this (or equivalent) function.

[2]: Fun relevant xkcd: https://xkcd.com/1266/


EDIT: I am incorrect, please ignore. (Original text below, for posterity).

Nit: In many languages, doesn't this depend on what foo() does? e.g:

  foo() {
    exit(0);
  }


No? The foo() invocation is never reached because the while loop never terminates.


Apologies; I misread the function call as being inside the loop.


The halting problem cannot be solved in the general case, but in many cases you can prove that a program halts. eBPF only allows verifiably-halting programs to run.


the halting problem is only undecidable for _arbitrary_ programs

but there are always sets of programs for which it is clearly possible to guarantee their termination

e.g. the program `return 1+1;` is guaranteed to halt

e.g. a program like `while condition(&mut state) { ... }`, where `condition()` is guaranteed to halt but is otherwise unknown, is not guaranteed to halt; but if you turn it into `for _ in 0..1000 { if !condition(&mut state) { break; } ... }` then it is guaranteed to halt after at most 1000 iterations

or in other words, eBPF only accepts programs which it can prove will halt within at most BPF_MAXINSNS "instructions" (though it's stricter than my example, i.e. you would need to unroll the for-loop to make it pass validation)

the thing with provably halting programs is that they tend not to be very convenient to write and/or are quite limited in what you can do with them, i.e. they are not suitable as general-purpose programming languages at all


Infinite loops are not possible and would get rejected by the verifier since it cannot solve the halting problem. Here is a good overview on the options available: https://ebpf-docs.dylanreimerink.nl/linux/concepts/loops/


The basic logic flags any loop ("back-edge").


This. Others have said it less concisely, but a program without loops and arbitrary jumps is guaranteed to halt, if we assume the external functions it calls into will halt.


eBPF is not Turing complete. Writing it is very annoying compared to writing normal C code for exactly this reason.


I'm glad to hear that Meta and Google code is "rigorous". I'd prefer INRIA, universities that fund theorem provers, industries where correctness matters like aerospace or semiconductors.


Windows doesn't use the Linux eBPF verifier, they have their own implementation named PREVAIL[0] that is based on an abstract interpretation model that has formal small step semantics. The actual implementation isn't formally proven, however.

0: https://github.com/vbpf/ebpf-verifier


Also that lines of code is a proxy for rigor, something new I learned today. /s


I think they mean that the code base is small enough to be audited thoroughly. Maybe they should reword it to be clearer.


> I think they mean that the code base is small enough to be audited thoroughly.

They wouldn't say it was "over 20,000 lines" in that case. And 20,000 lines of C is far too big to audit.


The halting problem is exhaustive, there isn't an algorithm that is valid for all programs. You can still check for some kinds of infinite loops though!


More specifically, you can accept a set of programs that you are certain do halt, and reject all others, at the expense of rejecting some that will halt. As long as that set is large enough to be practical, the result can be useful. If you eg forbid code paths that jump "backwards", you can't really loop at all. Or require loops to be bounded by constants.


I have no insight into this particular project, but you could work around the halting problem by only allowing loops you can prove will not run forever. That would of course imply rejecting some loops that won't run forever but can't be proven not to.


If the verifier can't determine that the loop will halt, the program is disallowed. Also, if the program gets passed and then runs too long anyway, it's force-halted. So... I guess that solves the halting problem.


It's more accurate to say that in principle, there could be programs that would halt, but that the verifier will deny.


So this "solves" the halting problem by creating a new class "might-not-halt-but-not-sure" and lumping it with "does-not-halt". I find it hard to believe the new class is small enough for this to be useful, in the sense that it will avoid all kernel crashes.

I rather expect useful or needed code would be rejected due to "not-sure-it-halts", and then people will use some kind of exception or not use the verifier at all, and then we are back to square one.


Lots of useful code is rejected due to "not-sure-it-halts". That's the premise.


Well it is useful in practice, there are some pretty useful products based on eBPF on Linux, most notably Cilium (and, shameless plug for the one I’m working on: Parca, an eBPF-based CPU profiler).


Bad wording on my part, and I still don't know how to word it better. I'm sure this thing is useful, I don't think everyone who contributed code was just clueless.

However, the claim "in the future, computers will not crash due to bad software updates, even those updates that involve kernel code" must be false. There is no way it is true. Whatever Cilium is, I cannot believe it generally prevents kernel crashes.


Correct, you will never be able to write any possible arbitrary code and have it run in eBPF. It necessarily constrains the class of programs you can write. But the constrained set is still quite useful and probably includes the crowdstrike agent.

Also, although this isn't the case now, it's possible to imagine that the verifier could be relaxed to allow a Turing-complete subset of C that supports infinite loops while still rejecting sources of UB/crashes like dereferencing an invalid pointer. I suspect from reading this post that that is the future Mr. Gregg has in mind.

> Whatever Cilium is, I cannot believe it generally prevents kernel crashes.

It doesn't magically prevent all kernel crashes from unrelated code. But what we can say is that Cilium itself can't crash the kernel unless there are bugs in the eBPF verifier.


If the verifier allowed a Turing-complete language, it would solve the halting probem, which is impossible.


My point is that the verifier could be relaxed to accept programs that never halt, thus not needing to solve the halting problem. You could then have the kernel just kill it after running over a certain maximum amount of time.


Why do you think the kernel crashes when crowdstrike attempts to reference some unavailable address (or whatever it does) instead of just denying that operation and continuing on? That would be the solution using this philosophy "just kill long running program". And no need for eBPF or anything complicated. But it doesn't work that way in practice.

This is just such a naive view. "We can prevent programs from crashing by just taking care to stop them when they do bad things". Well, sure, that's why you have a kernel and userland. But it turns out, some things need to run in the kernel. Or "just deny permission". Then it turns out some programs need to run as admin. And so on.

There is a generality in the halting problem, and saying "we'll just kill long runing programs" just misses the point entirely.

Likely what will happen is that you will kill useful long-running programs, then an exception mechanism will be invented so some programs will not be killed, because they need to run longer, then one of those programs will go into an infinite loop despite all your mechanisms preventing it. Just like the crowdstrike driver managed to bring down the OS despite all the work that is supposed to prevent the entire computer crashing if a single program tries something stupid.


> Why do you think the kernel crashes when crowdstrike attempts to reference some unavailable address (or whatever it does) instead of just denying that operation and continuing on?

Linux and windows are completely monolithic kernels; the crowdstrike agent isn't running in a sandbox and has complete unfettered access to the entire kernel address space. There is no separate "the kernel" to detect when the agent does something wrong; once a kernel module is loaded, IT IS the kernel.

Lots of people have indeed realized this is undesirable and that there should be a sandboxed way to run kernel code such that bugs in it can't cause arbitrarily bad undefined behavior. Thus they invented eBPF. That's precisely what eBPF is.

I don't know whether it's literally true that someday you will be able to write all possibly useful kernel-mode code in eBPF. But the spirit of the claim is true: there's a huge amount of useful software that could be written in eBPF today on Linux instead of as kernel modules, and this includes crowdstrike. Thus Windows supporting eBPF, and crowdstrike choosing to use it, would have solved this problem. That set of software will increase as the eBPF verifier is enhanced to accept a wider variety of programs.

Just like you can write pretty much any useful program in JavaScript today -- a sandboxed language.

You're also correct that due to the halting problem, we'll either have to accept that eBPF will never be Turing complete, OR accept that some eBPF programs will never halt and deal with the issues in other ways. Just like Chrome's JavaScript engine has to do. I don't really view this as a fundamentally unsolvable issue with the nature of eBPF.


The claim isn't that eBPF generally prevents kernel crashes. It's that it prevents crashes in the subset of programs it's designed for, in particular for instrumentation, which Crowdstrike is (in this author's conception) an instance of.


I have quoted the claim verbatim from the article. It is obviously the claim of the article.


It's referring to Windows security software. If you have a lot of context with eBPF, which Gregg obviously does, the notion that eBPF will subsume the entire kernel doesn't even need to be said: you can't express arbitrary programs in eBPF. eBPF is safe because the verifier rejects the vast majority of valid programs.


eBPF is not Turing-complete, I suppose.


It is not; programs that are accepted are proved to terminate. Larger and more complex programs are accepted by BPF as of now, which might give the impression that it's now Turing complete, when that is definitely not the case.


In this talk we demo Conway's Game of Life implemented in eBPF: https://www.youtube.com/watch?v=tClsqnZMN6I


I should clarify that individual eBPF programs have to terminate, but more complex problems can be solved with multiple eBPF programs, and can be "scheduled" indefinitely using BPF timers


If you’re wrong about the loop, you’ll still hit BPF_MAXINSNS, so it’s fine to use heuristics that could produce a false negative right?


Unterminated loops might be a better phrasing.


I used to work for an EDR vendor and this post glosses over two major and important things.

1. There's no need for eBPF on Windows; it has the ETW framework (event tracing), which is much more powerful and gives applications subscribing to a class of events almost too-detailed insights. The issue most AV vendors have with it, though, is speed. Leading to …

2. eBPF lets you watch. Congrats. It's something, but it's not the reason these tools are deployed. Orgs deploy these tools to prevent or stop potentially bad stuff from executing. The only place this can be done in our operating systems is usually the kernel, and for that you need kernel-level drivers or various other filter drivers.

Crowdstrike screwed the pooch here, yes. But after a couple of days I feel like I haven't read enough blog posts and articles that crap on Microsoft. It's their job to build a secure operating system; instead they deliver Windows, and because they themselves cannot secure Windows, they ship Defender... and we use tools like Falcon as a bandaid for Microsoft's bad security practices.


> eBPF lets you watch. Congrats. It’s something, but it’s not the reason why these tools are deployed. Orgs deploy these tools to prevent or stop potentially bad stuff from executing

eBPF lets you prevent things too. seccomp filters can block syscalls.

The bigger problem is the performance you mentioned in 1. Crowdstrike's linux agent can work using eBPF instead of a kernel module, and will fall back to that if the current kernel version is more recent than the agent supports. But... then it uses up a lot more CPU.


Thank God some superheros have finally come along to make sure code never crashes any computers ever again! /s


Sometimes a point is such that only sarcasm can get it across sufficiently. This is such a point.



