I wrote the email that prompted this quite civil response. I'm very pleased with the outcome, because I think this clear statement of his position is a lot more useful for people to work with, rather than just assuming Linus hates security or something.
I interpreted his response in practical terms as essentially being the following. Patch set merge 1 has "report" as default and "kill" as a non-default option. Patch set merge 2 has "kill" as default and "report" as a non-default option. Patch set merge 3 removes support for "report". This way we have the best of both worlds: we eventually reach the thing that actually adds real security benefit, which makes security folks happy. And we don't break everybody's computers immediately, allowing time for the more obvious bugs to surface via "report", which makes users and developers happy. Seems like a reasonable process to me.
Though I think the time between PSM1 and PSM2 will be significant. Defaults are usually changed only once basically all distros have been compiling with the non-default option without widespread breakage. And once no supported LTS kernel still carries PSM1, you merge PSM3.
Might take years, but at least the airplanes keep flying instead of crashing their computers and consequently themselves.
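To make that concrete, here is a minimal userspace sketch of the report-first, kill-later staging described above; all names are made up for illustration, and in the kernel the mode knob would be a config option or boot parameter rather than a global variable:

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative only: in the kernel this knob would be a config
     * option or boot parameter, not a global variable. */
    enum mode { MODE_REPORT, MODE_KILL };
    static enum mode hardening_mode = MODE_REPORT;   /* PSM1: report by default */

    static void check_invariant(int ok, const char *what)
    {
        if (ok)
            return;
        fprintf(stderr, "hardening: invariant violated: %s\n", what); /* always report */
        if (hardening_mode == MODE_KILL)
            abort();    /* PSM2/PSM3: kill, once reporting has flushed out the easy bugs */
    }

    int main(void)
    {
        int refcount = -1;  /* pretend some reference count underflowed */
        check_invariant(refcount >= 0, "refcount went negative");
        puts("still running, because we are in report-only mode");
        return 0;
    }

The detection logic never changes between the patch sets; only the default reaction does, which is what makes flipping the default later cheap.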
I think the most important point here is that with those different patch sets, the more security-conscious users/companies get to put the properly hardened version into use immediately, rather than running the "only report" versions for what might be years, as you say.
Granted, I'm looking at this from a perspective where our company compiles its own kernel for use in embedded devices, so we can carry whatever patch sets we want. But I think that's much better than everyone having to use the "report only" patches for years, or even worse, the features never getting into the kernel in the first place.
The other important "users" are the developers and drive-by-developers of those user space processes which may accidentally trigger these bugs. These folks are highly likely to be able to resolve these issues if they are reported [in a way which is visible to them].
Counterpoint: most developers and users are not actively following their logs at any level, and maybe something should be done to make it more common. Possibly stderr logging of such errors in libc (or some other commonly used library which already sometimes logs errors on its own, like glib). 3:- )
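As a rough sketch of that idea (purely hypothetical; neither libc nor glib ships anything like this, and a straight SIGKILL from the kernel cannot be intercepted this way), a commonly linked library could install a last-gasp handler that makes the failure visible on stderr before the process dies:

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical: write an async-signal-safe note to stderr, then
     * re-raise the signal with its default (fatal) action. */
    static void report_and_die(int sig)
    {
        static const char msg[] =
            "library: process is dying on a protection fault; "
            "please report this to the application's developers\n";
        write(STDERR_FILENO, msg, sizeof(msg) - 1);   /* write() is signal-safe */
        signal(sig, SIG_DFL);
        raise(sig);
    }

    __attribute__((constructor))
    static void install_reporter(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = report_and_die;
        sigaction(SIGSEGV, &sa, NULL);
        sigaction(SIGSYS, &sa, NULL);
    }

    int main(void)
    {
        volatile int *p = NULL;
        return *p;                  /* deliberate fault, just to show the report */
    }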
Is that really a position Linus has maintained for a long time? Because I got the feeling Linus really just hated anything to do with security.
It's only in more recent years, when the automotive and IoT industries have started to get involved in the Linux Foundation and have asked for more security features, that he seems to have tried to find ways to "compromise" with security people.
You're going to get a very different perspective depending on whether you actually lurk on lkml or just read when something gets linked by social media or "news" that needs to wrap it in a hot take to make lookie-loos interested in lkml.
Where he's quite civil and explains quite clearly why he won't accept the patch, and what would need to happen for the patch to be accepted. Kees's reply insisting on the merge is what led to the profanity-laden email, and frankly I understand that (not condone, but understand from a human-reaction point of view) - how many times does one have to reiterate one's viewpoint to make oneself heard?
> IT IS NOT ACCEPTABLE when security people set magical new rules, and then make the kernel panic when those new rules are violated.
> That is pure and utter bullshit. We've had more than a quarter century _without_ those rules, you don't then suddenly walz in and say "oh, everbody must do this, and if you haven't, we will kill the kernel".
> The fact that you "introduced the fallback mode" late in that series just shows HOW INCREDIBLY BROKEN the series started out.
This makes perfect sense and outlines a real problem with security patching in general.
Sadly this kind of magical security thinking has many proponents higher up in the Linux stack, and they have the backing/support of GKH. Thus I worry about what will happen the day Linus gives up the reins.
Background: the "Kernel Self Protection Project" (KSPP) recently upstreamed the Grsecurity/PAX reference counting implementation, which prevents a certain class of security bugs from being exploited.
Grsecurity is a security hardening patchset for Linux that makes deliberate trade-offs in favor of security, sacrificing availability if necessary. This, aside from the political issue, is the main reason why it's hard to upstream. Linus has called some of their mitigations "insane" before precisely for that reason. Grsecurity will rather terminate userland programs or, in some rare cases, panic the kernel if it finds itself in an undefined state. This is exactly what you want if you care about security, but it's not a trade-off everyone is happy with (including Linus).
Unfortunately, Grsecurity/PAX is not (and probably won't ever be) involved in the KSPP project, and the KSPP developers do not understand the code nearly as well as the Grsecurity team does. This led to a situation where the new code caused a crash that they weren't able to fix in time, so they disabled the feature at the last minute.
I used Grsecurity for years until they stopped making it publicly available, and I remember many bugs that were uncovered by PAX_REFCOUNT and yes, occasionally panicked the kernel where a vanilla kernel would run just fine. They usually found and fixed those within hours.
Grsecurity/PAX have invented many of the modern exploitation mitigations, probably second to none. Some have even been implemented in hardware. Their expertise in building modern defenses is astonishing (their latest invention, the control flow integrity mechanism RAP, is a work of art).
Linux could be the most secure kernel; instead, it has fallen way behind Windows, which has much better defenses than Linux nowadays thanks to Microsoft's ongoing battle with rootkit writers. Go figure.
If the large companies who use Linux really want to improve kernel security, they need to work with Grsecurity and not against them. It's beyond me how this isn't happening already.
>> This is exactly what you want if you care about security, but it's not a trade-off everyone is happy with (including Linus).
Let's rephrase that. This is exactly what you want if you care only about security, or care about security above everything else - including your system running at all. People run software for reasons, and they need it to keep running for those reasons. The security folks are not really qualified to evaluate the security risks against all the reasons for all the people running Linux.
> From a security standpoint, when you find an invalid access, and you mitigate it, you've done a great job, and your hardening was successful and you're done. "Look ma, it's not a security issue any more", and you can basically ignore it as "just another bug" that is now in a class that is no longer your problem.

> So to you, the big win is when the access is _stopped_. That's the end of the story from a security standpoint - at least if you are one of those bad security people who don't care about anything else.

> But from a developer standpoint, things _really_ are not done. Not even close. From a developer standpoint, the bad access was just a symptom, and it needs to be reported, and debugged, and fixed, so that the bug actually gets corrected.

> As a developer, I do want the report. But if you killed the user program in the process, I'm actually _less_ likely to get the report, because the latent access was most likely in some really rare and nasty case, or we would have found it already.
Fixing all memory corruption bugs is infeasible without fundamentally changing the way Linux is developed. There is simply too much code, constantly being added to and changed, written by humans who make mistakes.
There will always be some bugs that sit between being discovered (by someone, maybe malicious, maybe not) and being fixed. How else do you protect against vulnerabilities in that window?
Linus' response is that calling it an infeasible problem is a cop-out. The right way to go about it is to fix them all, incrementally if need be, and not break userland in the process.
These comments sound analogous to real-world security and societal issues - like the choice between increasing the size of the army and addressing the underlying issues.
One is a short term solution, the other long term.
>If the large companies who use Linux really want to improve kernel security, they need to work with Grsecurity and not against them. It's beyond me how this isn't happening already.
It's more that Grsecurity is working against everyone else. They want to pretend the GPL works in a way that it doesn't so that they can sell their patches. Then they make threats to people who say "that's not how the GPL works" and distribute their patches in accordance with how the GPL actually works. That's not how kernel development is done. I'd rather have an insecure kernel than their bullshit.
Bruce Perens, however, is entitled to not be the victim of Spender's legal harassment for exercising his First Amendment rights to disagree [1].
I have no dog in this fight whatsoever; I don't know anybody involved. But in general I have little sympathy for people, however talented, who waste taxpayers' money with bogus legal action.
No, it's not. The GPL on the Linux kernel means that grsec can't distribute a new Linux kernel with their patches while withholding code. That's not what they're doing. If I write a Linux kernel patch on a consulting project, I am absolutely not required to publish it.
> If I write a Linux kernel patch on a consulting project, I am absolutely not required to publish it.
That would be a work for hire, and it is not the same thing as developing patches independently and distributing them with extra terms, because there is no distribution involved.
> The GPL on the Linux kernel means that grsec can't distribute a new Linux kernel with their patches while withholding code. That's not what they're doing.
That would be true if the patches were not derivative works of the Linux kernel in a legal sense. I'm no lawyer, but that seems contrary to the plain meaning of "derivative work".
Not necessarily, in fact perhaps not even usually.
1. The consulting contract might or might not provide for the client to own any work product created. Many such contracts provide that the client will own only the specific end product, while the consultant retains ownership of any reusable "Toolkit Items."
But what if the contract is silent about ownership of consulting work product?
2. As to copyright: Under U.S. copyright law, the default mode is that IF: An original work of authorship is created outside an employer-employee relationship, THEN: The copyright is owned by the individual author (or jointly by multiple co-authors) UNLESS: A) the work of authorship falls into one of nine specific statutory categories, and B) the parties have expressly agreed in writing, before the work was created, that it would be a work made for hire. [0] [1]
3. Any patentable inventions would be owned by the inventor(s) unless they were employees who were "hired to invent" or "set to experimenting," in which case the inventions would be owned by the employer; so far as I recall, this doesn't apply in the case of outside-contractor consulting projects — the client would not own any resulting inventions unless the contract specifically said otherwise. [2]
[1] "A 'work made for hire' is—(1) a work prepared by an employee within the scope of his or her employment; or (2) a work specially ordered or commissioned [A] for use as a contribution to a collective work, [B] as a part of a motion picture or other audiovisual work, [C] as a translation, [D] as a supplementary work, [E] as a compilation, [F] as an instructional text, [G] as a test, [H] as answer material for a test, or [I] as an atlas, if the parties expressly agree in a written instrument signed by them that the work shall be considered a work made for hire. [¶] For the purpose of the foregoing sentence, a 'supplementary work' is a work prepared for publication as a secondary adjunct to a work by another author for the purpose of introducing, concluding, illustrating, explaining, revising, commenting upon, or assisting in the use of the other work, such as forewords, afterwords, pictorial illustrations, maps, charts, tables, editorial notes, musical arrangements, answer material for tests, bibliographies, appendixes, and indexes, and an 'instructional text' is a literary, pictorial, or graphic work prepared for publication and with the purpose of use in systematic instructional activities." From https://www.law.cornell.edu/uscode/text/17/101
> Grsecurity will rather terminate userland programs or, in some rare cases, panic the kernel if it finds itself in an undefined state. This is exactly what you want if you care about security, but it's not a trade-off everyone is happy with (including Linus).
I'd also like my kernel to halt whenever an assertion does not hold, for the sake of keeping my sanity; not just for security.
For the same reason people drive with their “check engine” light on: It’s frequently better to have a working system (i.e. “I’m late for work”), than to chase an indicator that may not represent a real problem (an actual security intrusion).
I can't think of a single useful piece of software nowadays that is exposed to the public and can't run in an active-active, load-balanced or clustered scenario. If your kernel/system/userland app misbehaves it simply needs to be shut down, reported and examined.
It might have been some random memory block the last time your app hit a buffer overflow, but it could just as well be the stack pointer next time...
Remember, we're not necessarily just talking about servers here; every single hospital has mission-critical client machines that cannot go down, and obviously those aren't load balanced or clustered. (Though mostly they seem to be running Windows.)
So what happens when your browser crashes? I experience that on a regular basis. I'd rather have my browser crash/be killed instead of slowly overwriting my filesystem buffers or corrupting my stack pointer...
Other than that, browsers are multi-thread/multi-process applications. Usually only a single tab or a plugin crashes unless the core browser process is affected. Most users would accept the trade off between crashed browser and infected/corrupted system.
> Most users would accept the trade off between crashed browser and infected/corrupted system.
Most users are using computing devices as a means of getting stuff done. They don't want to spend any energy thinking about how their software works; they want their devices to be invisible things they use to run their Apps uninterrupted. The trade-off is whether to let Apps continue running vs hard crashing and taking down all the work they've done and all the mental energy and focus invested up to that point. If their Apps frequently crash, most users aren't thinking, well, I'm super glad the hours I spent on this paper I'm working on are now lost, or that the phone calls to my loved ones or the movie I'm watching are abruptly terminated, because someone's policy of hard crashing when a bug is found has been triggered. Their preferences and purchasing power are going to go towards non user-hostile devices they perceive provide the best experience for using their preferred Apps, without any need for pre-requisite knowledge of OS internals.
There's not a single computing device that frequently crashes as a result of security hardening that will be able to retain any meaningful marketshare. Users are never going to tolerate anything that requires extraneous effort on their part to research and manually apply what needs to be done to get their device running without crashing.
Apps are supposed to keep their state either by saving your work regularly to persistent media or by keeping your data off-client. We're living in the 21st century, in a cloud era, FFS.
Keeping your app running after integrity corruption has happened within the application puts user data at risk. IMHO an application that corrupts the save of your 3-day-long presentation is more frustrating to every user than one that crashes due to an error, leaving you with 5 minutes of unsaved changes lost.
Microsoft have invented "Application Recovery and Restart" exactly for this purpose.
> Keeping your app running after integrity corruption has happened within the application puts user data at risk.
If user data is continually backed up to a remote site, it's not going to be at risk from a local bug, is it? Bugs exist in all software. Users are going to be more visibly frustrated by their Apps frequently crashing than by the extremely unlikely scenario where a detected bug corrupts their "3-day-long presentation". They're going to be very unhappy if the cause of their frequent data loss was a user-hostile setting to hard crash on the first detectable bug.
> Microsoft have invented "Application Recovery and Restart" exactly for this purpose.
From Microsoft website:
> An application can use Application Recovery and Restart (ARR) to save data and state information before the application exits due to an unhandled exception or when the application stops responding.
i.e. restarting Apps due to an "unhandled exception or when the application stops responding", in which case the App is in an unusable state and ARR kicks in to try to auto-recover it with minimal user disruption. The focus is on providing a good UX, not a miserable crash-prone experience where users use their devices in fear that at any time anything they're working on can be terminated abruptly without warning.
You clearly have a limited view of application bugs. Let me elaborate a bit on bugs that cause dissatisfaction and UX frustration without crashing - much, much worse than a simple error message along the lines of: "OS has terminated application X because it has performed an illegal operation."
Data corruption - reading or writing corrupted data - files cannot be read, saved files get corrupted, API calls from/to external applications/systems fail or pass incorrect data
Rendering problems - corrupted images, incorrect colors, improper content encoding, visual stuttering, audio distortion, audio skipping
Input/output lags - unregistered keystrokes, missed actions and responses to external events, mouse stuttering and misbehavior
Improper operation - inconsistent results - repeated rendering yields different results (HTML), formula/calculation results in data are inconsistent (Excel, DWH)
Access violation - access gained to invalid or protected areas - unprivileged access, license violations, access to areas protected by AAA, data theft (SQL injection, database dumps)
and others. If I figured out that the application I'm using (a web browser) allowed a hacker to steal data he would not otherwise have access to, I would be more pissed off than if it crashed and I found an error about it in the system log.
The standard Windows user will not read the system log.
Some people just use computers to do stuff; to them there is little difference between "I lost my work because of a bug" and "I lost my work because of a security policy". From a UX point of view both are the developer's fault for releasing inadequate software.
Meta: Who flagged this comment, and why? What rule exactly did Slavius break here?
On topic: I can't recall the details now, but I once read a paper about a system which had no shutdown procedure at all; the only way to exit it was to crash it somehow or just shut down the computer. The system made sure to save everything often enough and to store the data in ways which allowed restoring possibly corrupted parts of it on the next startup. This design produced a very resilient architecture which worked well for that use case.
The paper was from the '80s or '90s, so it's not like we need to be in the 21st century to design that way. I'll try searching for the paper later.
It's similar in effect, but Erlang's ultimate response to errors is redundancy instead of trying to salvage whatever was left by the process that crashed. I think the transparent distribution of Erlang nodes over the network is what enables Erlang's "let it crash and forget it ever ran" approach. Joe Armstrong said that they want Erlang to handle all kinds of problems, up to and including "being hit by lightning" - so I think hardware redundancy is the right path here.
The OS[1] I've been talking about was primarily concerned with a single-machine environment, which resulted in slightly different design.
That is very unlikely. Crashing would happen 100% of the time though. Most people want that trade-off (meaning: if their browser would crash, they would switch to another one, even if it was less secure).
Corrupting SP is part of almost every exploit and I can guarantee you that it is very likely (going to cause harm on your system).
Try pulling the Metasploit Git repo to get some idea of the thousands of payloads that do corrupt the SP without crashing the host...
Say there's a minor error in a network driver. Yes, it might be exploitable by a smart person. But the error only triggers once a day when a counter rolls over.
Do you really want your box to lock up and panic when this error is encountered, or do you just want your box to keep working?
I'm firmly in the first camp (I'll take lock up and freeze, thanks) but 99% of users don't care about a bug like that and just want the box to keep working.
But do you want your box to silently send corrupted data for the next two years? Or would you rather reboot every night, and maybe escalate to your Red Hat support contract, where someone will then fix the underlying bug (for which you now have crashdumps)?
That Red Hat support contract won't save you from a bug in a binary-blob network driver.
Crashing the whole kernel at the drop of a hat seems like a pretty extreme stance to take as a general policy IMHO. Killing and restarting the driver will usually suffice, although some data may be lost and have to be retransmitted.
As so often, it really depends. Let's say you've just detected that you're going to send incorrect data because you've ended up in an indeterminate state.
If the remote end is going to ignore that data anyways, would it really be such a bad idea to keep running? Do you really want to go down in order to ensure that a remote who's ignoring your data can get correct data to ignore?
Of course you never know what sort of effect the corrupt data is going to have, so it's always hard to make that decision.
Like that issue with libraries linking against Objective-C frameworks that cropped up with High Sierra and broke most of the Ruby world: yes, the usage was incorrect. Yes, forking after threads are launched leads to undefined behavior; yes, knowing about it is a good thing.
But: So far the crashes have been rare in the common use-cases (or they would have been fixed), so High Sierra's change to blow up loudly when it detects the misuse has actually caused a lot of trouble for people where things worked fine before.
To the point where many Ruby developers were complaining about High Sierra "breaking" their workflow and recommending against upgrading.
The new check is totally justified though. The existing forking behavior was wrong and it could have led to crashes down the line. It didn't though. And now people are forced to fix something that was never an issue to begin with.
It's a fine line to walk, and while I generally prefer things to blow up as they go wrong, sometimes I catch myself wishing for stuff to just keep working.
On some self-reflection, I come to the conclusion that I want to have my cake and eat it too.
The rule I work to when I design these types of systems is that if the source of the error is internal, you should reset and avoid propagating the error.
Conversely, if you receive an error from an external source, you should handle it gracefully and reject the bad message.
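A minimal sketch of that rule in C, with made-up names: external input gets validated and rejected gracefully, while a violated internal invariant stops the component so the error does not propagate:

    #include <assert.h>
    #include <errno.h>
    #include <stdio.h>

    struct msg {
        unsigned len;
        unsigned char payload[256];
    };

    static unsigned processed;                    /* internal bookkeeping */

    int handle_external_msg(const struct msg *m)
    {
        /* External source: handle it gracefully, reject the bad message. */
        if (m == NULL || m->len > sizeof(m->payload)) {
            fprintf(stderr, "rejecting malformed message\n");
            return -EINVAL;
        }

        processed++;
        /* Internal invariant: if our own counter ever wraps, that's our bug;
         * stop here (reset) instead of propagating a corrupted value. */
        assert(processed != 0);
        return 0;
    }

    int main(void)
    {
        struct msg bad  = { .len = 9999 };
        struct msg good = { .len = 4 };
        handle_external_msg(&bad);            /* rejected, program keeps running */
        return handle_external_msg(&good);    /* accepted */
    }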
On the other hand, the mere act of panicking may corrupt data (by virtue of stopping processes). I learned this the hard way when my kernel panicked while I was shrinking a large ext4 volume (the panic was unrelated to the shrinking). It's not just a simple equation like you've claimed.
A panic should stop the processor dead; no data should be corrupted as a result.
Data in flight should not be used if you use transactional I/O and therefore will not be used if a write does not complete.
The linux kernel needs to work with both tolerant and non-tolerant systems. Saying it needs to work a specific way that completely breaks real world things is completely naive, and exactly what Linus was railing against.
Well, only that most devices were never tested over months on a development kernel... and it is not possible to do so, with all those millions of different devices around.
I was writing paper, on a PC, that was like "pip pip pip pip pip" and then... like half of my paper was gone.. and I was like... It devoured my paper. It was really good paper. And then I had to write it again and had to do it fast so it wasn’t as good. It’s kind of... a bummer.
First things first: Kernels panic and processes crash. If your medical equipment or telco/ISP system can't recover from that then you're in trouble anyway. Why they crash doesn't really matter in that context.
As far as voting machines go, kernel panic sounds waaay better than executing malicious code.
> Imagine a security f*ck up, like Heartbleed, but this time with an option to halt kernels / systems.
IIRC heartbleed didn't allow you to execute code (it allowed you to read more memory than you should have been able to). A better example is every flash player bug ever. Would you rather that thing crashes or executes malicious code? Keep in mind that the malicious code can also shut down your system.
Also keep in mind that we're talking about userspace programs right now. This thread is about kernel bugs. Userspace programs already have the option to ask the kernel to kill them if they misbehave. A lot of them do that (using features like seccomp filter) and many more should. (Chrome and Firefox both use seccomp filter I think.)
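For a concrete (if deliberately minimal) example of that opt-in: strict seccomp mode only permits read/write/_exit/sigreturn, and any other syscall gets the process SIGKILLed. Real sandboxes like Chrome's use the far more flexible SECCOMP_MODE_FILTER, so treat this only as the smallest runnable demonstration of a process asking the kernel to kill it if it misbehaves:

    #include <fcntl.h>
    #include <linux/seccomp.h>
    #include <sys/prctl.h>
    #include <unistd.h>

    int main(void)
    {
        /* From here on, only read/write/_exit/sigreturn are allowed;
         * anything else means immediate SIGKILL. */
        prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

        static const char ok[] = "write() is still allowed\n";
        write(STDOUT_FILENO, ok, sizeof(ok) - 1);

        open("/dev/null", O_RDONLY);              /* not on the whitelist: killed here */
        write(STDOUT_FILENO, "never reached\n", 14);
        return 0;
    }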
Let's say somebody gives you a USB stick and you plug it into your laptop. Which of the following scenarios would you like to see?
1. 0-day in the kernel's USB code. You're part of stuxnet now.
2. 0-day in the kernel's USB code. You're part of stuxnet now. You also get a message that tells you how and where to report the bug that was exploited.
3. 0-day in the kernel's USB code. Your computer crashes. You're not part of stuxnet. You also get a message that tells you how and where to report the bug that was exploited.
Linus is wrong (it happens). Exploit mitigation techniques aren't debugging tools. They're exploit mitigation techniques. The fact that they also produce useful debugging information is secondary.
This is exactly the kind of thinking Linus is talking about. In 3, I lost my work. Possibly very important work. To most people, being a part of stuxnet, while undesirable, is preferable to losing their work.
And you neglected a scenario 4: nobody is attempting to compromise my machine, but a buggy bit of USB code just crashed my system and took all my work with it.
You never learned at school to save what you're working on often? It's crazy; you are either too old to have needed a computer for school work or too young to have lived through years of constant bluescreens.
Both 3 and 4 are mitigated by you saving your document often... it's not so bad, considering it can happen whatever you do.
Nowadays, Word is made to keep saving your changes for that reason... They learned, and designed it for the worst situation, which is a whole-system crash. If you can't handle that, well, you aren't doing your work well.
Again, passing the buck. You know what I do instead of use your software that crashes all the goddamn time? I use someone else's software that doesn't.
Yes, of course we should save often, have decent backups, etc. But nobody is perfect and shit happens, and it'd be nice if the software you use didn't intentionally make it worse.
The problem is, what actually happened (in a previous commit) was:
The IPv6 stack does a perfectly sensible and legal thing. The hardener code misunderstands the legal code, and causes a reboot.
That is what Linus is worried about -- often it is hard to tell the difference between "naughty" code which can never be a security hole, and genuine security holes.
They should all be fixed ASAP, but making code that previously worked make a user's computer reboot, when it is perfectly fine, is not a way to make friends.
Bugs in the hardening code are obviously bad and annoying, but that's beside the point. All bugs are bad and annoying, especially ones that cause a kernel panic. I don't think anybody is going to argue with that.
That's not what Linus said though. What he said is:
> when adding hardening features, the first step should *ALWAYS* be
> "just report it". Not killing things, not even stopping the access.
> Report it. Nothing else.
and:
> All I need is that the whole "let's kill processes" mentality goes
> away, and that people acknowledge that the first step is always "just
> report".
"Not killing things, not even stopping the access." Oh boy.
Step back a bit: when developing a new SELinux policy, won't you develop it first in permissive mode, and only after it's working without warnings enable enforcing mode? It's the same thing here: the hardening should be developed first in a "permissive" mode which only warns, and then, after it's shown to be working without warnings, changed to "enforcing" (in this case, however, after some time the "permissive" mode can be removed, since new code should be written with that hardening in mind).
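One practical detail of such a "permissive" mode, sketched below with made-up names: report the violation loudly but only once per call site (in the spirit of the kernel's WARN_ON_ONCE-style helpers), so the warning reaches developers without flooding the logs:

    #include <stdio.h>

    /* Report a violated rule loudly, but only once per call site, so the
     * log stays readable while the warning still reaches developers. */
    #define REPORT_ONCE(cond, msg)                                   \
        do {                                                         \
            static int warned;                                       \
            if ((cond) && !warned) {                                 \
                warned = 1;                                          \
                fprintf(stderr, "hardening: %s (%s:%d)\n",           \
                        (msg), __FILE__, __LINE__);                  \
            }                                                        \
        } while (0)

    int main(void)
    {
        for (int i = 0; i < 5; i++)
            REPORT_ONCE(1, "rule violated in permissive mode");  /* prints once, not five times */
        return 0;
    }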
I didn't mean that to sound like I'm in favor of turning the thing on right away.
(Also, the quotes I chose don't really help me make my case but I don't want to edit now since you've already commented on it. His first mail is way worse: https://lkml.org/lkml/2017/11/17/767)
Basically what I'm disagreeing with is that exploit mitigation's primary purpose is finding and fixing bugs. That's just not true. Its primary purpose is to protect users from exploitable bugs that we haven't found yet (but someone else might have).
By first step, Linus just means "for a year or two". Yes it would be nice to put super high security on today, but instead we slowly turn up the setting, from opt in to opt out to forced on, to ensure we don't break anything.
4. 0-day in the kernel's USB code. Your computer crashes. You're not part of stuxnet. You also get a message that tells you how and where to report the bug that was exploited, but the part of your computer that was supposed to log the message died with the rest of the system, so you never see it and the bug never actually gets reported. Your computer continues to crash randomly for the next few days as an infected computer keeps trying to spread.
Hooold it. Some of those things are not like the others.
--
I pity the engineers working on ventilation machines and the like. Medical devices are insanely hard to get right; that's neck and neck with aviation testing. I'm reminded of SQLite3's "aviation-grade" TH3 testsuite, which apparently has 100% code coverage. Let's be honest; Linux's monolithic design can't really attain that.
I would never use Linux for a medical device. I say this as someone who just happens to only be running Linux on every machine in the house right now (and I have for years, it's just how things have worked out, it's not at all novel or whatever, my point is that I'm totally comfortable with it). I'd use L4 or something instead. In a pinch I'd use a commercial kernel with tons of testing. Maybe I'd even use Minix; I'm quite sure a lot of people in industry are seriously looking at it now Intel have pretty much unofficially greenlit it as a good kernel (lmao).
--
Voting machines, on the other hand; I'd totally use Linux for that, because the security/usage model is worlds apart. Here, I WOULD ABSOLUTELY LIKE FOR THE TINIEST GLITCH TO CRASH THE MACHINE, because that glitch could be malware trying to get in.
The user experience of a voting machine is such that you walk up to it, identify yourself, and push a button. Worst case scenario in this situation is that you do some involved process to ID yourself and then the unit locks up, so you have to redo the ID effort on another unit. That is, for all use cases, not going to be a problem.
(I think that's the first time I've used all caps in years!)
--
Telecom systems... those are also a totally different world. See also: Erlang. In this situation you would likely want a vulnerability to literally sound a klaxon on a wall, but have the system still keep going.
I'm reminded here of an incident where a country's national 3G system was compromised (not the US, somewhere else) by hackers and the firmware of the backend systems was hot-patched (think replacing running binary code - the OS allowed it, it was REALLY hard to even notice this was happening) to exfiltrate SMS messages and cause calls to certain numbers to generate a shadow call (which ignored mic input) to an attacker-controlled number as well.
Telecoms is a classic case of massive scale; nowadays a single telecom switch might be routing thousands of calls through at a time. Yeah you don't want even a single machine to go down. But you DO want VERY thorough debugging, auditing and metrics.
(Which apparently don't exist.)
--
As for a Heartbleed-esque catastrophe, apparently one is going to be announced for Intel ME at the upcoming Blackhat(?) conference in December. I can't wait to hear about it myself.
Erlang's error-handling model is good (and interesting). The motto is: "Let it crash".
Each node does not handle errors at all, but PANICs on a fault. It is up to the supervisor (with global knowledge and state) to handle the fault appropriately.
It's good for uptime, but not good for correctness. The main problem is that it is hard to differentiate expected from unexpected crashes. Something like a missing pattern match can lead to a crash and it is very hard to know if the programmer "intended" for a crash to occur in that case or if the missing pattern is a bug.
You can have processes that reboot once every few minutes running for years because people didn't realize they were bugged.
>As for a Heartbleed-esque catastrophe, apparently one is going to be announced for Intel ME at the upcoming Blackhat(?) conference in December. I can't wait to hear about it myself.
Light on details, but the vulnerabilities are disclosed and fixed [0]. ME updates are already available from many OEMs.
Right. But "don't apply the patch!" is sort of circling as well, because (presuming the Blackhat disclosure is workable, it sounds like it will be but fingers crossed) we might be able to play with our MEs.
Many medical devices run Linux. Most (AFAIK) patient monitors run Linux; GE and Philips (the biggest in the business) both run on Linux. Those are the devices that keep you alive during surgery, make sure that those who are born too early (I don't know the English term here) are doing ok, monitor your state while you are in an ambulance, etc.
I'm reminded of a UAV doing the same thing. It ran L4 for low-level control, realtime scheduling, and security, and then virtualized Linux on top of that.
Sounds unbelievably clunky on the surface, then you realize it's a remarkably useful way to abstract everything cleanly.
For safety-critical systems, resetting on a fault is very much factored into the worst-case response time and expected behaviour.
PANIC on fault is exactly what you design into the systems.
What you find is that the truly safety-critical portion of the system is running on a microcontroller and the UI (which is not safety-related) can run on Windows or Linux.
Seems like so many in security fail to see the DoS implication.
There is no solution to bugs other than fixing them. And that's what Torvalds and others have been saying: for a security researcher, finding the bug is the end of the job. For developers, that's just the start.
A better way, of course, is not to halt on an assertion, but to limit the scope of any potential problem in such a way that an assertion could only crash a tiny isolated thing and trigger its restart, possibly without impacting availability whatsoever. You still get your sanity, but also get users happy with a rock-solid thing that just works even in the presence of errors.
The Erlang VM works this way, if someone's looking for a program that does this in practice. They have the mantra "let it crash": something higher than you is in a better position to handle your error and restart you to a known good state.
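A rough C analogue of that supervisor pattern (a sketch of the idea only, not how Erlang/OTP actually implements supervision trees): the worker aborts on any violated invariant, and the parent notices the abnormal exit and restarts it from a known-good state:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void worker(void)
    {
        srand((unsigned)getpid());
        if (rand() % 4 == 0)
            abort();                        /* invariant violated: let it crash */
        sleep(1);                           /* ...otherwise do one unit of work */
        _exit(EXIT_SUCCESS);
    }

    int main(void)
    {
        for (;;) {
            pid_t pid = fork();
            if (pid < 0)
                return EXIT_FAILURE;
            if (pid == 0)
                worker();                   /* child never returns */

            int status;
            waitpid(pid, &status, 0);
            if (WIFSIGNALED(status))
                fprintf(stderr, "worker died with signal %d, restarting\n",
                        WTERMSIG(status));
        }
    }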
I think a good tradeoff could be that with containers, the individual containers are hardened, whereas the kernel's host OS is not. The host OS doesn't do much except keeping the containers running.
AFAICS, Linus also wants it, but he wants a panic to be preceded by a rather lengthy span with just a warning, allowing the concerned dev to actually fix the error. Essentially he's saying: "take it slow, and don't break user experience."
Due to the aggressive nature of grsecurity, a lot of the assertions it trips on are bogus; either they didn't understand the code they were securing or they changed the rules without properly updating all the affected code. For example, there was a particularly obnoxious panic in the tty layer a few versions back that was entirely the result of this.
Servers: yes, phone/home PC: no. Maybe my dev machine but my wife wouldn't be happy with a kernel panic while writing an email. Like Linus said, this would go unreported with the average user because they just reboot in annoyance.
When the invalid write overwrites some piece of data your application doesn't care about (or more likely some feature in some driver you don't care about). Especially when the trade-off is the web site goes down.
You don't want your machine to start crashing after installing the latest kernel. Or at least, that's the golden rule of Linux development.
If phones start crashing after installing the latest Android update, people won't see this as a security/stability improvement. They'll simply see the new version as buggy and of poor quality.
It's a deliberate trade off. Not panicking results in uncaught exploitation attempts, and panicking will result in crashes where a vanilla kernel would happen to survive.
If an assertion does not hold, you have a real problem. The kernel has in-memory data corruption. Either from buggy code, bad memory, solar radiation, etc.
So if that assertion is in the file system, maybe your kernel should die before it corrupts your data permanently.
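A tiny userspace sketch of that fail-stop idea (the record layout and checksum scheme are invented for illustration): verify the in-memory structure before it is ever written back, and die rather than persist garbage:

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct record {
        unsigned payload;
        unsigned checksum;      /* invariant: checksum == payload ^ 0xA5A5A5A5 */
    };

    static void write_back(FILE *f, const struct record *r)
    {
        /* Die here rather than persist a record whose invariant is broken. */
        assert(r->checksum == (r->payload ^ 0xA5A5A5A5u));
        fwrite(r, sizeof(*r), 1, f);
    }

    int main(void)
    {
        struct record r = { .payload = 42, .checksum = 42u ^ 0xA5A5A5A5u };
        FILE *f = fopen("record.bin", "wb");
        if (!f)
            return EXIT_FAILURE;
        write_back(f, &r);
        fclose(f);
        return EXIT_SUCCESS;
    }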
Depends. How attached are you to getting work done today?
I too would like to live in a world where sanity preserving assertions have rational consequences. But that world is not this world, and pretending it is won't help you get there.
"Grsecurity will rather terminate userland programs or, in some rare cases, panic the kernel if it finds itself in an undefined state. This is exactly what you want if you care about security, but it's not a trade-off everyone is happy with (including Linus)."
If you really cared about security, you'd leave the box unplugged.
"I remember many bugs that were uncovered by PAX_REFCOUNT and yes, occasionally panicked the kernel where a vanilla kernel would run just fine. They usually found and fixed those within hours."
Speaking as someone who has done middling large scale production administration, that's not reassuring.
Because Linus hates them? Make Brad Spengler the primary maintainer of the Linux kernel and we may actually get self-driving cars that don't kill us when they get hacked in 5 years.
> Make Brad Spengler the primary maintainer of the Linux kernel and we may actually get self-driving cars that don't kill us when they get hacked in 5 years.
But only because every self-driving car project out there will avoid Linux like the plague.
Instead of self-driving cars maybe crashing from being hacked in some possible future, we'll get kernel panics leading to crashes in all possible futures, because we'll trigger car crashes on every false positive - because crashing in the face of the unknown is a seemingly acceptable solution to a security risk, even if the software is running self-driving cars and crashing may mean crashing. In practice, false positives are way more common than exploits, and in many use cases users would rather have 1 computer exploit than 1,000 or 10,000 crashes.
You would rather have a self-driving car in an undefined state than have it shut down? A random glitch could be just as bad as an exploit; if some chunk of memory gets overwritten and your car decides that the brick wall doesn't actually exist any more, I don't think whether it was an exploit or not really matters. The occupants end up injured either way.
The undefined state might be in the GPU driver handling the heads-up display, or maybe in the sound subsystem. No need to shut down the system at the kernel level for that. Report the issue to the userland, so that it can decide whether to initiate a safe halt at the sidewalk or emergency lane.
After telling the kernel to shut down immediately you don't have that option any longer.
It's interesting to see this laser focus on a particular kind of user. If you're running Linux on a server, you're a user, but unless you're very irresponsible you would probably rather your programs crash than give away private information. Your interface is to a cluster of machines where individual crashes are probably not that big a deal.
If you're running Linux via Android, you're a user, but mostly you're a user of actively developed apps on top of an actively developed OS, usually pegged to specific kernel versions. Your interface is to that layer on top, and given that its code is written by app developers and hardware vendors who will ship anything that doesn't crash, you probably want security bugs to crash.
It seems to me that the kind of user Linus means when he talks about "the new kernel didn't work for me" is a user of Linux without any substantial layers on top, where kernel updates happen more often than userland software updates, and where individual crashes have a significant impact. In other words, users of desktop Linux.
But I wonder if that focus on desktop Linux really reflects the majority of users. And, if not, perhaps it might make sense to have "hardening the Linux kernel" as the first step if it makes "raise the standard for the layers built on top" the endpoint.
>you would probably rather your programs crash than give away private information
Crashing on a security issue is a good thing for every kind of user. Crashing on a latent bug that COULD be exploited (maybe not possible at all) is a totally not desirable situation. The problem here is that hardening methods lack the ability to make that distinction.
> Crashing on a latent bug that COULD be exploited (maybe not possible at all) is a totally not desirable situation.
How do you square this with the reality that "keep on truckin" is generally the path from bugs to security exploits, and has been shown to be over and over in the wild?
Bugs will happen; that's a natural law of computer science. If you keep on trucking over them, you will be delivering buggy software that is likely to cause problems. Even if you chase them down and correct them all, your software is still going to have bugs; that's a fact of life.
Should code containing bugs be allowed to run? If the answer is no, we must ask ourselves how much software we have today that is completely bug-free (that will be 0%).
I still think these proactive approaches are good for disclosing possible exploits, but killing processes just because they might be exploitable is a very long shot.
One of Linus's points is that most of the crashes will not be because of an active attack but because of a possibly latent bug that sometimes appears and might very well be non-exploitable. So people running servers would probably like to have everything working instead of having random crashes in processes or drivers that are not the core of their service but can affect it. And given the size and complexity of the kernel, it would not be strange for these crashes to appear only on certain setups, and not necessarily on the ones of the people testing it first.
Errors are less likely to be actual exploits on servers, too. When a kernel panic is caused by a faulty driver, failing network hardware, or userland software failures, it can take down multiple servers or all of them at once.
Most of the information on servers is private but not sensitive. You don't want anyone to have access, but correct functioning and security warnings are more important than maximum information lockdown.
Btw. I don't see a reason for not having a kernel option to turn warnings into kernel panics.
It’s not even desktop users; most desktop users download Ubuntu and never touch anything, on reasonably common PC hardware. Kernel regressions mostly get caught in the Ubuntu betas or testing tracks (e.g. Debian Sid).
The typical user the kernel developers focus on here is a kernel developer: always running the latest kernel with a stable user space. I find it extremely narcissistic that they reject security improvements for billions of devices for what essentially just makes developers' lives easier.
Very pragmatic. He sees software in the overall context of getting a job done with a computer, imperfect though it may be, instead of dying because it was not perfect.
Unlike a segfault from a user space program that indeed merits a 'kill', the kernel should strive at all costs to keep running, since kernel panics are so much more inconvenient.
I just wish that the higher layers of the stack would see it as well, rather than being angry at Torvalds about it and going on about how users are dumb sheep.
If “do no harm” is a principle, then the kernel should ensure that no harm is taking place.
If flaws within the kernel allow harm to occur while otherwise normal transactions are occurring then it is absolutely preferable to panic and shut down over allowing that potential harm to occur.
To suggest otherwise, that detected errors that allow harm should be allowed, is pure insanity.
A thought experiment that comes up in Kernel design classes is what should happen if the OS was running the flight-control software for an Airplane you are on? If there was a bug in the kernel, perhaps a double free or a memory leak, what should happen?
A panic would result in the airplane falling to certain doom. But if it were to keep running, it may be a security vulnerability. Being absolutist in either direction of the discussion will lead to absurd scenarios where you would make the wrong decision.
> double free or a memory leak, what should happen
Both offensive and defensive programming are important in safety-critical programs, and I get your point, but those things you mention don't happen in safety-critical systems.
There is no dynamic memory allocation. The RTOS used will support "brick wall partitioning" for memory, processing and other resources. Different systems can run in the same OS but they can't compete for processing time, locks or memory access. Everyone has been dealt the resources they can have from the start. It's not possible to run out of file descriptors or memory if you allocate them statically from the start.
Assertion errors or monitoring errors in safety-critical systems usually cause a reset or a switch to a backup system. If the program state is large and a reset is not safe, retreating to some earlier state (constant backups) is likely.
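A minimal sketch of that static pre-allocation style in plain C (sizes and names are illustrative, and a real RTOS would enforce the partitioning for you): every object comes out of a fixed pool sized at build time, so running out at run time is a design error rather than an OOM condition:

    #include <stdio.h>

    #define MAX_MSGS 32

    struct msg { int id; char body[64]; };

    static struct msg msg_pool[MAX_MSGS];   /* dealt out at start-up, never freed */
    static int        msg_used[MAX_MSGS];

    static struct msg *msg_alloc(void)
    {
        for (int i = 0; i < MAX_MSGS; i++)
            if (!msg_used[i]) {
                msg_used[i] = 1;
                return &msg_pool[i];
            }
        return NULL;   /* pool exhausted: a sizing error caught at design time, not OOM */
    }

    int main(void)
    {
        struct msg *m = msg_alloc();
        if (m)
            printf("got slot %ld from the static pool\n", (long)(m - msg_pool));
        return 0;
    }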
The kernel is modular. Literally everything can be enabled/disabled.
Aviation has strict regulations, and that's why most critical systems have redundant parts. Putting a single critical component into a plane is stupid in and of itself. Think of simple freezing at high altitude, or overheating otherwise.
On the other hand, I would rather fly in a plane whose altimeter shuts down and switches to a redundant circuit rather than letting it report incorrect values...
You seem to be one of those "security people" Linus refers to. ;)
Harm is relative. You security people think that every single security issue is so important that it doesn't matter what harm mitigating it may cause - it has to be done. Well, that is not what Linus thinks.
The kernel may terminate a process because it did something suspect, but doing that may actually cause way _more_ harm.
The philosophy here is that security bugs are just bugs.
No. First, you never know if an invalid access or integer overflow can actually be exploited, and even if it can, you don't know if it can be exploited remotely. If you run a server hosting sensitive data, by all means use grsecurity, but on my home PC where I browse Facebook and send emails, fuck off with your kernel panic.
But as he explains, these are latent bugs. They may or may not be targeted for exploitation yet. Meanwhile, hard crashing over them can lead to a poor user experience, while throwing up warnings and leading to an actual fix would be better from the user's PoV - unless of course userspace actually depended on the buggy condition, but that's another discussion.
It's not so black and white. Medtronic uses Linux. Do you want to be the guy whose medical equipment spontaneously reboots because of a bug that wouldn't have otherwise affected anything?
Linux is a modular kernel. I'm not aware of a single thing you can't disable or make modular during config/compile.
I wouldn't like to be the guy whose medical equipment killed him by slowly decreasing his oxygen levels due to a buffer overflow either. If you're in this kind of business, you take responsibility by discovering and fixing bugs which would go unnoticed otherwise. And if the lives of your patients really depend on your equipment, then having a redundant component within your device is a must.
You mistake "do no harm" for "don't let the user do harm". This is "do no harm" in the sense that Hippocrates said it; your job as a kernel security dev is to not harm the user, just as a doctor's job is to not harm his patient.
"without users, your program is pointless, and all the
development work you've done over decades is pointless.
.. and (then) security is pointless too, in the end."
He tends to get really mad when kernel devs inconvenience user space devs. Perhaps one of the reasons Linux succeeded was this fanatical customer focus - if Linux is the platform, user space developers are the customers.
Not breaking user space is fundamental to the success of most operating systems today. I read an article I cannot seem to find about Microsoft employees spending months replicating "wrong" behavior in <oldwindows.. win95?> so that applications still ran on <newerwindows... win98? win2000?>. Users don't want to see their applications crashing.
I wish I could find this article again, it's very relevant to your comment.
The person at the other end of the conversation would disagree with this sentiment:
"Thanks. Still, I'd prefer Linus yell at me than other folks trying to do similar work. If I can shield anyone from this abuse, then maybe they won't give up on kernel security development. Digging Linus's actionable feedback out of the ad-hominem attack can be challenging." [1]
I am happy Linus shields Linux kernel from those golden-hearted individuals whose feel-good attitude would allow questionable solutions designed by committee to slip in.
That's a consequence of an "old" issue in the IT security field - security researchers and developers sit at opposite sides of the table, they've got different concerns and agendas.
Pick some security researchers; now tell them to build any nontrivial piece of software; I doubt they'd be able to do it, and if they succeed their software will be full of bugs, including security ones.
Security is part of the correctness and of proper building of the software, so it should be integrated into software development. Security experts can (and should) still exist, but the current state, where the infosec people appear to rule, is pointless - exactly because the same infosec people wouldn't be able to deliver better software than current developers.
I highly regard somebody who can write software without security bugs; I don't regard as highly somebody who shows me the bugs but would be unable to write that software at all.
Let's turn the infosec objective around: not discovering security bugs, but writing software without security bugs. Then we're on the same side of the table.
That's not a fair (or useful) assessment. Obviously, the narrow-minded security people you describe exist, but they're a minority. Many security people are developers who specialized in security, and are very much capable of building software.
The kernel code in question is exactly what you ask for - instead of finding and fixing single bugs, it's a mitigation that prevents all occurrences of a particular class of bugs.
The mitigation does not prevent a particular class of bugs; it prevents a particular class from being exploitable, by turning an invisible-but-possibly-exploitable bug into a crash bug. The bug is still there, but now it has a larger impact on most customers.
That's a serious tradeoff, especially since (as Linus is complaining) turning a rare bug into a crash bug doesn't allow you to detect and fix it. You need a mode where the bug is logged but the process is not stopped (and might be exploited), so that the bug can be reported, reproduced and fixed.
Of course it's "anecdata" because I've never personally conducted a test of all security consultants for their programming skills.
But the people you're describing (developers who specialized in security) maybe exist in large shops (Google, MS, FB), while many, many security consultants work for specialized firms that offer security services but not software development, and vice versa. Take a look at most Defcon/BlackHat talks where a vuln is explained/uncovered/exploited: most such researchers don't belong to a software development firm, but to independent security firms.
Why should my assessment not be useful? I proposed a very clear solution to what the problem is.
Source: I worked for almost 10 years at a firm that had both a software development and a security services branch, met tens of security consultants, and worked on remediation activities for software issues where the security consultants weren't able to do it.
EDIT:
about the kernel code, I agree with you that such code is a step in the right direction, but I agree with Linus that the "warn" should come before the "kill".
When I was looking to learn how to systematically make secure software, I did not find all that much great, actionable information. There is a lot about particular hacks and vulnerabilities, lists of popular vulnerability categories, etc. There was one book I found dealing with architecture and such. Development quite clearly is not the focus of security research.
A lot of the advice, especially that found on blogs, was downright naive and felt like something written by someone who had never even seen a larger team working.
As an infosec guy who was a software developer, it's non-trivial to write actionable general security advice.
There is an entire academic field of study on making network-related security blunders hard (langsec). It generally boils down to: do all your parsing in one spot, and a small set of features are evil.
What is really needed is a site where one can pick the features that your software project has/wants and then get semi-tailored advice on what to do, what to watch out for, or whether you need to rethink things (e.g. rolling your own TLS implementation = world of hurt).
Many developers care about security the same way as security researchers.
Developers who care about design by contract, assertions, having warnings as errors, pedantic flags, getting the CI system to break builds on static analysis errors, only allowing unsafe code when profile measurements confirm it actually matters, ...
Yet, we are able to deliver working software that fulfills project requirements, amazing!
>It's that the code has been RUN BY USERS for months. If it's been [...] in grsecurity for five years [...] It only means that hardly anybody actually ever ran it.
Subtle burn towards Grsec, I laughed a bit.
In all seriousness, I think Linus is somewhat on the right track. Security patches should foremost not break anyone's workflow (except maybe the evil haxor's workflow) and should rather print a warning until the exact implications of a full Terminator-mode patch are understood. Because people won't upgrade to kernels that break their workflow, and a warning in the kernel log is better than a vulnerable kernel.
It killed printing from iframes completely. Great that they solved the security problem, whatever it may have been, but they also broke a major piece of browser functionality that a lot of enterprises fundamentally rely on. Hell, even printing shipping labels from eBay was broken, and so, hilariously, was Microsoft Dynamics 365. And our own product.
And the first response by Microsoft on this very link? "Won't Fix", leaving a major bit of fundamental browser functionality - printing a document - not working. "Do No Harm".
Because there is a workaround, applications just need to be updated.
"Either use Print Preview instead of print button or if currently using window.print() in javascript change this to document.execCommand('print’, false, null)"
> Because there is a workaround, applications just need to be updated.
What actually happened: people reverted the patch. In the real world, you can't expect a timely or even correct response from vendors you rely on. It sucks, but that's how it is.
Updated away from web standards to false, null, and a random string. And then they made window.print work again.
Additionally, what guarantee was there that execCommand(magicString), where magicString is `print`, wouldn't be removed in the next security "fix"? That fix should have broken this too, as it did window.print; after all, it seems to do the same thing. Answer: none at all, and then you're back to square one.
Some security folks tend to have tunnel vision. They are focused on their own little patch, and the first instinct is to control and shut down with zero concern for usability. Job done. Bye. But that's too self-serving and lazy.
And if that's what they want, fine: release your own app or distro and let the people who need or desire that level of security choose security over everything else, without compromise.
The worst thing is a kind of entryism: trying to impose yourself on people operating under different constraints. That's why you always need a strong manager to balance interests, and this is the role Linus is playing. More often than not, with this kind of imposition something comes out of the blue and you are left squandering hours to fix things and get back up and running.
1) Failing loudly is better than failing silently. A memory corruption issue (or a bad refcount, etc.) is not a benign issue that only becomes relevant under carefully crafted exploit conditions. You need the carefully crafted exploit to get the system into an attacker-controlled state (i.e. code execution); by itself (with non-malicious inputs, usually something random or slightly atypical: unusual enough not to have been noticed yet, but typical enough that some program does it), the system is likely either to panic immediately (the same result as with PaX) or to corrupt some memory, in which case you will have a lot of strange behaviour to track down later. Users will probably blame that on hardware or on their user space, so you might never see it; for example, a recent OSDI paper showed that ext3/4 had several real-world data corruption bugs, and if those aren't as frequent as the recent bcache issues, no one notices.
2) When I was doing research projects (into memory defenses in the kernel) about three years ago, there was no commonly used automated testing infrastructure in the kernel that I saw. That makes regressions, especially in drivers for rare hardware, hard to catch. While tests aren't a panacea, I think Linus overestimates what fraction of problems code review will catch.
3) The "don't break user space" strategy is already failing. Every mainstream distribution and embedded vendor stays on an old kernel branch, and big deployments do staged rollouts and extensive burn-in tests. This isn't just because of the kernel, but because of breaking changes everywhere (compilers, standard libraries, etc. all need to change sometimes). The last time this happened to me, IIRC, it was some audio bug in a strange configuration. In my experience, running a non-standard Linux audio config causes countless breakages anyway, so an additional one in the kernel that might save my personal data from being exfiltrated is worth it. Most users have average (and therefore well-tested) setups, which means they won't see breakages as often.
Perfect software doesn't exist, and even MSFT backed off from maintaining religious backwards compatibility (note that Microsoft's approach was not to flame developers and hinder new development, but to build compatibility shims extensively; often these came with trade-offs strongly in favour of security, e.g. UAC).
Breaking user space is ok; users already expect breakage, and the cost of the additional breakages is low (to users and to society as a whole) compared to the cost of security breaches [citation needed, but Linux kernel security is relied on in a lot of places].
So one of Linus' main points in this series of posts is that failing loudly is actually not always better than failing silently or quietly, and it's really annoying when people come in making that assumption without thinking. This is also something that he is constantly repeating and ranting about, and it's arguably one of the reasons why Linux is so successful.
Think about a smartphone - do most users want it to crash and reboot, even if some error (which could end up being a security issue) occurred? The answer is no, absolutely not. The crashing and rebooting itself isn't really that helpful. Reporting the bug to the Linux developers _would_ be helpful.
Some people do want the frequent crashing behavior and that's okay, but it's not okay to make that decision for everyone.
Also, users might expect minor breakage if someone somewhere makes a mistake, but that doesn't mean it's okay. That's like saying if someone always washes their hands before eating, it's okay if they get sick, because they were expecting that they might get sick.
> Breaking user space is ok; users already expect breakage, and the cost of the additional breakages is low (to users and to society as a whole)
Which is why everyone loves rolling releases so much that Windows 10's forced upgrades are universally praised and Linux Desktop has a dominant market share.
If your security patch kills users' buggy processes or even crashes their systems, then you are a «bad security person». Please report the bad access first, so that users and the developers of their software have time to fix the bug. Upgrades that disable users' software are a big no-no. After all, security is meaningless on a non-working system.
Linus has been pounding on the bench saying "Don't Break Userspace" for over two decades now. This is really just another manifestation of that same policy. Some 'security' folk apparently believe they occupy a special place in which they are exempt from this policy. Linus is showing them they're wrong; evolving the kernel into a minefield in the name of security is not the way to world domination.
And I think a big part of Linus's point is that if you break userspace for security now people will be less likely to upgrade their kernels in the future.
So they'll be secure from a hypothetical possible bug today, and completely vulnerable to real demonstrated security exploits in the future.
I’m so glad that backwards thinking concepts like this are dominant, otherwise we might actually have secure software!
Think about it, an open-source OS is choosing backwards compatibility over security. This would have caused quite the stir in the 90’s Linux community.
Security is meaningless if your system doesn't work. Makes sense to me. Otherwise your car would be forbidden from starting after an upgrade, because you might run over someone tonight. It's a trade-off.
They're choosing backwards compatibility over bad security (killing everything instead of reporting and fixing bugs). Applying duct tape everywhere instead of fixing things is hardly desirable.
And as a side note, backwards compatibility is what made Windows succeed.
Do you want your system to kernel panic every time the security code gets a false positive about a problem? The security code is just as likely to have bugs as the software it is securing.
This is the most insightful Linus writeup yet. His others are good too, but this just hits the spot. Great!
Funny note: this post could have been textbook material. At the end he even says please. The only thing that breaks it is the reference to touching oneself :)
No security person is likely to argue that DoS attacks aren't a security issue; they get CVEs all the time! A patch that denies the user the ability to use the service introduces a security issue. Bad usability IS a DoS attack.
With respect to https://lkml.org/lkml/2017/11/21/356 and this thread I just want to get #CVEs on all the things so we can figure out the scope of the problem, and then look at what we need to do to "fix" "it" (assuming "it" is a real problem, dunno without data).
Lots of drivers on Windows, OS X and Linux run in kernel space simply because kernel-to-user-and-back context switches are expensive and so kill performance.
I believe the exceptions are printer and scanner drivers (these run in user-space CUPS in OS X/Linux), some filesystem drivers (basically, FUSE-backed) and cheap-ish USB drivers.
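To make the "cheap-ish USB drivers" case concrete, here is a hedged sketch of a user-space driver using libusb (the vendor/product IDs and the endpoint are hypothetical); the kernel only provides the generic USB plumbing, while the driver logic lives in an ordinary process:

```c
#include <stdio.h>
#include <libusb-1.0/libusb.h>

/* Hypothetical device: substitute your gadget's real vendor/product IDs. */
#define VENDOR_ID  0x1234
#define PRODUCT_ID 0x5678
#define EP_IN      0x81   /* bulk IN endpoint 1 - a device-specific assumption */

int main(void)
{
    libusb_context *ctx = NULL;
    if (libusb_init(&ctx) != 0)
        return 1;

    libusb_device_handle *dev =
        libusb_open_device_with_vid_pid(ctx, VENDOR_ID, PRODUCT_ID);
    if (!dev) {
        fprintf(stderr, "device not found\n");
        libusb_exit(ctx);
        return 1;
    }

    if (libusb_claim_interface(dev, 0) == 0) {
        unsigned char buf[64];
        int got = 0;
        /* A bug here crashes this process, not the kernel. */
        if (libusb_bulk_transfer(dev, EP_IN, buf, (int)sizeof buf, &got, 1000) == 0)
            printf("read %d bytes\n", got);
        libusb_release_interface(dev, 0);
    }

    libusb_close(dev);
    libusb_exit(ctx);
    return 0;
}
```

The cost mentioned above is visible here: every transfer is a round trip through the kernel's generic USB layer and back, which is fine for a label printer but a poor fit for, say, a GPU.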
I get the logic behind why it is done like that. I'm just wondering, as you said, whether it is possible to push at least the most bug-prone and exploitable ones to user space.
I don't see how you can convert a kernel-space driver to a user-space one without significant rewriting, and in some cases it may not be possible at all.
What about some abstraction/interfacing layer/driver that would take care of exposing some kernel functionality an average driver needs and provide additional validation?
The "most bug-prone an exploitable" dimension is a bad one. There's probably some correlation, but it is not a good way to look at the differences.
You can push into userspace the software that work some data into some low level data. You can't push into userspace the IO of that low level data to the hardware. If your driver is mostly interpreting complex data before IO, you can push most of it into userspace, but if it is really doing IO (or calculations are interspersed with IO), you can't.
Since WDDM, a good portion of the display driver in Windows is in user space, as are sound and network drivers. This is how you can upgrade your graphics driver without rebooting or even losing your open windows, and why network and sound driver crashes don't usually bluescreen. It is probably the biggest reason why Windows doesn't crash anywhere near as often as it used to.
> Because the primary focus should be "debugging". The primary focus should be "let's make sure the kernel released in a year is better than the one released today".
Is he starting to sound like illumos engineers or what? Better late than never, but it took him long enough.
Do you want to know why your (random big company) does not do software development that well? When was the last time you saw an email like that from the chairman of the board to all employees? With "please" in it, and long-winded explanations?
This is what I was thinking while reading. Linus has been thinking carefully about this for a long time. He has taken this piece of software and nurtured it for years. He has protected it. He has made sure that it is coherent and maintainable. He has put forward a set of guiding principles.
I haven't seen this in the companies I have worked for; middle and upper management don't have any insight into the products they are building. They just care about dates and getting projects done. There is no insight. There is no long-term thinking. There is no awareness of the technical debt that is building up. Their only solution is to throw more money (and people) at the projects.
I 100% agree with Linus here, buuuut Linus doesn't have to make money for his company, and isn't personally responsible for the salaries of the people in the company. He doesn't have to sit in meetings and explain why sales are down 10% this quarter, or keep assuring investors that their money is safe. Those upper-management people do solve problems; it's just that they're solving them in an environment that isn't based on logic and is fundamentally unfair.
The first thing I did when starting to read Linus' answer was to Ctrl-F for bad words. I am really happy he's moved on from that teenage-angst, colorful name-calling era.