I wrote the email that prompted this quite civil response. I'm very pleased with the outcome, because I think this clear statement of his position is a lot more useful for people to work with, rather than just assuming Linus hates security or something.
I interpreted his response in practical terms as essentially being the following. Patch set merge 1 has "report" as default and "kill" as a non-default option. Patch set merge 2 has "kill" as default and "report" as a non-default option. Patch set merge 3 removes support for "report". This way we have the best of both worlds: we eventually reach the thing that actually adds real security benefit, which makes security folks happy. And we don't break everybody's computers immediately, allowing time for the more obvious bugs to surface via "report", which makes users and developers happy. Seems like a reasonable process to me.
Though I think the time between PSM1 and PSM2 will be significant. Defaults are usually changed only once basically all distros have been compiling with the non-default option without widespread breakage. And once no supported LTS kernel still carries PSM1, you merge PSM3.
Might take years, but at least the airplanes keep flying instead of crashing their computers and consequently themselves.
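To make that concrete, here is a minimal userspace sketch of the report-first, kill-later staging described above; all names are made up for illustration, and in the kernel the mode knob would be a config option or boot parameter rather than a global variable:

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative only: in the kernel this knob would be a config
     * option or boot parameter, not a global variable. */
    enum mode { MODE_REPORT, MODE_KILL };
    static enum mode hardening_mode = MODE_REPORT;   /* PSM1: report by default */

    static void check_invariant(int ok, const char *what)
    {
        if (ok)
            return;
        fprintf(stderr, "hardening: invariant violated: %s\n", what); /* always report */
        if (hardening_mode == MODE_KILL)
            abort();    /* PSM2/PSM3: kill, once reporting has flushed out the easy bugs */
    }

    int main(void)
    {
        int refcount = -1;  /* pretend some reference count underflowed */
        check_invariant(refcount >= 0, "refcount went negative");
        puts("still running, because we are in report-only mode");
        return 0;
    }

The detection logic never changes between the patch sets; only the default reaction does, which is what makes flipping the default later cheap.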
I think the most important point here is that with those different patch sets, the more security-conscious users/companies get to put the properly hardened version into use immediately, rather than running the "only report" versions for what might be years, as you say.
Granted, I'm looking at this from a perspective where our company compiles its own kernel for use in embedded devices, so we can carry whatever patch sets we want. But I think that's much better than everyone having to use the "report only" patches for years, or even worse, the features never getting into the kernel in the first place.
The other important "users" are the developers and drive-by-developers of those user space processes which may accidentally trigger these bugs. These folks are highly likely to be able to resolve these issues if they are reported [in a way which is visible to them].
Counterpoint: most developers and users are not actively following their logs at any level, and maybe something should be done to make it more common. Possibly stderr logging of such errors in libc (or some other commonly used library which already sometimes logs errors on its own, like glib). 3:- )
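As a rough sketch of that idea (purely hypothetical; neither libc nor glib ships anything like this, and a straight SIGKILL from the kernel cannot be intercepted this way), a commonly linked library could install a last-gasp handler that makes the failure visible on stderr before the process dies:

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical: write an async-signal-safe note to stderr, then
     * re-raise the signal with its default (fatal) action. */
    static void report_and_die(int sig)
    {
        static const char msg[] =
            "library: process is dying on a protection fault; "
            "please report this to the application's developers\n";
        write(STDERR_FILENO, msg, sizeof(msg) - 1);   /* write() is signal-safe */
        signal(sig, SIG_DFL);
        raise(sig);
    }

    __attribute__((constructor))
    static void install_reporter(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = report_and_die;
        sigaction(SIGSEGV, &sa, NULL);
        sigaction(SIGSYS, &sa, NULL);
    }

    int main(void)
    {
        volatile int *p = NULL;
        return *p;                  /* deliberate fault, just to show the report */
    }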
Is that really a position Linus has maintained for a long time? Because I got the feeling Linus really just hated anything to do with security.
It's only in more recent years, when the automotive and IoT industries have started to get involved in the Linux Foundation and have asked for more security features, that he seems to have tried to find ways to "compromise" with security people.
You're going to get a very different perspective depending on whether you actually lurk on lkml or just read when something gets linked by social media or "news" that needs to wrap it in a hot take to make lookie-loos interested in lkml.
Where he's quite civil and explains quite clearly why he won't accept the patch, and what would need to happen for the patch to be accepted. Kees's reply insisting on the merge is what led to the profanity-laden email, and frankly I understand that (not condone, but understand from a human-reaction point of view) - how many times does one have to reiterate one's viewpoint to make oneself heard?
> IT IS NOT ACCEPTABLE when security people set magical new rules, and then make the kernel panic when those new rules are violated.
> That is pure and utter bullshit. We've had more than a quarter century _without_ those rules, you don't then suddenly walz in and say "oh, everbody must do this, and if you haven't, we will kill the kernel".
> The fact that you "introduced the fallback mode" late in that series just shows HOW INCREDIBLY BROKEN the series started out.
This makes perfect sense and outlines a real problem with security patching in general.
Sadly this kind of magical security thinking has many proponents higher up in the Linux stack, and they have the backing/support of GKH. Thus I worry about what will happen the day Linus gives up the reins.
Background: the "Kernel Self Protection Project" (KSPP) recently upstreamed the Grsecurity/PAX reference counting implementation, which prevents a certain class of security bugs from being exploited.
Grsecurity is a security hardening patchset for Linux that makes deliberate trade-offs in favor of security, sacrificing availability if necessary. This, aside from the political issue, is the main reason why it's hard to upstream. Linus has called some of their mitigations "insane" before precisely for that reason. Grsecurity will rather terminate userland programs or, in some rare cases, panic the kernel if it finds itself in an undefined state. This is exactly what you want if you care about security, but it's not a trade-off everyone is happy with (including Linus).
Unfortunately, Grsecurity/PAX is not (and probably won't ever be) involved in the KSPP project, and the KSPP developers do not understand the code nearly as well as the Grsecurity team does. This led to a situation where the new code caused a crash that they weren't able to fix in time, so they disabled the feature at the last minute.
I used Grsecurity for years until they stopped making it publicly available, and I remember many bugs that were uncovered by PAX_REFCOUNT and yes, occasionally panicked the kernel where a vanilla kernel would run just fine. They usually found and fixed those within hours.
Grsecurity/PAX have invented many of the modern exploitation mitigations, probably second to none. Some have even been implemented in hardware. Their expertise in building modern defenses is astonishing (their latest invention, the control flow integrity mechanism RAP, is a work of art).
Linux could be the most secure kernel; instead, it has fallen way behind Windows, which has much better defenses than Linux nowadays thanks to Microsoft's ongoing battle with rootkit writers. Go figure.
If the large companies who use Linux really want to improve kernel security, they need to work with Grsecurity and not against them. It's beyond me how this isn't happening already.
>> This is exactly what you want if you care about security, but it's not a trade-off everyone is happy with (including Linus).
Let's rephrase that. This is exactly what you want if you care only about security, or care about security above everything else - including your system running at all. People run software for reasons, and they need it to keep running for those reasons. The security folks are not really qualified to evaluate the security risks against all the reasons for all the people running Linux.
> From a security standpoint, when you find an invalid access, and you mitigate it, you've done a great job, and your hardening was successful and you're done. "Look ma, it's not a security issue any more", and you can basically ignore it as "just another bug" that is now in a class that is no longer your problem.

> So to you, the big win is when the access is _stopped_. That's the end of the story from a security standpoint - at least if you are one of those bad security people who don't care about anything else.

> But from a developer standpoint, things _really_ are not done. Not even close. From a developer standpoint, the bad access was just a symptom, and it needs to be reported, and debugged, and fixed, so that the bug actually gets corrected.

> As a developer, I do want the report. But if you killed the user program in the process, I'm actually _less_ likely to get the report, because the latent access was most likely in some really rare and nasty case, or we would have found it already.
Fixing all memory corruption bugs is infeasible without fundamentally changing the way Linux is developed. There is simply too much code, constantly being added to and changed, written by humans who make mistakes.
There will always be some bugs that sit between being discovered (by someone, maybe malicious, maybe not) and being fixed. How else do you protect against vulnerabilities in that window?
Linus' response is that calling it an infeasible problem is a cop-out. The right way to go about it is to fix them all, incrementally if need be, and not break userland in the process.
These comments sound analogous to real-world security and societal issues - like the choice between increasing the size of the army and addressing the underlying issues.
One is a short term solution, the other long term.
>If the large companies who use Linux really want to improve kernel security, they need to work with Grsecurity and not against them. It's beyond me how this isn't happening already.
It's more that Grsecurity is working against everyone else. They want to pretend the GPL works in a way that it doesn't so that they can sell their patches. Then they make threats to people who say "that's not how the GPL works" and distribute their patches in accordance with how the GPL actually works. That's not how kernel development is done. I'd rather have an insecure kernel than their bullshit.
Bruce Perens, however, is entitled to not be the victim of Spender's legal harassment for exercising his First Amendment rights to disagree [1].
I have no dog in this fight whatsoever; I don't know anybody involved. But in general I have little sympathy for people, however talented, who waste taxpayers' money with bogus legal action.
No, it's not. The GPL on the Linux kernel means that grsec can't distribute a new Linux kernel with their patches while withholding code. That's not what they're doing. If I write a Linux kernel patch on a consulting project, I am absolutely not required to publish it.
> If I write a Linux kernel patch on a consulting project, I am absolutely not required to publish it.
That would be a work for hire, and it is not the same thing as developing patches independently and distributing them with extra terms, because there is no distribution involved.
> The GPL on the Linux kernel means that grsec can't distribute a new Linux kernel with their patches while withholding code. That's not what they're doing.
That would be true if the patches were not derivative works of the Linux kernel in a legal sense. I'm no lawyer, but that seems contrary to the plain meaning of "derivative work".
Not necessarily, in fact perhaps not even usually.
1. The consulting contract might or might not provide for the client to own any work product created. Many such contracts provide that the client will own only the specific end product, while the consultant retains ownership of any reusable "Toolkit Items."
But what if the contract is silent about ownership of consulting work product?
2. As to copyright: Under U.S. copyright law, the default mode is that IF: An original work of authorship is created outside an employer-employee relationship, THEN: The copyright is owned by the individual author (or jointly by multiple co-authors) UNLESS: A) the work of authorship falls into one of nine specific statutory categories, and B) the parties have expressly agreed in writing, before the work was created, that it would be a work made for hire. [0] [1]
3. Any patentable inventions would be owned by the inventor(s) unless they were employees who were "hired to invent" or "set to experimenting," in which case the inventions would be owned by the employer; so far as I recall, this doesn't apply in the case of outside-contractor consulting projects — the client would not own any resulting inventions unless the contract specifically said otherwise. [2]
[1] "A 'work made for hire' is—(1) a work prepared by an employee within the scope of his or her employment; or (2) a work specially ordered or commissioned [A] for use as a contribution to a collective work, [B] as a part of a motion picture or other audiovisual work, [C] as a translation, [D] as a supplementary work, [E] as a compilation, [F] as an instructional text, [G] as a test, [H] as answer material for a test, or [I] as an atlas, if the parties expressly agree in a written instrument signed by them that the work shall be considered a work made for hire. [¶] For the purpose of the foregoing sentence, a 'supplementary work' is a work prepared for publication as a secondary adjunct to a work by another author for the purpose of introducing, concluding, illustrating, explaining, revising, commenting upon, or assisting in the use of the other work, such as forewords, afterwords, pictorial illustrations, maps, charts, tables, editorial notes, musical arrangements, answer material for tests, bibliographies, appendixes, and indexes, and an 'instructional text' is a literary, pictorial, or graphic work prepared for publication and with the purpose of use in systematic instructional activities." From https://www.law.cornell.edu/uscode/text/17/101
> Grsecurity will rather terminate userland programs or, in some rare cases, panic the kernel if it finds itself in an undefined state. This is exactly what you want if you care about security, but it's not a trade-off everyone is happy with (including Linus).
I'd also like my kernel to halt whenever an assertion does not hold, for the sake of keeping my sanity; not just for security.
For the same reason people drive with their “check engine” light on: It’s frequently better to have a working system (i.e. “I’m late for work”), than to chase an indicator that may not represent a real problem (an actual security intrusion).
I can't think of a single useful piece of software nowadays that is exposed to the public and can't run in an active-active, load-balanced or clustered scenario. If your kernel/system/userland app misbehaves it simply needs to be shut down, reported and examined.
It might have been some random memory block the last time your app hit a buffer overflow, but it could just as well be the stack pointer next time...
Remember, we're not necessarily just talking about servers here; every single hospital has mission-critical client machines that cannot go down, and obviously those aren't load balanced or clustered. (Though mostly they seem to be running Windows.)
So what happens when your browser crashes? I experience that on a regular basis. I'd rather have my browser crash/be killed instead of slowly overwriting my filesystem buffers or corrupting my stack pointer...
Other than that, browsers are multi-thread/multi-process applications. Usually only a single tab or a plugin crashes unless the core browser process is affected. Most users would accept the trade off between crashed browser and infected/corrupted system.
> Most users would accept the trade off between crashed browser and infected/corrupted system.
Most users are using computing devices as a means of getting stuff done. They don't want to spend any energy thinking about how their software works; they want their devices to be invisible things they use to run their Apps uninterrupted. The trade-off is whether to let Apps continue running vs hard crashing and taking down all the work they've done and all the mental energy and focus invested up to that point. If their Apps frequently crash, most users aren't thinking, well, I'm super glad the hours I spent on this paper I'm working on are now lost, or that the phone calls to my loved ones or the movie I'm watching are abruptly terminated, because someone's policy of hard crashing when a bug is found has been triggered. Their preferences and purchasing power are going to go towards non user-hostile devices they perceive provide the best experience for using their preferred Apps, without any need for pre-requisite knowledge of OS internals.
There's not a single computing device that frequently crashes as a result of security hardening that will be able to retain any meaningful marketshare. Users are never going to tolerate anything that requires extraneous effort on their part to research and manually apply what needs to be done to get their device running without crashing.
Apps are supposed to keep their state either by saving your work regularly to persistent media or by keeping your data off-client. We're living in the 21st century, in a cloud era, FFS.
Keeping your app running after integrity corruption has happened within the application puts user data at risk. IMHO an application that corrupts the save of your 3-day-long presentation is more frustrating to every user than one that crashes due to an error, leaving you with 5 minutes of unsaved changes lost.
Microsoft have invented "Application Recovery and Restart" exactly for this purpose.
> Keeping your app running after integrity corruption has happened within the application puts user data at risk.
If user data is continually backed up to a remote site, it's not going to be at risk from a local bug, is it? Bugs exist in all software. Users are going to be more visibly frustrated by their Apps frequently crashing than by the extremely unlikely scenario where a detected bug corrupts their "3-day-long presentation". They're going to be very unhappy if the cause of their frequent data loss was a user-hostile setting to hard crash on the first detectable bug.
> Microsoft have invented "Application Recovery and Restart" exactly for this purpose.
From Microsoft website:
> An application can use Application Recovery and Restart (ARR) to save data and state information before the application exits due to an unhandled exception or when the application stops responding.
i.e. restarting Apps due to an "unhandled exception or when the application stops responding", in which case the App is in an unusable state and ARR kicks in to try to auto-recover it with minimal user disruption. The focus is on providing a good UX, not a miserable crash-prone experience where users use their devices in fear that at any time anything they're working on can be terminated abruptly without warning.
You clearly have a limited view of application bugs. Let me elaborate a bit on bugs that cause dissatisfaction and UX frustration without crashing - much, much worse than a simple error message along the lines of: "OS has terminated application X because it has performed an illegal operation."
Data corruption - reading or writing corrupted data - files cannot be read, saved files get corrupted, API calls from/to external applications/systems fail or pass incorrect data
Rendering problems - corrupted images, incorrect colors, improper content encoding, visual stuttering, audio distortion, audio skipping
Input/output lags - unregistered keystrokes, missed actions and responses to external events, mouse stuttering and misbehavior
Improper operation - inconsistent results - repeated rendering yields different results (HTML), formula/calculation results in data are inconsistent (Excel, DWH)
Access violation - access gained to invalid or protected areas - unprivileged access, license violations, access to areas protected by AAA, data theft (SQL injection, database dumps)
and others. If I figured out that the application I'm using (a web browser) allowed a hacker to steal data he would not otherwise have access to, I would be more pissed off than if it crashed and I found an error about it in the system log.
The standard Windows user will not read the system log.
Some people just use computers to do stuff; to them there is little difference between "I lost my work because of a bug" and "I lost my work because of a security policy". From a UX point of view both are the developer's fault for releasing inadequate software.
Meta: Who flagged this comment, and why? What rule exactly did Slavius break here?
On topic: I can't recall the details now, but I once read a paper about a system which had no shutdown procedure at all; the only way to exit it was to crash it somehow or just shut down the computer. The system made sure to save everything often enough and to store the data in ways which allowed restoring possibly corrupted parts of it on the next startup. This design produced a very resilient architecture which worked well for that use case.
The paper was from the '80s or '90s, so it's not like we need to be in the 21st century to design that way. I'll try searching for the paper later.
It's similar in effect, but Erlang's ultimate response to errors is redundancy instead of trying to salvage whatever was left by the process that crashed. I think the transparent distribution of Erlang nodes over the network is what enables Erlang's "let it crash and forget it ever ran" approach. Joe Armstrong said that they want Erlang to handle all kinds of problems, up to and including "being hit by lightning" - so I think hardware redundancy is the right path here.
The OS[1] I've been talking about was primarily concerned with a single-machine environment, which resulted in slightly different design.
That is very unlikely. Crashing would happen 100% of the time though. Most people want that trade-off (meaning: if their browser would crash, they would switch to another one, even if it was less secure).
Corrupting SP is part of almost every exploit and I can guarantee you that it is very likely (going to cause harm on your system).
Try pulling the Metasploit Git repo to get some idea of the thousands of payloads that do corrupt the SP without crashing the host...
Say there's a minor error in a network driver. Yes, it might be exploitable by a smart person. But the error only triggers once a day when a counter rolls over.
Do you really want your box to lock up and panic when this error is encountered, or do you just want your box to keep working?
I'm firmly in the first camp (I'll take lock up and freeze, thanks) but 99% of users don't care about a bug like that and just want the box to keep working.
But do you want your box to silently send corrupted data for the next two years? Or would you rather reboot every night, and maybe escalate to your Red Hat support contract, where someone will then fix the underlying bug (for which you now have crashdumps)?
That Red Hat support contract won't save you from a bug in a binary-blob network driver.
Crashing the whole kernel at the drop of a hat seems like a pretty extreme stance to take as a general policy IMHO. Killing and restarting the driver will usually suffice, although some data may be lost and have to be retransmitted.
As so often, it really depends. Let's say you've just detected that you're going to send incorrect data because you've ended up in an indeterminate state.
If the remote end is going to ignore that data anyways, would it really be such a bad idea to keep running? Do you really want to go down in order to ensure that a remote who's ignoring your data can get correct data to ignore?
Of course you never know what sort of effect the corrupt data is going to have, so it's always hard to make that decision.
Like that issue with libraries linking against Objective-C frameworks that cropped up with High Sierra and broke most of the Ruby world: yes, the usage was incorrect. Yes, forking after threads are launched leads to undefined behavior; yes, knowing about it is a good thing.
But: So far the crashes have been rare in the common use-cases (or they would have been fixed), so High Sierra's change to blow up loudly when it detects the misuse has actually caused a lot of trouble for people where things worked fine before.
To the point where many Ruby developers were complaining about High Sierra "breaking" their workflow and recommending against upgrading.
The new check is totally justified though. The existing forking behavior was wrong and it could have led to crashes down the line. It didn't though. And now people are forced to fix something that was never an issue to begin with.
It's a fine line to walk, and while I generally prefer things to blow up as they go wrong, sometimes I catch myself wishing for stuff to just keep working.
On some self-reflection, I come to the conclusion that I want to have my cake and eat it too.
The rule I work to when I design these types of systems is that if the source of the error is internal, you should reset and avoid propagating the error.
Conversely, if you receive an error from an external source, you should handle it gracefully and reject the bad message.
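A minimal sketch of that rule in C, with made-up names: external input gets validated and rejected gracefully, while a violated internal invariant stops the component so the error does not propagate:

    #include <assert.h>
    #include <errno.h>
    #include <stdio.h>

    struct msg {
        unsigned len;
        unsigned char payload[256];
    };

    static unsigned processed;                    /* internal bookkeeping */

    int handle_external_msg(const struct msg *m)
    {
        /* External source: handle it gracefully, reject the bad message. */
        if (m == NULL || m->len > sizeof(m->payload)) {
            fprintf(stderr, "rejecting malformed message\n");
            return -EINVAL;
        }

        processed++;
        /* Internal invariant: if our own counter ever wraps, that's our bug;
         * stop here (reset) instead of propagating a corrupted value. */
        assert(processed != 0);
        return 0;
    }

    int main(void)
    {
        struct msg bad  = { .len = 9999 };
        struct msg good = { .len = 4 };
        handle_external_msg(&bad);            /* rejected, program keeps running */
        return handle_external_msg(&good);    /* accepted */
    }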
On the other hand, the mere act of panicking may corrupt data (by virtue of stopping processes). I learned this the hard way when my kernel panicked while I was shrinking a large ext4 volume (the panic was unrelated to the shrinking). It's not just a simple equation like you've claimed.
A panic should stop the processor dead; no data should be corrupted as a result.
Data in flight should not be used if you use transactional I/O and therefore will not be used if a write does not complete.
The linux kernel needs to work with both tolerant and non-tolerant systems. Saying it needs to work a specific way that completely breaks real world things is completely naive, and exactly what Linus was railing against.
Well, only that most devices were never tested over months on a development kernel... and it is not possible to do so, with all those millions of different devices around.
I was writing paper, on a PC, that was like "pip pip pip pip pip" and then... like half of my paper was gone.. and I was like... It devoured my paper. It was really good paper. And then I had to write it again and had to do it fast so it wasn’t as good. It’s kind of... a bummer.
First things first: Kernels panic and processes crash. If your medical equipment or telco/ISP system can't recover from that then you're in trouble anyway. Why they crash doesn't really matter in that context.
As far as voting machines go, kernel panic sounds waaay better than executing malicious code.
> Imagine a security f*ck up, like Heartbleed, but this time with an option to halt kernels / systems.
IIRC heartbleed didn't allow you to execute code (it allowed you to read more memory than you should have been able to). A better example is every flash player bug ever. Would you rather that thing crashes or executes malicious code? Keep in mind that the malicious code can also shut down your system.
Also keep in mind that we're talking about userspace programs right now. This thread is about kernel bugs. Userspace programs already have the option to ask the kernel to kill them if they misbehave. A lot of them do that (using features like seccomp filter) and many more should. (Chrome and Firefox both use seccomp filter I think.)
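For a concrete (if deliberately minimal) example of that opt-in: strict seccomp mode only permits read/write/_exit/sigreturn, and any other syscall gets the process SIGKILLed. Real sandboxes like Chrome's use the far more flexible SECCOMP_MODE_FILTER, so treat this only as the smallest runnable demonstration of a process asking the kernel to kill it if it misbehaves:

    #include <fcntl.h>
    #include <linux/seccomp.h>
    #include <sys/prctl.h>
    #include <unistd.h>

    int main(void)
    {
        /* From here on, only read/write/_exit/sigreturn are allowed;
         * anything else means immediate SIGKILL. */
        prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

        static const char ok[] = "write() is still allowed\n";
        write(STDOUT_FILENO, ok, sizeof(ok) - 1);

        open("/dev/null", O_RDONLY);              /* not on the whitelist: killed here */
        write(STDOUT_FILENO, "never reached\n", 14);
        return 0;
    }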
Let's say somebody gives you a USB stick and you plug it into your laptop. Which of the following scenarios would you like to see?
1. 0-day in the kernel's USB code. You're part of stuxnet now.
2. 0-day in the kernel's USB code. You're part of stuxnet now. You also get a message that tells you how and where to report the bug that was exploited.
3. 0-day in the kernel's USB code. Your computer crashes. You're not part of stuxnet. You also get a message that tells you how and where to report the bug that was exploited.
Linus is wrong (it happens). Exploit mitigation techniques aren't debugging tools. They're exploit mitigation techniques. The fact that they also produce useful debugging information is secondary.
This is exactly the kind of thinking Linus is talking about. In 3, I lost my work. Possibly very important work. To most people, being a part of stuxnet, while undesirable, is preferable to losing their work.
And you neglected a scenario 4: nobody is attempting to compromise my machine, but a buggy bit of USB code just crashed my system and took all my work with it.
You never learned at school to save what you're working on often? It's crazy; you are either too old to have needed a computer for school work or too young to have lived through years of constant bluescreens.
Both 3 and 4 are mitigated by you saving your document often... it's not so bad, considering it can happen whatever you do.
Nowadays, Word is made to keep saving your changes for that reason... They learned, and designed it for the worst situation, which is a whole-system crash. If you can't handle that, well, you aren't doing your work well.
Again, passing the buck. You know what I do instead of use your software that crashes all the goddamn time? I use someone else's software that doesn't.
Yes, of course we should save often, have decent backups, etc. But nobody is perfect and shit happens, and it'd be nice if the software you use didn't intentionally make it worse.
The problem is, what actually happened (in a previous commit) was:
The IPv6 stack does a perfectly sensible and legal thing. The hardener code misunderstands the legal code, and causes a reboot.
That is what Linus is worried about -- often it is hard to tell the difference between "naughty" code which can never be a security hole, and genuine security holes.
They should all be fixed ASAP, but making code that previously worked make a user's computer reboot, when it is perfectly fine, is not a way to make friends.
Bugs in the hardening code are obviously bad and annoying, but that's beside the point. All bugs are bad and annoying, especially ones that cause a kernel panic. I don't think anybody is going to argue with that.
That's not what Linus said though. What he said is:
> when adding hardening features, the first step should *ALWAYS* be
> "just report it". Not killing things, not even stopping the access.
> Report it. Nothing else.
and:
> All I need is that the whole "let's kill processes" mentality goes
> away, and that people acknowledge that the first step is always "just
> report".
"Not killing things, not even stopping the access." Oh boy.
Step back a bit: when developing a new SELinux policy, won't you develop it first in permissive mode, and only after it's working without warnings enable enforcing mode? It's the same thing here: the hardening should be developed first in a "permissive" mode which only warns, and then, after it's shown to be working without warnings, changed to "enforcing" (in this case, however, after some time the "permissive" mode can be removed, since new code should be written with that hardening in mind).
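One practical detail of such a "permissive" mode, sketched below with made-up names: report the violation loudly but only once per call site (in the spirit of the kernel's WARN_ON_ONCE-style helpers), so the warning reaches developers without flooding the logs:

    #include <stdio.h>

    /* Report a violated rule loudly, but only once per call site, so the
     * log stays readable while the warning still reaches developers. */
    #define REPORT_ONCE(cond, msg)                                   \
        do {                                                         \
            static int warned;                                       \
            if ((cond) && !warned) {                                 \
                warned = 1;                                          \
                fprintf(stderr, "hardening: %s (%s:%d)\n",           \
                        (msg), __FILE__, __LINE__);                  \
            }                                                        \
        } while (0)

    int main(void)
    {
        for (int i = 0; i < 5; i++)
            REPORT_ONCE(1, "rule violated in permissive mode");  /* prints once, not five times */
        return 0;
    }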
I didn't mean that to sound like I'm in favor of turning the thing on right away.
(Also, the quotes I chose don't really help me make my case but I don't want to edit now since you've already commented on it. His first mail is way worse: https://lkml.org/lkml/2017/11/17/767)
Basically what I'm disagreeing with is that exploit mitigation's primary purpose is finding and fixing bugs. That's just not true. Its primary purpose is to protect users from exploitable bugs that we haven't found yet (but someone else might have).
By first step, Linus just means "for a year or two". Yes it would be nice to put super high security on today, but instead we slowly turn up the setting, from opt in to opt out to forced on, to ensure we don't break anything.
4. 0-day in the kernel's USB code. Your computer crashes. You're not part of stuxnet. You also get a message that tells you how and where to report the bug that was exploited, but the part of your computer that was supposed to log the message died with the rest of the system, so you never see it and the bug never actually gets reported. Your computer continues to crash randomly for the next few days as an infected computer keeps trying to spread.
Hooold it. Some of those things are not like the others.
--
I pity the engineers working on ventilation machines and the like. Medical devices are insanely hard to get right; that's neck and neck with aviation testing. I'm reminded of SQLite3's "aviation-grade" TH3 testsuite, which apparently has 100% code coverage. Let's be honest; Linux's monolithic design can't really attain that.
I would never use Linux for a medical device. I say this as someone who just happens to only be running Linux on every machine in the house right now (and I have for years, it's just how things have worked out, it's not at all novel or whatever, my point is that I'm totally comfortable with it). I'd use L4 or something instead. In a pinch I'd use a commercial kernel with tons of testing. Maybe I'd even use Minix; I'm quite sure a lot of people in industry are seriously looking at it now Intel have pretty much unofficially greenlit it as a good kernel (lmao).
--
Voting machines, on the other hand; I'd totally use Linux for that, because the security/usage model is worlds apart. Here, I WOULD ABSOLUTELY LIKE FOR THE TINIEST GLITCH TO CRASH THE MACHINE, because that glitch could be malware trying to get in.
The user experience of a voting machine is such that you walk up to it, identify yourself, and push a button. Worst case scenario in this situation is that you do some involved process to ID yourself and then the unit locks up, so you have to redo the ID effort on another unit. That is, for all use cases, not going to be a problem.
(I think that's the first time I've used all caps in years!)
--
Telecom systems... those are also a totally different world. See also: Erlang. In this situation you would likely want a vulnerability to literally sound a klaxon on a wall, but have the system still keep going.
I'm reminded here of an incident where a country's national 3G system was compromised (not the US, somewhere else) by hackers and the firmware of the backend systems was hot-patched (think replacing running binary code - the OS allowed it, it was REALLY hard to even notice this was happening) to exfiltrate SMS messages and cause calls to certain numbers to generate a shadow call (which ignored mic input) to an attacker-controlled number as well.
Telecoms is a classic case of massive scale; nowadays a single telecom switch might be routing thousands of calls through at a time. Yeah you don't want even a single machine to go down. But you DO want VERY thorough debugging, auditing and metrics.
(Which apparently don't exist.)
--
As for a Heartbleed-esque catastrophe, apparently one is going to be announced for Intel ME at the upcoming Blackhat(?) conference in December. I can't wait to hear about it myself.
Erlang's error-handling model is good (and interesting). The motto is: "Let it crash".
Each node does not handle errors at all, but PANICs on a fault. It is up to the supervisor (with global knowledge and state) to handle the fault appropriately.
It's good for uptime, but not good for correctness. The main problem is that it is hard to differentiate expected from unexpected crashes. Something like a missing pattern match can lead to a crash and it is very hard to know if the programmer "intended" for a crash to occur in that case or if the missing pattern is a bug.
You can have processes that reboot once every few minutes running for years because people didn't realize they were bugged.
>As for a Heartbleed-esque catastrophe, apparently one is going to be announced for Intel ME at the upcoming Blackhat(?) conference in December. I can't wait to hear about it myself.
Light on details, but the vulnerabilities are disclosed and fixed [0]. ME updates are already available from many OEMs.
Right. But "don't apply the patch!" is sort of circling as well, because (presuming the Blackhat disclosure is workable, it sounds like it will be but fingers crossed) we might be able to play with our MEs.
Many medical devices run Linux. Most (AFAIK) patient monitors run Linux; GE and Philips (the biggest in the business) both run on Linux. Those are the devices that keep you alive during surgery, make sure that those who are born too early (I don't know the English term here) are doing ok, monitor your state while you are in an ambulance, etc.
I'm reminded of a UAV doing the same thing. It ran L4 for low-level control, realtime scheduling, and security, and then virtualized Linux on top of that.
Sounds unbelievably clunky on the surface, then you realize it's a remarkably useful way to abstract everything cleanly.
For safety-critical systems, resetting on a fault is very much factored into the worst-case response time and expected behaviour.
PANIC on fault is exactly what you design into the systems.
What you find is that the truly safety-critical portion of the system is running on a microcontroller and the UI (which is not safety-related) can run on Windows or Linux.
Seems like so many in security fail to see the DoS implication.
There is no solution to bugs other than fixing them. And that's what Torvalds and others have been saying: for a security researcher, finding the bug is the end of the job. For developers, that's just the start.
A better way, of course, is not to halt on an assertion, but to limit the scope of any potential problem in such a way that an assertion could only crash a tiny isolated thing and trigger its restart, possibly without impacting availability whatsoever. You still get your sanity, but also get users happy with a rock-solid thing that just works even in the presence of errors.
The Erlang VM works this way, if someone's looking for a program that does this in practice. They have the mantra "let it crash": something higher than you is in a better position to handle your error and restart you to a known good state.
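A rough C analogue of that supervisor pattern (a sketch of the idea only, not how Erlang/OTP actually implements supervision trees): the worker aborts on any violated invariant, and the parent notices the abnormal exit and restarts it from a known-good state:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void worker(void)
    {
        srand((unsigned)getpid());
        if (rand() % 4 == 0)
            abort();                        /* invariant violated: let it crash */
        sleep(1);                           /* ...otherwise do one unit of work */
        _exit(EXIT_SUCCESS);
    }

    int main(void)
    {
        for (;;) {
            pid_t pid = fork();
            if (pid < 0)
                return EXIT_FAILURE;
            if (pid == 0)
                worker();                   /* child never returns */

            int status;
            waitpid(pid, &status, 0);
            if (WIFSIGNALED(status))
                fprintf(stderr, "worker died with signal %d, restarting\n",
                        WTERMSIG(status));
        }
    }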
I think a good tradeoff could be that with containers, the individual containers are hardened, whereas the kernel's host OS is not. The host OS doesn't do much except keeping the containers running.
AFAICS, Linus also wants it, but he wants a panic to be preceded by a rather lengthy span with just a warning, allowing the concerned dev to actually fix the error. Essentially he's saying: "take it slow, and don't break user experience."
Due to the aggressive nature of grsecurity, a lot of the assertions it trips on are bogus; either they didn't understand the code they were securing or they changed the rules without properly updating all the affected code. For example, there was a particularly obnoxious panic in the tty layer a few versions back that was entirely the result of this.
Servers: yes, phone/home PC: no. Maybe my dev machine but my wife wouldn't be happy with a kernel panic while writing an email. Like Linus said, this would go unreported with the average user because they just reboot in annoyance.
When the invalid write overwrites some piece of data your application doesn't care about (or more likely some feature in some driver you don't care about). Especially when the trade-off is the web site goes down.
You don't want your machine to start crashing after installing the latest kernel. Or at least, that's the golden rule of Linux development.
If phones start crashing after installing the latest Android update, people won't see this as a security/stability improvement. They'll simply see the new version as buggy and of poor quality.
It's a deliberate trade off. Not panicking results in uncaught exploitation attempts, and panicking will result in crashes where a vanilla kernel would happen to survive.
If an assertion does not hold, you have a real problem. The kernel has in-memory data corruption. Either from buggy code, bad memory, solar radiation, etc.
So if that assertion is in the file system, maybe your kernel should die before it corrupts your data permanently.
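A tiny userspace sketch of that fail-stop idea (the record layout and checksum scheme are invented for illustration): verify the in-memory structure before it is ever written back, and die rather than persist garbage:

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct record {
        unsigned payload;
        unsigned checksum;      /* invariant: checksum == payload ^ 0xA5A5A5A5 */
    };

    static void write_back(FILE *f, const struct record *r)
    {
        /* Die here rather than persist a record whose invariant is broken. */
        assert(r->checksum == (r->payload ^ 0xA5A5A5A5u));
        fwrite(r, sizeof(*r), 1, f);
    }

    int main(void)
    {
        struct record r = { .payload = 42, .checksum = 42u ^ 0xA5A5A5A5u };
        FILE *f = fopen("record.bin", "wb");
        if (!f)
            return EXIT_FAILURE;
        write_back(f, &r);
        fclose(f);
        return EXIT_SUCCESS;
    }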
Depends. How attached are you to getting work done today?
I too would like to live in a world where sanity preserving assertions have rational consequences. But that world is not this world, and pretending it is won't help you get there.
"Grsecurity will rather terminate userland programs or, in some rare cases, panic the kernel if it finds itself in an undefined state. This is exactly what you want if you care about security, but it's not a trade-off everyone is happy with (including Linus)."
If you really cared about security, you'd leave the box unplugged.
"I remember many bugs that were uncovered by PAX_REFCOUNT and yes, occasionally panicked the kernel where a vanilla kernel would run just fine. They usually found and fixed those within hours."
Speaking as someone who has done middling large scale production administration, that's not reassuring.
Because Linus hates them? Make Brad Spengler the primary maintainer of the Linux kernel and we may actually get self-driving cars that don't kill us when they get hacked in 5 years.
> Make Brad Spengler the primary maintainer of the Linux kernel and we may actually get self-driving cars that don't kill us when they get hacked in 5 years.
But only because every self-driving car project out there will avoid Linux like the plague.
Instead of self-driving cars maybe crashing from being hacked in some possible future, we'll get kernel panics leading to crashes in all possible futures, because we'll trigger car crashes on every false positive - because crashing in the face of the unknown is a seemingly acceptable solution to a security risk, even if the software is running self-driving cars and crashing may mean crashing. In practice, false positives are way more common than exploits, and in many use cases users would rather have 1 computer exploit than 1,000 or 10,000 crashes.
You would rather have a self-driving car in an undefined state than have it shut down? A random glitch could be just as bad as an exploit; if some chunk of memory gets overwritten and your car decides that the brick wall doesn't actually exist any more, I don't think whether it was an exploit or not really matters. The occupants end up injured either way.
The undefined state might be in the GPU driver handling the heads-up display, or maybe in the sound subsystem. No need to shut down the system at the kernel level for that. Report the issue to the userland, so that it can decide whether to initiate a safe halt at the sidewalk or emergency lane.
After telling the kernel to shut down immediately you don't have that option any longer.
It's interesting to see this laser focus on a particular kind of user. If you're running Linux on a server, you're a user, but unless you're very irresponsible you would probably rather your programs crash than give away private information. Your interface is to a cluster of machines where individual crashes are probably not that big a deal.
If you're running Linux via Android, you're a user, but mostly you're a user of actively developed apps on top of an actively developed OS, usually pegged to specific kernel versions. Your interface is to that layer on top, and given that its code is written by app developers and hardware vendors who will ship anything that doesn't crash, you probably want security bugs to crash.
It seems to me that the kind of user Linus means when he talks about "the new kernel didn't work for me" is a user of Linux without any substantial layers on top, where kernel updates happen more often than userland software updates, and where individual crashes have a significant impact. In other words, users of desktop Linux.
But I wonder if that focus on desktop Linux really reflects the majority of users. And, if not, perhaps it might make sense to have "hardening the Linux kernel" as the first step if it makes "raise the standard for the layers built on top" the endpoint.
>you would probably rather your programs crash than give away private information
Crashing on a security issue is a good thing for every kind of user. Crashing on a latent bug that COULD be exploited (maybe not possible at all) is a totally not desirable situation. The problem here is that hardening methods lack the ability to make that distinction.
> Crashing on a latent bug that COULD be exploited (maybe not possible at all) is a totally not desirable situation.
How do you square this with the reality that "keep on truckin" is generally the path from bugs to security exploits, and has been shown to be over and over in the wild?
Bugs will happen; that's a natural law of computer science. If you keep on trucking over them, you will be delivering buggy software that is likely to cause problems. Even if you chase them down and correct them all, your software is still going to have bugs; that's a fact of life.
Should code containing bugs be allowed to run? If the answer is no, we must ask ourselves how much software we have today that is completely bug-free (that will be 0%).
I still think these proactive approaches are good for disclosing possible exploits, but killing processes just because they might be exploitable is a very long shot.
One of Linus's points is that most of the crashes will not be because of an active attack but because of a possibly latent bug that sometimes appears and might very well be non-exploitable. So people running servers would probably like to have everything working instead of having random crashes in processes or drivers that are not the core of their service but can affect it. And given the size and complexity of the kernel, it would not be strange for these crashes to appear only on certain setups, and not necessarily on the ones of the people testing it first.
Errors are less likely to be actual exploits on servers, too. When a kernel panic is caused by a faulty driver, failing network hardware, or userland software failures, it can take down multiple servers or all of them at once.
Most of the information on servers is private but not sensitive. You don't want anyone to have access, but correct functioning and security warnings are more important than maximum information lockdown.
Btw. I don't see a reason for not having a kernel option to turn warnings into kernel panics.
It’s not even desktop users; most desktop users download Ubuntu and never touch anything, on reasonably common PC hardware. Kernel regressions mostly get caught in the Ubuntu betas or testing tracks (e.g. Debian Sid).
The typical user the kernel developers focus on here is a kernel developer: always running the latest kernel with a stable user space. I find it extremely narcissistic that they reject security improvements for billions of devices for what essentially just makes developers' lives easier.
Very pragmatic. He sees software in the overall context of getting a job done with a computer, imperfect though it may be, instead of dying because it was not perfect.
Unlike a segfault from a user space program that indeed merits a 'kill', the kernel should strive at all costs to keep running, since kernel panics are so much more inconvenient.
I just wish that the higher layers of the stack would see it as well, rather than being angry at Torvalds about it and going on about how users are dumb sheep.
If “do no harm” is a principle, then the kernel should ensure that no harm is taking place.
If flaws within the kernel allow harm to occur while otherwise normal transactions are occurring then it is absolutely preferable to panic and shut down over allowing that potential harm to occur.
To suggest otherwise, that detected errors that allow harm should be allowed, is pure insanity.
A thought experiment that comes up in Kernel design classes is what should happen if the OS was running the flight-control software for an Airplane you are on? If there was a bug in the kernel, perhaps a double free or a memory leak, what should happen?
A panic would result in the airplane falling to certain doom. But if it were to keep running, it may be a security vulnerability. Being absolutist in either direction of the discussion will lead to absurd scenarios where you would make the wrong decision.
> double free or a memory leak, what should happen
Both offensive and defensive programming are important in safety-critical programs, and I get your point, but those things you mention don't happen in safety-critical systems.
There is no dynamic memory allocation. The RTOS used will support "brick wall partitioning" for memory, processing and other resources. Different systems can run in the same OS but they can't compete for processing time, locks or memory access. Everyone has been dealt the resources they can have from the start. It's not possible to run out of file descriptors or memory if you allocate them statically from the start.
Assertion errors or monitoring errors in safety-critical systems usually cause a reset or a switch to a backup system. If the program state is large and a reset is not safe, retreating to some earlier state (constant backups) is likely.
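A minimal sketch of that static pre-allocation style in plain C (sizes and names are illustrative, and a real RTOS would enforce the partitioning for you): every object comes out of a fixed pool sized at build time, so running out at run time is a design error rather than an OOM condition:

    #include <stdio.h>

    #define MAX_MSGS 32

    struct msg { int id; char body[64]; };

    static struct msg msg_pool[MAX_MSGS];   /* dealt out at start-up, never freed */
    static int        msg_used[MAX_MSGS];

    static struct msg *msg_alloc(void)
    {
        for (int i = 0; i < MAX_MSGS; i++)
            if (!msg_used[i]) {
                msg_used[i] = 1;
                return &msg_pool[i];
            }
        return NULL;   /* pool exhausted: a sizing error caught at design time, not OOM */
    }

    int main(void)
    {
        struct msg *m = msg_alloc();
        if (m)
            printf("got slot %ld from the static pool\n", (long)(m - msg_pool));
        return 0;
    }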
The kernel is modular. Literally everything can be enabled/disabled.
Aviation has strict regulations, and that's why most critical systems have redundant parts. Putting a single critical component into a plane is stupid in and of itself. Think of simple freezing at high altitude, or overheating otherwise.
On the other hand, I would rather fly in a plane whose altimeter shuts down and switches to a redundant circuit rather than letting it report incorrect values...
You seem to be one of those "security people" Linus refers to. ;)
Harm is relative. You security people think that every single security issue is so important that it doesn't matter what harm mitigating it may cause - it has to be done. Well, that is not what Linus thinks.
The kernel may terminate a process because it did something suspect, but doing that may actually cause way _more_ harm.
The philosophy here is that security bugs are just bugs.
No. First, you never know if an invalid access or integer overflow can actually be exploited, and even if it can, you don't know if it can be exploited remotely. If you run a server hosting sensitive data, by all means use grsecurity, but on my home PC where I browse Facebook and send emails, fuck off with your kernel panic.
But as he explains, these are latent bugs. They may or may not be targeted for exploitation yet. Meanwhile, hard crashing over them can lead to a poor user experience, while throwing up warnings and leading to an actual fix would be better from the user's PoV - unless of course userspace actually depended on the buggy condition, but that's another discussion.
It's not so black and white. Medtronic uses Linux. Do you want to be the guy whose medical equipment spontaneously reboots because of a bug that wouldn't have otherwise affected anything?
Linux is a modular kernel. I'm not aware of a single thing you can't disable or make modular during config/compile.
I wouldn't like to be the guy whose medical equipment killed him by slowly decreasing his oxygen levels due to a buffer overflow either. If you're in this kind of business, you take responsibility by discovering and fixing bugs which would go unnoticed otherwise. And if the lives of your patients really depend on your equipment, then having a redundant component within your device is a must.
You mistake "do no harm" for "don't let the user do harm". This is "do no harm" in the sense that Hippocrates said it; your job as a kernel security dev is to not harm the user, just as a doctor's job is to not harm his patient.
"without users, your program is pointless, and all the
development work you've done over decades is pointless.
.. and (then) security is pointless too, in the end."
He tends to get really mad when kernel devs inconvenience user space devs. Perhaps one of the reasons Linux succeeded was this fanatical customer focus - if Linux is the platform, user space developers are the customers.
Not breaking user space is fundamental to the success of most operating systems today. I read an article I cannot seem to find about Microsoft employees spending months replicating "wrong" behavior in <oldwindows.. win95?> so that applications still ran on <newerwindows... win98? win2000?>. Users don't want to see their applications crashing.
I wish I could find this article again, it's very relevant to your comment.
The person at the other end of the conversation would disagree with this sentiment:
"Thanks. Still, I'd prefer Linus yell at me than other folks trying to do similar work. If I can shield anyone from this abuse, then maybe they won't give up on kernel security development. Digging Linus's actionable feedback out of the ad-hominem attack can be challenging." [1]
I am happy Linus shields Linux kernel from those golden-hearted individuals whose feel-good attitude would allow questionable solutions designed by committee to slip in.
That's a consequence of an "old" issue in the IT security field - security researchers and developers sit at opposite sides of the table, they've got different concerns and agendas.
Pick some security researchers; now tell them to build any nontrivial piece of software; I doubt they'd be able to do it, and if they succeed their software will be full of bugs, including security ones.
Security is part of the correctness and of proper building of the software, so it should be integrated into software development. Security experts can (and should) still exist, but the current state, where the infosec people appear to rule, is pointless - exactly because the same infosec people wouldn't be able to deliver better software than current developers.
I highly regard somebody who can write software without security bugs; I don't regard as highly somebody who shows me the bugs but would be unable to write that software at all.
Let's turn the infosec objective around: not discovering security bugs, but writing software without security bugs. Then we're on the same side of the table.
That's not a fair (or useful) assessment. Obviously, the narrow-minded security people you describe exist, but they're a minority. Many security people are developers who specialized in security, and are very much capable of building software.
The kernel code in question is exactly what you ask for - instead of finding and fixing single bugs, it's a mitigation that prevents all occurrences of a particular class of bugs.
The mitigation does not prevent a particular class of bugs; it prevents a particular class from being exploitable, by turning an invisible-but-possibly-exploitable bug into a crash bug. The bug is still there, but now it has a larger impact on most customers.
That's a serious tradeoff, especially since (as Linus is complaining) turning a rare bug into a crash bug doesn't allow you to detect and fix it. You need a mode where the bug is logged but the process is not stopped (and might be exploited), so that the bug can be reported, reproduced and fixed.
Of course it's "anecdata" because I've never personally conducted a test of all security consultants for their programming skills.
But the people you're describing (developers who specialized in security) maybe exist in large shops (Google, MS, FB), while many, many security consultants work for specialized firms that offer security services but not software development, and vice versa. Take a look at most Defcon/BlackHat talks where a vuln is explained/uncovered/exploited: most such researchers don't belong to a software development firm, but to independent security firms.
Why should my assessment not be useful? I proposed a very clear solution to what the problem is.
Source: I worked for almost 10 years at a firm that had both a software development and a security services branch, met tens of security consultants, and worked on remediation activities for software issues where the security consultants weren't able to do it.
EDIT:
about the kernel code, I agree with you that such code is a step in the right direction, but I agree with Linus that the "warn" should come before the "kill".
When I was looking to learn how to systematically make secure software, I did not find all that much great, actionable information. There is a lot about particular hacks and vulnerabilities, lists of popular vulnerability categories, etc. There was one book I found dealing with architecture and such. Development quite clearly is not the focus of security research.
A lot of the advice, especially that found on blogs, was downright naive and felt like something written by someone who had never even seen a larger team working.
As an infosec guy who was a software developer, it's non-trivial to write actionable general security advice.
There is an entire academic field of study on making network-related security blunders hard (langsec). It generally boils down to: do all your parsing in one spot, and a small set of features are evil.
What is really needed is a site where one can pick the features that your software project has/wants and then get semi-tailored advice on what to do, what to watch out for, or whether you need to rethink things (e.g. rolling your own TLS implementation = world of hurt).
Many developers care about security the same way as security researchers.
Developers who care about design by contract, assertions, having warnings as errors, pedantic flags, getting the CI system to break builds on static analysis errors, only allowing unsafe code when profile measurements confirm it actually matters, ...
Yet, we are able to deliver working software that fulfills project requirements, amazing!
>It's that the code has been RUN BY USERS for months. If it's been [...] in grsecurity for five years [...] It only means that hardly anybody actually ever ran it.
Subtle burn towards Grsec, I laughed a bit.
In all seriousness, I think Linus is somewhat on the right track. Security patches should foremost not break anyone's workflow (except maybe the evil haxor's workflow) and should rather print a warning until the exact implications of a full Terminator-mode patch are understood. Because people won't upgrade to kernels that break their workflow, and a warning in the kernel log is better than a vulnerable kernel.
It killed printing from iframes completely. Great that they solved the security problem, whatever it may have been, but they also broke a major piece of browser functionality that a lot of enterprises fundamentally rely on. Hell, even printing shipping labels from eBay was broken, and so, hilariously, was Microsoft Dynamics 365. And our own product.
And the first response by Microsoft on this very link? "Won't Fix", leaving a major bit of fundamental browser functionality - printing a document - not working. "Do No Harm".
Because there is a workaround, applications just need to be updated.
"Either use Print Preview instead of print button or if currently using window.print() in javascript change this to document.execCommand('print’, false, null)"
> Because there is a workaround, applications just need to be updated.
What actually happened: people reverted the patch. In the real world, you can't expect a timely or even correct response from vendors you rely on. It sucks, but that's how it is.
Updated away from web standards to false, null, and a random string. And then they made window.print work again.
Additionally, what guarantee was there that execCommand(magicString), where magicString is `print`, wouldn't be removed in the next security "fix"? That fix should have broken this too, as it did window.print; after all, it seems to do the same thing. Answer: none at all, and then you're back to square one.
Some security folks tend to have tunnel vision. They are focused on their own little patch, and the first instinct is to control and shut down with zero concern for usability. Job done. Bye. But that's too self-serving and lazy.
And if that's what they want, fine: release your own app or distro and let the people who need or desire that level of security choose security over everything else, without compromise.
The worst thing is a kind of entryism: trying to impose yourself on people operating under different constraints. That's why you always need a strong manager to balance interests, and this is the role Linus is playing. More often than not, with this kind of imposition something comes out of the blue and you are left squandering hours to fix things and get back up and running.
1) Failing loudly is better than failing silently. A memory corruption issue (or a bad refcount, etc.) is not a benign issue that only becomes relevant under carefully crafted exploit conditions. You need the carefully crafted exploit to get the system into an attacker-controlled state (i.e. code execution); by itself (with non-malicious inputs, usually something random or slightly atypical: unusual enough not to have been noticed yet, but typical enough that some program does it), the system is likely either to panic immediately (the same result as with PaX) or to corrupt some memory, in which case you will have a lot of strange behaviour to track down later. Users will probably blame that on hardware or on their user space, so you might never see it; for example, a recent OSDI paper showed that ext3/4 had several real-world data corruption bugs, and if those aren't as frequent as the recent bcache issues, no one notices.
2) When I was doing research projects (into memory defenses in the kernel) about three years ago, there was no commonly used automated testing infrastructure in the kernel that I saw. That makes regressions, especially in drivers for rare hardware, hard to catch. While tests aren't a panacea, I think Linus overestimates what fraction of problems code review will catch.
3) The "don't break user space" strategy is already failing. Every mainstream distribution and embedded vendor stays on an old kernel branch, and big deployments do staged rollouts and extensive burn-in tests. This isn't just because of the kernel, but because of breaking changes everywhere (compilers, standard libraries, etc. all need to change sometimes). The last time this happened to me, IIRC, it was some audio bug in a strange configuration. In my experience, running a non-standard Linux audio config causes countless breakages anyway, so an additional one in the kernel that might save my personal data from being exfiltrated is worth it. Most users have average (and therefore well-tested) setups, which means they won't see breakages as often.
Perfect software doesn't exist, and even MSFT backed off from maintaining religious backwards compatibility (note that Microsoft's approach was not to flame developers and hinder new development, but to build compatibility shims extensively; often these came with trade-offs strongly in favour of security, e.g. UAC).
Breaking user space is ok; users already expect breakage, and the cost of the additional breakages is low (to users and to society as a whole) compared to the cost of security breaches [citation needed, but Linux kernel security is relied on in a lot of places].
So one of Linus' main points in this series of posts is that failing loudly is actually not always better than failing silently or quietly, and it's really annoying when people come in making that assumption without thinking. This is also something that he is constantly repeating and ranting about, and it's arguably one of the reasons why Linux is so successful.
Think about a smartphone - do most users want it to crash and reboot, even if some error (which could end up being a security issue) occurred? The answer is no, absolutely not. The crashing and rebooting itself isn't really that helpful. Reporting the bug to the Linux developers _would_ be helpful.
Some people do want the frequent crashing behavior and that's okay, but it's not okay to make that decision for everyone.
Also, users might expect minor breakage if someone somewhere makes a mistake, but that doesn't mean it's okay. That's like saying if someone always washes their hands before eating, it's okay if they get sick, because they were expecting that they might get sick.
> Breaking user space is ok; users already expect breakage, and the cost of the additional breakages is low (to users and to society as a whole)
Which is why everyone loves rolling releases so much that Windows 10's forced upgrades are universally praised and Linux Desktop has a dominant market share.
If your security patch kills users' buggy processes or even crashes their systems, then you are a «bad security person». Please report the bad access first, so that users and the developers of their software have time to fix the bug. Upgrades that disable users' software are a big no-no. After all, security is meaningless on a non-working system.
Linus has been pounding on the bench saying "Don't Break Userspace" for over two decades now. This is really just another manifestation of that same policy. Some 'security' folk apparently believe they occupy a special place in which they are exempt from this policy. Linus is showing them they're wrong; evolving the kernel into a minefield in the name of security is not the way to world domination.
And I think a big part of Linus's point is that if you break userspace for security now people will be less likely to upgrade their kernels in the future.
So they'll be secure from a hypothetical possible bug today, and completely vulnerable to real demonstrated security exploits in the future.
I’m so glad that backwards thinking concepts like this are dominant, otherwise we might actually have secure software!
Think about it, an open-source OS is choosing backwards compatibility over security. This would have caused quite the stir in the 90’s Linux community.
Security is meaningless if your system doesn't work. Makes sense to me. Otherwise your car would be forbidden from starting after an upgrade, because you might run over someone tonight. It's a trade-off.
They're choosing backwards compatibility over bad security (killing everything instead of reporting and fixing bugs). Applying duct tape everywhere instead of fixing things is hardly desirable.
And as a side note, backwards compatibility is what made Windows succeed.
Do you want your system to kernel panic every time the security code gets a false positive about a problem? The security code is just as likely to have bugs as the software it is securing.
This is the most insightful Linus writeup yet. His others are good too, but this just hits the spot. Great!
Funny note: this post could have been textbook material. At the end he even says please. The only thing that breaks it is the reference to touching oneself :)
No security person is likely to argue that DoS attacks aren't a security issue; they get CVEs all the time! A patch that denies the user the ability to use the service introduces a security issue. Bad usability IS a DoS attack.
With respect to https://lkml.org/lkml/2017/11/21/356 and this thread I just want to get #CVEs on all the things so we can figure out the scope of the problem, and then look at what we need to do to "fix" "it" (assuming "it" is a real problem, dunno without data).
Lots of drivers on Windows, OS X and Linux run in kernel space simply because kernel-to-user-and-back context switches are expensive and so kill performance.
I believe the exceptions are printer and scanner drivers (these run in user-space CUPS in OS X/Linux), some filesystem drivers (basically, FUSE-backed) and cheap-ish USB drivers.
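To make the "cheap-ish USB drivers" case concrete, here is a hedged sketch of a user-space driver using libusb (the vendor/product IDs and the endpoint are hypothetical); the kernel only provides the generic USB plumbing, while the driver logic lives in an ordinary process:

```c
#include <stdio.h>
#include <libusb-1.0/libusb.h>

/* Hypothetical device: substitute your gadget's real vendor/product IDs. */
#define VENDOR_ID  0x1234
#define PRODUCT_ID 0x5678
#define EP_IN      0x81   /* bulk IN endpoint 1 - a device-specific assumption */

int main(void)
{
    libusb_context *ctx = NULL;
    if (libusb_init(&ctx) != 0)
        return 1;

    libusb_device_handle *dev =
        libusb_open_device_with_vid_pid(ctx, VENDOR_ID, PRODUCT_ID);
    if (!dev) {
        fprintf(stderr, "device not found\n");
        libusb_exit(ctx);
        return 1;
    }

    if (libusb_claim_interface(dev, 0) == 0) {
        unsigned char buf[64];
        int got = 0;
        /* A bug here crashes this process, not the kernel. */
        if (libusb_bulk_transfer(dev, EP_IN, buf, (int)sizeof buf, &got, 1000) == 0)
            printf("read %d bytes\n", got);
        libusb_release_interface(dev, 0);
    }

    libusb_close(dev);
    libusb_exit(ctx);
    return 0;
}
```

The cost mentioned above is visible here: every transfer is a round trip through the kernel's generic USB layer and back, which is fine for a label printer but a poor fit for, say, a GPU.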
I get the logic behind why it is done like that. I'm just wondering, as you said, whether it is possible to push at least the most bug-prone and exploitable ones to user space.
I don't see how you can convert a kernel-space driver to a user-space one without significant rewriting, and in some cases it may not be possible at all.
What about some abstraction/interfacing layer/driver that would take care of exposing some kernel functionality an average driver needs and provide additional validation?
The "most bug-prone an exploitable" dimension is a bad one. There's probably some correlation, but it is not a good way to look at the differences.
You can push into userspace the software that work some data into some low level data. You can't push into userspace the IO of that low level data to the hardware. If your driver is mostly interpreting complex data before IO, you can push most of it into userspace, but if it is really doing IO (or calculations are interspersed with IO), you can't.
Since WDDM, a good portion of the display driver in Windows is in user space, as are sound and network drivers. This is how you can upgrade your graphics driver without rebooting or even losing your open windows, and why network and sound driver crashes don't usually bluescreen. It is probably the biggest reason why Windows doesn't crash anywhere near as often as it used to.
> Because the primary focus should be "debugging". The primary focus should be "let's make sure the kernel released in a year is better than the one released today".
Is he starting to sound like illumos engineers or what? Better late than never, but it took him long enough.
Do you want to know why your (random big company) does not do software development that well? When was the last time you saw an email like that from the chairman of the board to all employees? With "please" in it, and long-winded explanations?
This is what I was thinking while reading. Linus has been thinking carefully about this for a long time. He has taken this piece of software and nurtured it for years. He has protected it. He has made sure that it is coherent and maintainable. He has put forward a set of guiding principles.
I haven't seen this in the companies I have worked for; middle and upper management don't have any insight into the products they are building. They just care about dates and getting projects done. There is no insight. There is no long-term thinking. There is no awareness of the technical debt that is building up. Their only solution is to throw more money (and people) at the projects.
I 100% agree with Linus here, buuuut Linus doesn't have to make money for his company, and isn't personally responsible for the salaries of the people in the company. He doesn't have to sit in meetings and explain why sales are down 10% this quarter, or keep assuring investors that their money is safe. Those upper-management people do solve problems; it's just that they're solving them in an environment that isn't based on logic and is fundamentally unfair.
The first thing I did when starting to read Linus' answer was to Ctrl-F for bad words. I am really happy he's moved on from that teenage-angst, colorful name-calling era.