It seems to me that for a long time now stable releases haven't been trying very hard to follow the stated policy [1], in particular the parts that say
> It must fix a real bug that bothers people (not a, "This could be a problem..." type thing).
and
> It must fix a problem that causes a build error ([...]), an oops, a hang, data corruption, a real security issue, or some "oh, that’s not good" issue. In short, something critical.
It's a long-standing issue/debate at this point. The maintainers of long-term stable kernels have for some years now used scripts to automatically pick as many appropriate-looking stand-alone patches as possible from what's merged at the tip of Linus' tree (usually fixes targeting the -rc kernels). Since they do the work they get to decide, but many others have found this an inappropriately regression-prone way to manage longterm-stable kernels. If you want all the latest fixes and improvements, you can use the latest kernel. If you use a longterm-stable kernel, it's because you want less risk from unrelated or unmotivated changes; you want only the specific fixes found to be most needed.
I have no strong opinion on what the policy should be, but I think there's no excuse for allowing the stated policy to get so far out of sync with the actual policy.
That is, I think the stable-release maintainers should update Documentation/process/stable-kernel-rules.rst so that it tells the truth.
(I think this is about normal stable kernels, not 'longterm' ones. I don't think 6.0 was expected to become the next LTS release.)
Yeah, good point. In any case, I would expect the "most recent stable" branch to get _all_ the "potential fixes" (which would otherwise only be available in -rc releases).
Is this the sort of thing you feel is worth someone else's time and effort, or do you think it's meaningful enough to consume some of your own? If the latter you can offer to do it, if the former you can pound sand.
The issue isn't effort here, but that each backported fix risks breaking something else. People who want all the latest fixes can choose the latest kernel, instead of relying on backports to stable kernels.
Out of all the compiler bugs I have filed reports against over the years, only one was fixed. And the fix was applied only to the "trunk" branch, not to any of the others still "maintained".
I learned that GCC has a policy of not fixing performance regressions on any branch but the development one.
Back in the late '80s we spent fully half of each working day tracking down the trigger and workaround for cfront crashes. Kids today have it easy.
> There's a common saying and rule of thumb in programming (possibly originating in the C world) that it's never a compiler bug, it's going to be a bug in your code even if it looks crazy or impossible.
I say it's "rarely a compiler bug." When teaching new developers I remind them to assume the bug is in the code they just wrote before jumping to the conclusion that it's a compiler, framework, or OS issue. It's just a matter of probabilities.
In my ~25 years of experience, I've encountered exactly one compiler bug. It cost me a full week of debugging. Unfortunately, it happened during my very first year as a developer, and that instilled some sub-optimal debugging habits (i.e., "is that the compiler's fault again?").
At the end of the day, you want to optimise for debugging time. Weigh the probability that it's a compiler bug versus a bug in your own code (very low versus very high, respectively) against the time it takes to rule each out. It's of course best NOT to start by investigating a possible compiler bug.
I find several every year. Most people simply don't see compiler bugs because they don't test their compilers extensively enough (and most of the time they don't need to); by testing I mean building, running the generated binaries, and comparing the output.
They most likely do (just look at the GCC Bugzilla).
I find them because of a combination of :
- a very large codebase
- compiling with optimizations (-O3) and targeting recent archs
- yearly compiler upgrades
- an extremely extensive testsuite
I have found all kinds of bugs (frontend, middle-end, backend) - the codegen ones tend to be very nasty to diagnose.
> Otherwise it would imply large companies (Google-scale) experience thousands of compiler bugs per year...
This seems likely to be true, from my experience.
Fortunately, most compiler bugs I run into (mostly with gcc and llvm) are not code generation bugs (which can eat months of debugging), but just segfaults / rejecting correct code / other broken stuff.
I also suspect most programmers spend most of their time staying more or less on the "straight and narrow". There are lots of obscure features in most languages (especially in old languages like C++) that not many programmers use.
These features are much more likely to have buggy edge cases in the compiler. And only a small select group of programmers will run into those bugs over and over again.
E.g., complex template metaprogramming, which had known broken corner cases in all C++ compilers up until a decade or so ago.
Fun. Basically, it decides that (due to UB) it can simply not materialize the argument and carry on.
On Darwin/x86_64, this actually means that it prints out bits from an "uninitialized" register, specifically %rsi in this case, since that's where the second argument should be (even for variadic functions).
Certainly a jarring failure mode! However, if updating your compiler causes you to run into this, you already have a bug in your code (one that clang -- arguably not forcefully enough -- already warns you of).
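For anyone who wants to see it concretely, here's a minimal toy sketch of that failure mode (my own illustration, not code from the article): the format string asks for an argument that is never passed, which is undefined behaviour, and on x86-64 the missing second argument would have travelled in %rsi.

```c
#include <stdio.h>

int main(void)
{
    /* The format string requests an int that is never passed:
     * undefined behaviour. clang's -Wformat warns about this.
     * On x86-64 the second argument of a variadic call is passed
     * in %rsi, so whatever is left in that register may get printed. */
    printf("value: %d\n");
    return 0;
}
```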
Yeah, I was debugging an issue that was occurring in a Kubernetes environment last night... I kept wanting to blame the out-of-date Istio version. I just could not trust it, and that lack of trust stopped me from seeing the whitespace error that was in front of me the whole time...
I think it really depends on the domain you are working in. Some stuff is better tested than others. In my 5-year career I have found bugs (confirmed by the compiler authors) in gfortran, xlf and pgfortran. So yes, bugs in compilers are rare, but sometimes you just keep running into them.
In C it's usually undefined behaviour that causes the compiler to produce code that doesn't behave as expected. So not actually a compiler bug, but indistinguishable from one until you identify the UB.
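A classic toy example of what that looks like (my own illustration, assuming a typical optimizing compiler): an "overflow check" that relies on signed overflow, which the compiler is entitled to assume never happens.

```c
#include <limits.h>
#include <stdio.h>

/* Looks like an overflow check, but signed overflow is undefined
 * behaviour, so an optimizer may assume x + 1 > x always holds and
 * compile this down to "return 0;". */
static int will_wrap(int x)
{
    return x + 1 < x;
}

int main(void)
{
    /* Typically prints 1 at -O0 and 0 at -O2; no compiler bug involved. */
    printf("%d\n", will_wrap(INT_MAX));
    return 0;
}
```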
I’ve found two compiler bugs, both in rustc. One was a hang, so obvious. But the other, I and the people chiming in pretty quickly figured out it was a codegen problem, as it turned out due to enabling noalias before it was ready. Just cut swaths of code until you get it down to no unsafe. It was a pretty good experience.
Delta debugging and C-Reduce are your friends there. I dearly miss creduce in my language of choice... Not that compiler bugs are that frequent, but when generating so much weird code I tend to trigger them more frequently.
I've found that as I get more experienced and skilled, the ratio of problems in my own code versus compiler, framework, or OS issues decreases.
Most problems are still in my code, but I've increasingly run into framework/library-level issues and, very rarely, compiler issues. I can't say I've triggered any kernel issues yet.
Doing zero troubleshooting and immediately jumping to the conclusion that it's someone else's fault is frustratingly common. If you want to make a network engineer reflexively reach for a bottle of liquor, DM them on Slack "hey is the network down my app/service/database/whatever is unreachable".
Networks can and do go down of course. But in my experience, the vast majority of these issues are actually the result of something extremely mundane like a typo in a hostname. I've resolved an incredible number of issues over the years by simply reading the error message someone sent to me and asking them to check the exact thing that the message says is wrong.
Not the first annoying networking bug introduced in 6.0.16: if you have a CIFS mount, you get a kernel panic. There's a confirmed bug report about it on Red Hat's and the kernel's bugzilla. The two bugs don't seem to be related, so it's just a coincidence.
Not in the Linux kernel. In the Linux kernel it has traditionally been done by the developer sticking lots of printk statements into their code until they are happy with the outputs, and then removing most of them. Make of that what you will.
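For the uninitiated, a rough sketch of that workflow (the function and its arguments here are entirely made up for illustration): sprinkle printk calls through the suspect path, rebuild, watch dmesg, and delete most of them once the bug is found.

```c
#include <linux/errno.h>
#include <linux/printk.h>

/* Hypothetical helper under investigation; temporary printk tracing
 * added while chasing a bug, to be stripped out afterwards. */
static int frob_thing(int id, int state)
{
	printk(KERN_DEBUG "frob_thing: enter id=%d state=%d\n", id, state);

	if (state < 0) {
		printk(KERN_DEBUG "frob_thing: bad state %d, bailing\n", state);
		return -EINVAL;
	}

	printk(KERN_DEBUG "frob_thing: done id=%d\n", id);
	return 0;
}
```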
I'll correct myself. That was how it was done circa 2000-2005, but my info was way out of date because I haven't followed kernel development much for a while. Here's some stuff about how it's done now. https://www.kernel.org/doc/html/latest/dev-tools/testing-ove...
Basically, there is a system called KUnit for writing white-box unit tests, there is code coverage tooling to measure coverage as you would expect, and then there are systems for static analysis and for verifying assertions.
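For a sense of what that looks like, here is a minimal KUnit test along the lines of the upstream documentation; the add() function under test is made up so the example is self-contained.

```c
#include <kunit/test.h>
#include <linux/module.h>

/* Hypothetical function under test, defined here only to keep the
 * example self-contained. */
static int add(int a, int b)
{
	return a + b;
}

static void add_test_basic(struct kunit *test)
{
	KUNIT_EXPECT_EQ(test, 3, add(1, 2));
	KUNIT_EXPECT_EQ(test, 0, add(-2, 2));
}

static struct kunit_case add_test_cases[] = {
	KUNIT_CASE(add_test_basic),
	{}
};

static struct kunit_suite add_test_suite = {
	.name = "add-example",
	.test_cases = add_test_cases,
};
kunit_test_suite(add_test_suite);

MODULE_LICENSE("GPL");
```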
As you can see from the patch that introduced the bug, the kernel is severely under-tested and has little to no testing culture among the core contributors. The patch changes logic but adds no tests, indicating that the changed code has insufficient test coverage.
The biggest mindset adjustment I ever had to make was moving from C++ to Node.js.
You can usually trust that select() is not broken, but select.js is deprecated, select-kitchensink.js is not compatible with your toolchain, and unicyclect.ts leaks memory like a sieve.
Out of curiosity I upgraded the kernel of my VPS running Debian 11 from 5.10 to 6.1. The networking became visibly slower: TCP receive and send rates dropped to a quarter of the original, if not less. I had to restore the original kernel, and the network was fine again.
I don't think it is a bug in the kernel. More likely, userspace or the host system needs to be reconfigured to work well with the bleeding-edge kernel. Still, I am curious: what exactly might cause such a dramatic drop?
> That is the kernel bug tracker, not the distributions bug tracker.
The point is that the contributors' reasoning for being on the tree is irrelevant. Like you said, they just made it EOL.
Distributions are the constant answer as to why countless people will be on it. This isn't an ancient release, something dug up from the grave.
Is the expectation, then, that distributions would have to patch out the regression - or take on a more major upgrade (6.1 / 6.2), likely breaking something else?
Neither of these is particularly tenable. I'm glad GKH was willing to accept further changes to make it correct, but reverting is also an option.
Breaking something, calling it EOL, and not fixing it is closer to dead than end of life.
>Is the expectation, then, that distributions would have to patch out the regression - or take on a more major upgrade (6.1 / 6.2), likely breaking something else?
Correct.
>Neither of these are particularly tenable.
Yes they are.
>Breaking something, calling it EOL, and not fixing it is closer to dead than end of life.
You're awfully confident about how things should work, even though you don't understand how they already work.
Distributions have their own release strategies for picking up what is now in the 6.1/6.2 trees. Some were cut at a bad time and are treating the now-EOL stable series as longer-term.
This is really just a pedantic criticism of the handling of 6.0, and of the 'ignorance' (I hate the connotation of the term) about why people don't run the latest.
I'm not asking them to bend over backwards, here.
Things are going more or less the way I want, 6.0 will get fixed [edit: upstream]. Please don't take this the wrong way.
Why aren't there processes in place, either manual or automated, to check for things like this?
Aren't there some tests in place to check for such basic functionality errors?
Doesn't the kernel development process mandate such facilities? I'm sure the NSA, Unit 8200 and GCHQ have tests like this in place but don't share their findings.
Is it a matter of funding or leadership philosophy and priorities?
IIRC the Linux maintainers view themselves as providing a kernel for distros to bundle.
You can get a kernel from Red Hat that has been through Red Hat's release process. Red Hat has their own test suite/labs and will also pay attention to test results from elsewhere - including Fedora, their evergreen distro for putting new software into the wild ahead of its incorporation into Red Hat Enterprise Linux.
I wonder whether there's a company out there that does kernel testing as a service. Give it your kernel config, some tunings of basic services, possibly your distro, and have an automated testsuite run for your subset: cyclictest, a syzkaller instance, some of your own stress-test apps. Might be useful in a world of Firecracker/microVMs with smaller kernel surfaces?
Every kernel subsystem has its own testsuite. Running all of them would require hundreds of different pieces of hardware, so it's not really possible for a single release manager to do so.
For Linus's releases this is easily solved by progressively slowing down the pace of development towards a release, so that cross-subsystem issues where maintainer A breaks maintainer B's subsystem become progressively less likely over the two months of the release cycle.
For stable releases this is much harder to do because of the short cycle. The stable branches in the end are a mostly automated collection of patches based on both maintainer input and the output of a machine learning model. The quality of stable branches is generally pretty good, or screwups such as this one would not make a headline; but that's more a result of the discipline of mainline kernel development than a virtue of the stable kernel release process.
Yes, I agree that _this_ issue could have been found. But the parent was talking more in general of "doesn't the kernel have a test suite", and both hardware-dependent (drivers, profiling, virtualization, etc.) and hardware-independent (filesystem, networking, etc.) aspects of the kernel are distributed across multiple testsuites.
The stable kernels' pre-release queue is posted periodically to the mailing list and subsystem maintainers _could_ run it through their tests, but honestly I don't believe that many do. Personally I prefer to err on the other side; unless something was explicitly chosen for stable kernel inclusion and applies perfectly, I ask the stable kernel maintainers not to bother including the commit. This approach also has disadvantages of course, so they still run their machine learning thingy and I approve/reject each commit that the bot flags for inclusion.
In my (limited) experience, the only reason to use backports is that you have closed-source kernel modules which you can't update (that of course ends up covering most Android phones and many SoCs).
Distros standardize on a version not because it is more stable but because tooling (which might include third-party kernel modules) can then rely on working with that version without a recompile.
If you don't have that constraint yeah, not much reason.
The vast majority of the time fixes like this that are being backported are straightforward fixes for bugs (security or otherwise) that require very little manual conflict resolution, especially if the fix is just being backported one or two kernel releases. The developer can often just cherry-pick the commit into a few recent release branches and most of the time git will just automatically do the merge correctly, or if there is a manual merge conflict it's something really simple. In fact, if there's a complicated merge conflict often the change won't be backported at all unless the bug is actually serious enough to warrant X hours of someone's time to do it and get code review etc. Most of the time this process works correctly, but obviously there's room for error and mistakes can happen.
There's a tradeoff between the risk of running an older kernel that has known bugs, upgrading to the latest new kernel which has bug fixes but may introduce new bugs, and getting backports for known bugs to your known working kernel. Most of the time the last option is reasonable but it definitely depends on your use case and what you're optimizing for.
[1]: https://docs.kernel.org/process/stable-kernel-rules.html