Sometimes it actually is a kernel bug: bind() in Linux 6.0.16 (utcc.utoronto.ca)
162 points by zdw on Jan 13, 2023 | 75 comments



Allegedly this was the patch introducing the bug: https://lore.kernel.org/stable/20221228144337.512799851@linu...

It seems to me that for a long time now stable releases haven't been trying very hard to follow the stated policy [1], in particular the parts that say

> It must fix a real bug that bothers people (not a, "This could be a problem..." type thing).

and

> It must fix a problem that causes a build error ([...]), an oops, a hang, data corruption, a real security issue, or some "oh, that’s not good" issue. In short, something critical.

[1]: https://docs.kernel.org/process/stable-kernel-rules.html


It's a long-standing issue/debate at this point. The maintainers of long-term stable kernels have for some years now used scripts to automatically pick as many appropriate-looking stand-alone patches as possible, looking at what's merged into the tip of Linus' tree (usually to find fixes targeting rc(N) kernels). Since they do the work, they get to decide, but many others have found this an inappropriately regression-prone way to manage longterm-stable kernels. If you want all the latest fixes and improvements, you can use the latest kernel. If you use a longterm-stable kernel, it's because you want less risk from unrelated or unmotivated changes; you want only the specific fixes found to be most needed.

Well, that's all my interpretation :) For broader background, see https://lwn.net/Articles/863505/


I have no strong opinion on what the policy should be, but I think there's no excuse for allowing the stated policy to get so far out of sync with the actual policy.

That is, I think the stable-release maintainers should update Documentation/process/stable-kernel-rules.rst so that it tells the truth.

(I think this is about normal stable kernels, not 'longterm' ones. I don't think 6.0 was expected to become the next LTS release.)


Yeah, good point. In any case I would expect the "most recent stable" branch to get _all_ the "potential fixes" (that would otherwise only be available in RC releases).


You know, this is a really good point.

The last three 'stable' releases have contained an annoying number of refactors and fixes for amdgpu in particular

Play with PowerPlay tables in staging, I haven't been able to upgrade


I can't edit now but a note for posterity... when I say 'staging' I actually suggest mainline or linux-next, whichever is actually more appropriate

Stable seems an odd place /shrug

Those 'fixes' may have fixed what they looked at, but I'm still broken. Feels more like rearranging chairs than actually fixing


That's an absolutely awful policy!

It should fix an actual problem, even if not found until now.


Why do you think fixes for minor problems should be backported to stable kernels?


Because many times minor problems are not minor at all.

Test the damn thing!


Is this the sort of thing you feel is worth someone else's time and effort, or do you think it's meaningful enough to consume some of your own? If the latter you can offer to do it, if the former you can pound sand.


The issue isn't effort here, but that each backported fix risks breaking something else. People who want all the latest fixes can choose the latest kernel, instead of relying on backports to stable kernels.


> it's never a compiler bug

I wish this were true. It would have made my job of writing compilers so much easier.


I think you misunderstand. It's never a bug in the compiler that compiled your compiler.


Out of all the compiler bugs I have filed reports against over the years, only one was fixed. And the fix was applied only to the "trunk" branch, not any of the other branches still being "maintained".

I learned that GCC has a policy of not fixing performance regressions on any branch but the development branch.

Back in the late '80s we spent fully half of each working day tracking down the trigger and workaround for cfront crashes. Kids today have it easy.


“There's a common saying and rule of thumb in programming (possibly originating in the C world) that it's never a compiler bug, it's going to be a bug in your code even if it looks crazy or impossible.”

I say, it’s “rarely a compiler bug.” When teaching new developers, I remind them to assume the bug is in the code they just wrote before jumping to the bug being a compiler, framework or OS issue. It’s just a matter of probabilities.


In my ~25 years of experience, I've encountered exactly one compiler bug. It cost me a full week of debugging. Unfortunately, it happened during my very first year as a developer, and that instilled sub-optimal debugging practices (i.e. "is that the compiler's fault again?").

At the end of the day, you want to optimise for debugging time. There's the probability that it's a compiler bug vs. a developer bug (respectively very low and very high), and the time it takes to rule each out. It's of course best NOT to start by investigating a possible compiler bug.


I find several every year. Most people simply don't see compiler bugs because they don't test their compilers extensively enough (and most of the time don't need to). By testing I mean they must build, run the generated binaries, and compare the output.


That sounds very weird, unless you:

1) Work with an esoteric environment that has relatively few users.

2) Have specific objectives that require a lot of edge-case testing or ridiculously thorough fuzzing of the binary.

3) Go out looking for them by crafting nifty things that aren't much used.

Otherwise it would imply large companies (Google-scale) experience thousands of compiler bugs per year...


They most likely do (just look at the gcc bugzilla).

I find them because of a combination of:

- a very large codebase

- compiling with optimizations (O3) and targeting recent archs

- yearly compiler upgrades

- an extremely extensive testsuite

I have found all kinds of bugs (frontend, middle, backend) - the codegen ones tend to be very nasty to diagnose.


> Otherwise it would imply large companies (Google-scale) experience thousands of compiler bugs per year...

This seems likely to be true, from my experience.

Fortunately, most compiler bugs I run into (mostly with gcc and llvm) are not code generation bugs (which can eat months of debugging), but just segfaults / rejecting correct code / other broken stuff.


I also suspect most programmers spend most of their time staying more or less on the “straight and narrow”. There’s lots of obscure features in most languages (especially in old languages like C++) that not many programmers use.

These features are much more likely to have buggy edge cases in the compiler. And only a small select group of programmers will run into those bugs over and over again.

E.g. complex template metaprogramming, which had known broken corner cases in all C++ compilers up until a decade or so ago.


If you don't hit compiler bugs, it's mostly because you're not updating your compiler.

Large companies with compiler teams want to use them, so they do update their compiler, so they will find bugs in it.

Btw, what do you think this prints with clang? (Whether the answer is a compiler bug is debatable…)

   printf("%#x\n", 1 << 32);


Assuming int is 32 bits wide, `1 << 32` invokes UB.


Indeed it does.

(Spoiler: it prints uninitialized stack memory. A bit closer to demons flying out your nose than you'd expect!)


Fun. Basically, it decides that (due to UB) it can simply not materialize the argument and carry on.

On Darwin/x86_64, this actually means that it prints out bits from an "uninitialized" register, specifically %rsi in this case, since that's where the second argument should be (even for variadic functions).

Certainly a jarring failure mode! However, if updating your compiler causes you to run into this, you already have a bug in your code (one that clang -- arguably not forcefully enough -- already warns you of).
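
For completeness, here's a small sketch of the well-defined alternative (assuming a 32-bit int; the exact warning flag and wording vary by compiler and version):

    #include <stdio.h>

    int main(void)
    {
        /* UB: shifting a 32-bit int by 32 bits. Clang and GCC flag the
         * constant case (e.g. via -Wshift-count-overflow) but still
         * compile it, hence the surprising output discussed above. */
        /* printf("%#x\n", 1 << 32); */

        /* Well-defined: widen before shifting and match the format specifier. */
        printf("%#llx\n", 1ULL << 32);   /* prints 0x100000000 */
        return 0;
    }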


This is exactly when I find most of my compiler bugs: every time we upgrade.


Yeah I was debugging an issue that was occurring in a Kubernetes environment last night... kept wanting to blame the out of date Istio version. I just could not trust it and that lack of trust stopped me from seeing the whitespace error that was in front of me the whole time...


I think it really depends on the domain you are working in. Some stuff is better tested than others. In my 5-year career I have found bugs (confirmed by the compiler authors) in gfortran, xlf and pgfortran. So yes, bugs in compilers are rare, but sometimes you just keep running into them.


Those are quite niche languages though; most people use compilers which are much more widely used.


Clang support for AVX-512 intrinsics is pretty broken


They say that because people who don't yet have enough knowledge to even check whether it's a compiler/tooling bug will blame the compiler/tooling.


In C it's usually undefined behaviour that causes the compiler to produce code that doesn't behave as expected. So not actually a compiler bug, but indistinguishable from one until you identify the UB.
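
A classic sketch of how that plays out (assuming a typical optimizing compiler that exploits signed-overflow UB; results vary by compiler and flags):

    #include <limits.h>
    #include <stdio.h>

    /* Signed overflow is UB, so an optimizing compiler may assume
     * "x + 1 < x" can never be true and fold the check to 0. The result
     * then differs between -O0 and -O2, which looks exactly like a
     * compiler bug until you spot the UB. */
    static int will_overflow(int x)
    {
        return x + 1 < x;   /* UB when x == INT_MAX */
    }

    int main(void)
    {
        /* Commonly prints 1 at -O0 and 0 at -O2 with gcc/clang. */
        printf("%d\n", will_overflow(INT_MAX));
        return 0;
    }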


I’ve found two compiler bugs, both in rustc. One was a hang, so it was obvious. For the other, the people chiming in and I pretty quickly figured out it was a codegen problem; it turned out to be due to enabling noalias before it was ready. Just cut swaths of code until you get it down to no unsafe. It was a pretty good experience.


Delta debugging and C-Reduce are your friends there. I dearly miss creduce in my language of choice... Not that compiler bugs are that frequent, but when generating so much weird code I tend to trigger those bugs more frequently.


I've found that as I get increasingly experienced/skilled, the ratio of the problem being in my code vs. a compiler, framework or OS issue decreases.

Not that most problems still aren't in my code, but I've increasingly run into framework/library level issues and very rarely compiler issues. Can't say I've triggered any kernel issues yet.


Doing zero troubleshooting and immediately jumping to the conclusion that it's someone else's fault is frustratingly common. If you want to make a network engineer reflexively reach for a bottle of liquor, DM them on Slack "hey is the network down my app/service/database/whatever is unreachable".

Networks can and do go down of course. But in my experience, the vast majority of these issues are actually the result of something extremely mundane like a typo in a hostname. I've resolved an incredible number of issues over the years by simply reading the error message someone sent to me and asking them to check the exact thing that the message says is wrong.


https://bugs.gentoo.org/724314

Compiler bug or CPU issue, take your pick :)


Rarely, yes. Never? Haha, not in my experience. I've seen bugs in EVERYTHING, and let's not forget hardware bugs like the Pentium math (FDIV) bug.


Not the first annoying networking bug introduced in 6.0.16 — if you have a CIFS mount, you get a kernel panic. There's a confirmed bug report about it on Red Hat's and the kernel's bugzilla. They don't seem to be related, so it's just coincidence.


How is testing done in kernels? Is there unit testing, integration, end to end? I'm unfamiliar, but curious.


Not in the Linux kernel. There, it has traditionally been done by the developer sticking lots of printk statements into their code until they are happy with the outputs, and then removing most of them. Make of that what you will.


LOL that is so false that it's not even worth correcting.


Actually could you correct it? I don't know anything about the Linux kernel so I'm curious.


I'll correct myself. That was how it was done circa 2000-2005, but my info was way out of date because I haven't followed kernel development much for a while. Here's some stuff about how it's done now. https://www.kernel.org/doc/html/latest/dev-tools/testing-ove...

Basically, there is a system called KUnit for writing white-box unit tests, there is code-coverage tooling to determine coverage as you would expect, and then there are systems for static analysis and for verifying assertions.
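
For a flavour of what that looks like, here's a minimal KUnit sketch (the add() function under test is hypothetical):

    #include <kunit/test.h>

    /* Hypothetical function under test. */
    static int add(int a, int b)
    {
        return a + b;
    }

    static void add_test_basic(struct kunit *test)
    {
        KUNIT_EXPECT_EQ(test, 3, add(1, 2));
        KUNIT_EXPECT_EQ(test, 0, add(-2, 2));
    }

    static struct kunit_case add_test_cases[] = {
        KUNIT_CASE(add_test_basic),
        {}
    };

    static struct kunit_suite add_test_suite = {
        .name = "add-example",
        .test_cases = add_test_cases,
    };
    kunit_test_suite(add_test_suite);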


This is so wrong!!


As in that's not how it's done, or that's not how it should be done?


See the following

https://kunit.dev/ - unit tests for kernel

https://docs.kernel.org/dev-tools/testing-overview.html - entire testing overview.


There seems to be a missing piece here. The change that created this includes no tests. It appears to be completely untested.


As you can see from the patch that introduced the bug, the kernel is severely under-tested and has little to no testing culture among the core contributors. The patch changes logic but no tests, indicating that the changed code has insufficient test coverage.



It's encouraging to see so many new volunteers signing up to help maintain and test the stable branch!


The biggest mindset adjustment I ever had to make was moving from C++ to Node.js.

You can usually trust that select() is not broken, but select.js is deprecated, select-kitchensink.js is not compatible with your toolchain and unicyclect.ts leaks memory like a sieve


Out of curiosity I upgraded the kernel of my VPS running Debian 11 from 5.10 to 6.1. The networking became visibly slower: all TCP receive and send rates dropped to a quarter of the original or less. I had to restore the original kernel, and the network became fine again.

I don't think it is a bug in the kernel. More likely the userspace or the host system needs to be reconfigured to work well with the bleeding-edge kernel. Still, I am curious: what exactly might cause this dramatic drop?


> it's never a compiler bug

Back in the days of ISO-8859-1, you could make gcc crash by having the string literal "ä" in the source code...


How did this happen?

Which commit caused it?


Best explanation I found is here:

https://lore.kernel.org/stable/CAFsF8vL4CGFzWMb38_XviiEgxoKX...

A patch was backported to the 6.0 branch from the main branch, but one line of code was left out, leading to the buggy behavior.
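
For illustration only, here's the general shape of a userspace smoke test for bind() behaviour; this is a generic bind()-to-port-0 check, not the actual reproducer from the report (see the linked thread for that):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        socklen_t len = sizeof(addr);

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = 0;   /* ask the kernel for an ephemeral port */
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        /* On a good kernel this succeeds and getsockname() reports a
         * non-zero port; a regression shows up as an unexpected errno. */
        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
            perror("socket/bind");
            return 1;
        }
        getsockname(fd, (struct sockaddr *)&addr, &len);
        printf("bound to port %d\n", ntohs(addr.sin_port));
        close(fd);
        return 0;
    }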


I'll open by saying I'll forever be thankful for GKH... but this response kills me:

> As 6.0.y is now end-of-life, is there anything keeping you on that kernel tree?

Uh, several distributions. It wasn't EOL enough to prevent breaking it, so fix it.

Don't even technically need their input, Git and all.

I'll buy this EOL thing if they revert the change that caused this and stop releasing under 6.0. There were at least two more after this


>Uh, several distributions.

That is the kernel bug tracker, not the distributions bug tracker.

>It wasn't EOL enough to prevent breaking it, so fix it.

It wasn't EOL at the time the patch was backported. It's EOL now.

>I'll buy this EOL thing if they revert the change that caused this and stop releasing under 6.0. There were at least two more after this

Not sure what "this" in "two more after this" is, but there have been no 6.0 releases since it was EOLed.


> That is the kernel bug tracker, not the distributions bug tracker.

The point is that the contributor's reasoning for being on the tree is irrelevant. Like you said, they just made it EOL.

Distributions are the continuous/constant answer as to why countless people will be on it. This isn't an ancient release, something from the grave.

Is the expectation, then, that distributions would have to patch out the regression - or take on a more major upgrade (6.1 / 6.2), likely breaking something else?

Neither of these are particularly tenable. I'm glad GKH was willing to accept further changes to make it correct, but reverting is also applicable.

Breaking something, calling it EOL, and not fixing it is closer to dead than end of life.


>Is the expectation, then, that distributions would have to patch out the regression - or take on a more major upgrade (6.1 / 6.2), likely breaking something else?

Correct.

>Neither of these are particularly tenable.

Yes they are.

>Breaking something, calling it EOL, and not fixing it is closer to dead than end of life.

You're awfully confident about how things should work, even though you don't understand how they already work.


Distributions have their own release strategies to pick up what is now in the 6.1/6.2 trees. Some were cut at a bad time and are treating the now-EOL stable as longer-term.

This is really just a pedantic criticism of the handling of 6.0, and of the 'ignorance' (I hate the connotation of the term) about why people don't run the latest.

I'm not asking them to bend over backwards, here.

Things are going more or less the way I want, 6.0 will get fixed [edit: upstream]. Please don't take this the wrong way.


This is soooo BASIC!!

Why aren't there processes in place, either manual or automated, to check for things like this?

Aren't there some tests in place to check for such basic functionality errors?

Doesn't the kernel development process mandate such facilities? I'm sure the NSA, Unit 8200 and GCHQ have tests like this in place but don't share their findings.

Is it a matter of funding or leadership philosophy and priorities?


IIRC the Linux maintainers view themselves as providing a kernel for distros to bundle.

You can get a kernel from Red Hat that has been through Red Hat's release process. Red Hat has their own test suite/labs and will also pay attention to test results from elsewhere - including Fedora, their evergreen distro for putting new software into the wild ahead of its incorporation into Red Hat Enterprise Linux.

Substitute the distro of your choice.


Wondering whether there's a company out there that does kernel testing as a service. Give it your kernel config, some tunings of basic services, possibly your distro, and have an automatic test suite run for your subset: cyclictest, a syzkaller instance, some of your own stress-test apps. Might be useful in a world of Firecracker/microVMs with smaller kernel surfaces?


yeah right? doesn't the kernel have a test suite?


Every kernel subsystem has its own testsuite. Running all of them would require hundreds of different pieces of hardware, so it's not really possible for a single release manager to do so.

For Linus's releases this is easily solved by progressively slowing down the pace of development towards a release, so that cross-subsystem issues, where maintainer A breaks maintainer B's subsystem, become progressively less likely over the two months of the release cycle.

For stable releases this is much harder to do because of the short cycle. The stable branches in the end are a mostly automated collection of patches based on both maintainer input and the output of a machine learning model. The quality of stable branches is generally pretty good, or screwups such as this one would not make headlines; but that's more a result of the discipline of mainline kernel development than a virtue of the stable kernel release process.


> Running all of them would require hundreds of different pieces of hardware, so it's not really possible for a single release manager to do so.

The issue is that this bug is not hardware related. It's a pure software issue.

Hardware bugs are an entirely different kettle of fish.

BTW is that bonzini of GNU Smalltalk fame?


Yes, I agree that _this_ issue could have been found. But the parent was asking more generally whether "the kernel has a test suite", and the tests for both hardware-dependent (drivers, profiling, virtualization, etc.) and hardware-independent (filesystem, networking, etc.) aspects of the kernel are distributed across multiple testsuites.

The stable kernels' pre-release queue is posted periodically to the mailing list, and subsystem maintainers _could_ run it through their tests, but honestly I don't believe that many do. Personally I prefer to err on the other side: unless something was explicitly chosen for stable kernel inclusion and applies perfectly, I ask the stable kernel maintainers not to bother including the commit. This approach also has disadvantages of course, so they still run their machine-learning thingy and I approve/reject each commit that the bot flags for inclusion.

> BTW is that bonzini of GNU Smalltalk fame?

Yes it's me. :) Did we meet?


> Yes it's me. :) Did we meet?

I'm a fan of Smalltalk and I used to follow your development of GNU Smalltalk.

What's happened to it? It seems to have fallen by the wayside.


I got a job and a family. :)


So that might suggest that it's actually better to just track the latest version, rather than worrying about backports?


In my (limited) experience, the only reason to use backports is that you have closed-source kernel modules which you can't update (that of course ends up covering most Android phones and many SoCs).


There's also the official support of VMM things like Firecracker, which officially supports only 5.10 and maybe the latest, and otherwise it's "don't send bugs".


Distros standardize on a version not because it is more stable but because tooling (which might include 3rd-party kernel modules) can then rely on working with that version without a recompile.

If you don't have that constraint, then yeah, there's not much reason.


The vast majority of the time fixes like this that are being backported are straightforward fixes for bugs (security or otherwise) that require very little manual conflict resolution, especially if the fix is just being backported one or two kernel releases. The developer can often just cherry-pick the commit into a few recent release branches and most of the time git will just automatically do the merge correctly, or if there is a manual merge conflict it's something really simple. In fact, if there's a complicated merge conflict often the change won't be backported at all unless the bug is actually serious enough to warrant X hours of someone's time to do it and get code review etc. Most of the time this process works correctly, but obviously there's room for error and mistakes can happen.

There's a tradeoff between the risk of running an older kernel that has known bugs, upgrading to the latest new kernel which has bug fixes but may introduce new bugs, and getting backports for known bugs to your known working kernel. Most of the time the last option is reasonable but it definitely depends on your use case and what you're optimizing for.



