Sometimes it actually is a kernel bug: bind() in Linux 6.0.16 (utcc.utoronto.ca)
162 points by zdw on Jan 13, 2023 | 75 comments



Allegedly this was the patch introducing the bug: https://lore.kernel.org/stable/20221228144337.512799851@linu...

It seems to me that for a long time now stable releases haven't been trying very hard to follow the stated policy [1], in particular the parts that say

> It must fix a real bug that bothers people (not a, "This could be a problem..." type thing).

and

> It must fix a problem that causes a build error ([...]), an oops, a hang, data corruption, a real security issue, or some "oh, that’s not good" issue. In short, something critical.

[1]: https://docs.kernel.org/process/stable-kernel-rules.html


It's a long-standing issue/debate at this point. The maintainers of long-term stable kernels have for some years now used scripts to automatically pick as many appropriate-looking stand-alone patches as possible, looking at what's merged into the tip of Linus' tree (usually to find fixes targeting rc(N) kernels). Since they do the work, they get to decide, but many others have found this an inappropriately regression-prone way to manage longterm-stable kernels. If you want all the latest fixes and improvements, you can use the latest kernel. If you use a longterm-stable kernel, it's because you want less risk from unrelated or unmotivated changes; you want only the specific fixes found to be most needed.

Well, that's all my interpretation :) For broader background, see https://lwn.net/Articles/863505/


I have no strong opinion on what the policy should be, but I think there's no excuse for allowing the stated policy to get so far out of sync with the actual policy.

That is, I think the stable-release maintainers should update Documentation/process/stable-kernel-rules.rst so that it tells the truth.

(I think this is about normal stable kernels, not 'longterm' ones. I don't think 6.0 was expected to become the next LTS release.)


Yeah, good point. In any case I would expect the "most recent stable" branch to get _all_ the "potential fixes" (that would otherwise only be available in RC releases).


You know, this is a really good point.

The last three 'stable' releases have contained an annoying number of refactors and fixes for amdgpu in particular

Play with PowerPlay tables in staging, I haven't been able to upgrade


I can't edit now but a note for posterity... when I say 'staging' I actually suggest mainline or linux-next, whichever is actually more appropriate

Stable seems an odd place /shrug

Those 'fixes' may have fixed what they looked at, but I'm still broken. Feels more like rearranging chairs than actually fixing


That's an absolutely awful policy!

It should fix an actual problem, even if not found until now.


Why do you think fixes for minor problems should be backported to stable kernels?


Because many times minor problems are not minor at all.

Test the damn thing!


Is this the sort of thing you feel is worth someone else's time and effort, or do you think it's meaningful enough to consume some of your own? If the latter you can offer to do it, if the former you can pound sand.


The issue isn't effort here, but that each backported fix risks breaking something else. People who want all the latest fixes can choose the latest kernel, instead of relying on backports to stable kernels.


> it's never a compiler bug

I wish this were true. It would have made my job of writing compilers so much easier.


I think you misunderstand. It's never a bug in the compiler that compiled your compiler.


Out of all the compiler bugs I have filed reports against over the years, only one was fixed. And the fix was applied only to the "trunk" branch, not any of the other branches still being "maintained".

I learned that GCC has a policy of not fixing performance regressions on any branch but the development branch.

Back in the late '80s we spent fully half of each working day tracking down the trigger and workaround for cfront crashes. Kids today have it easy.


“There's a common saying and rule of thumb in programming (possibly originating in the C world) that it's never a compiler bug, it's going to be a bug in your code even if it looks crazy or impossible.”

I say, it’s “rarely a compiler bug.” When teaching new developers, I remind them to assume the bug is in the code they just wrote before jumping to the bug being a compiler, framework or OS issue. It’s just a matter of probabilities.


In my ~25 years of experience, I've encountered exactly one compiler bug. It cost me a full week of debugging. Unfortunately, it happened during my very first year as a developer, and that instilled sub-optimal debugging practices (i.e. "is that the compiler's fault again?").

At the end of the day, you want to optimise for debugging time. There's the probability that it's a compiler bug vs. a developer bug (respectively very low and very high), and the time it takes to rule each out. It's of course best NOT to start by investigating a possible compiler bug.


I find several every year. Most people simply don't see compiler bugs because they don't test their compilers extensively enough (and most of the time don't need to). By testing I mean they must build, run the generated binaries, and compare the output.


That sounds very weird, unless you:

1) Work with an esoteric environment that has relatively few users.

2) Have specific objectives that require a lot of edge-case testing or ridiculously thorough fuzzing of the binary.

3) Go out looking for them by crafting nifty things that aren't much used.

Otherwise it would imply large companies (Google-scale) experience thousands of compiler bugs per year...


They most likely do (just look at the gcc bugzilla).

I find them because of a combination of:

- a very large codebase

- compiling with optimizations (O3) and targeting recent archs

- yearly compiler upgrades

- an extremely extensive testsuite

I have found all kinds of bugs (frontend, middle, backend) - the codegen ones tend to be very nasty to diagnose.


> Otherwise it would imply large companies (Google-scale) experience thousands of compiler bugs per year...

This seems likely to be true, from my experience.

Fortunately, most compiler bugs I run into (mostly with gcc and llvm) are not code generation bugs (which can eat months of debugging), but just segfaults / rejecting correct code / other broken stuff.


I also suspect most programmers spend most of their time staying more or less on the “straight and narrow”. There’s lots of obscure features in most languages (especially in old languages like C++) that not many programmers use.

These features are much more likely to have buggy edge cases in the compiler. And only a small select group of programmers will run into those bugs over and over again.

E.g. complex template metaprogramming, which had known broken corner cases in all C++ compilers up until a decade or so ago.


If you don't hit compiler bugs, it's mostly because you're not updating your compiler.

Large companies with compiler teams want to use them, so they do update their compiler, so they will find bugs in it.

Btw, what do you think this prints with clang? (Whether the answer is a compiler bug is debatable…)

   printf("%#x\n", 1 << 32);


Assuming int is 32 bits wide, `1 << 32` invokes UB.


Indeed it does.

(Spoiler: it prints uninitialized stack memory. A bit closer to demons flying out your nose than you'd expect!)


Fun. Basically, it decides that (due to UB) it can simply not materialize the argument and carry on.

On Darwin/x86_64, this actually means that it prints out bits from an "uninitialized" register, specifically %rsi in this case, since that's where the second argument should be (even for variadic functions).

Certainly a jarring failure mode! However, if updating your compiler causes you to run into this, you already have a bug in your code (one that clang -- arguably not forcefully enough -- already warns you of).
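
For completeness, here's a small sketch of the well-defined alternative (assuming a 32-bit int; the exact warning flag and wording vary by compiler and version):

    #include <stdio.h>

    int main(void)
    {
        /* UB: shifting a 32-bit int by 32 bits. Clang and GCC flag the
         * constant case (e.g. via -Wshift-count-overflow) but still
         * compile it, hence the surprising output discussed above. */
        /* printf("%#x\n", 1 << 32); */

        /* Well-defined: widen before shifting and match the format specifier. */
        printf("%#llx\n", 1ULL << 32);   /* prints 0x100000000 */
        return 0;
    }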


This is exactly when I find most of my compiler bugs: every time we upgrade.


Yeah I was debugging an issue that was occurring in a Kubernetes environment last night... kept wanting to blame the out of date Istio version. I just could not trust it and that lack of trust stopped me from seeing the whitespace error that was in front of me the whole time...


I think it really depends on the domain you are working in. Some stuff is better tested than others. In my 5-year career I have found bugs (confirmed by the compiler authors) in gfortran, xlf and pgfortran. So yes, bugs in compilers are rare, but sometimes you just keep running into them.


Those are quite niche languages though; most people use compilers which are much more widely used.


Clang support for AVX-512 intrinsics is pretty broken


They say that because people who don't yet have enough knowledge to even check whether it's a compiler/tooling bug will blame the compiler/tooling.


In C it's usually undefined behaviour that causes the compiler to produce code that doesn't behave as expected. So not actually a compiler bug, but indistinguishable from one until you identify the UB.
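
A classic sketch of how that plays out (assuming a typical optimizing compiler that exploits signed-overflow UB; results vary by compiler and flags):

    #include <limits.h>
    #include <stdio.h>

    /* Signed overflow is UB, so an optimizing compiler may assume
     * "x + 1 < x" can never be true and fold the check to 0. The result
     * then differs between -O0 and -O2, which looks exactly like a
     * compiler bug until you spot the UB. */
    static int will_overflow(int x)
    {
        return x + 1 < x;   /* UB when x == INT_MAX */
    }

    int main(void)
    {
        /* Commonly prints 1 at -O0 and 0 at -O2 with gcc/clang. */
        printf("%d\n", will_overflow(INT_MAX));
        return 0;
    }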


I’ve found two compiler bugs, both in rustc. One was a hang, so it was obvious. For the other, the people chiming in and I pretty quickly figured out it was a codegen problem; it turned out to be due to enabling noalias before it was ready. Just cut swaths of code until you get it down to no unsafe. It was a pretty good experience.


Delta debugging and C-Reduce are your friends there. I dearly miss creduce in my language of choice... Not that compiler bugs are that frequent, but when generating so much weird code I tend to trigger those bugs more frequently.


I've found that as I get increasingly experienced/skilled, the ratio of the problem being in my code vs. a compiler, framework or OS issue decreases.

Not that most problems still aren't in my code, but I've increasingly run into framework/library level issues and very rarely compiler issues. Can't say I've triggered any kernel issues yet.


Doing zero troubleshooting and immediately jumping to the conclusion that it's someone else's fault is frustratingly common. If you want to make a network engineer reflexively reach for a bottle of liquor, DM them on Slack "hey is the network down my app/service/database/whatever is unreachable".

Networks can and do go down of course. But in my experience, the vast majority of these issues are actually the result of something extremely mundane like a typo in a hostname. I've resolved an incredible number of issues over the years by simply reading the error message someone sent to me and asking them to check the exact thing that the message says is wrong.


https://bugs.gentoo.org/724314

Compiler bug or CPU issue, take your pick :)


Rarely, yes. Never? Haha, not in my experience. I've seen bugs in EVERYTHING, and let's not forget hardware bugs like the Pentium math (FDIV) bug.


Not the first annoying networking bug introduced in 6.0.16 — if you have a CIFS mount, you get a kernel panic. There's a confirmed bug report about it on Red Hat's and the kernel's bugzilla. They don't seem to be related, so it's just coincidence.


How is testing done in kernels? Is there unit testing, integration, end to end? I'm unfamiliar, but curious.


Not in the Linux kernel. There, it has traditionally been done by the developer sticking lots of printk statements into their code until they are happy with the outputs, and then removing most of them. Make of that what you will.


LOL that is so false that it's not even worth correcting.


Actually could you correct it? I don't know anything about the Linux kernel so I'm curious.


I'll correct myself. That was how it was done circa 2000-2005, but my info was way out of date because I haven't followed kernel development much for a while. Here's some stuff about how it's done now. https://www.kernel.org/doc/html/latest/dev-tools/testing-ove...

Basically, there is a system called KUnit for writing white-box unit tests, there is code-coverage tooling to determine coverage as you would expect, and then there are systems for static analysis and for verifying assertions.
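
For a flavour of what that looks like, here's a minimal KUnit sketch (the add() function under test is hypothetical):

    #include <kunit/test.h>

    /* Hypothetical function under test. */
    static int add(int a, int b)
    {
        return a + b;
    }

    static void add_test_basic(struct kunit *test)
    {
        KUNIT_EXPECT_EQ(test, 3, add(1, 2));
        KUNIT_EXPECT_EQ(test, 0, add(-2, 2));
    }

    static struct kunit_case add_test_cases[] = {
        KUNIT_CASE(add_test_basic),
        {}
    };

    static struct kunit_suite add_test_suite = {
        .name = "add-example",
        .test_cases = add_test_cases,
    };
    kunit_test_suite(add_test_suite);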


This is so wrong!!


As in that's not how it's done, or that's not how it should be done?


See the following

https://kunit.dev/ - unit tests for kernel

https://docs.kernel.org/dev-tools/testing-overview.html - entire testing overview.


There seems to be a missing piece here. The change that created this includes no tests. It appears to be completely untested.


As you can see from the patch that introduced the bug, the kernel is severely under-tested and has little to no testing culture among the core contributors. The patch changes logic but no tests, indicating that the changed code has insufficient test coverage.



It's encouraging to see so many new volunteers signing up to help maintain and test the stable branch!


The biggest mindset adjustment I ever had to make was moving from C++ to Node.js.

You can usually trust that select() is not broken, but select.js is deprecated, select-kitchensink.js is not compatible with your toolchain and unicyclect.ts leaks memory like a sieve


Out of curiosity I upgraded the kernel of my VPS running Debian 11 from 5.10 to 6.1. The networking became visibly slower: all TCP receive and send rates dropped to a quarter of the original or less. I had to restore the original kernel, and the network became fine again.

I don't think it is a bug in the kernel. More likely the userspace or the host system needs to be reconfigured to work well with the bleeding-edge kernel. Still, I am curious: what exactly might cause this dramatic drop?


> it's never a compiler bug

Back in the days of ISO-8859-1, you could make gcc crash by having the string literal "ä" in the source code...


How did this happen?

Which commit caused it?


Best explanation I found is here:

https://lore.kernel.org/stable/CAFsF8vL4CGFzWMb38_XviiEgxoKX...

A patch was backported to the 6.0 branch from the main branch, but one line of code was left out, leading to the buggy behavior.
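
For illustration only, here's the general shape of a userspace smoke test for bind() behaviour; this is a generic bind()-to-port-0 check, not the actual reproducer from the report (see the linked thread for that):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        socklen_t len = sizeof(addr);

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = 0;   /* ask the kernel for an ephemeral port */
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        /* On a good kernel this succeeds and getsockname() reports a
         * non-zero port; a regression shows up as an unexpected errno. */
        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
            perror("socket/bind");
            return 1;
        }
        getsockname(fd, (struct sockaddr *)&addr, &len);
        printf("bound to port %d\n", ntohs(addr.sin_port));
        close(fd);
        return 0;
    }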


I'll open by saying I'll forever be thankful for GKH... but this response kills me:

> As 6.0.y is now end-of-life, is there anything keeping you on that kernel tree?

Uh, several distributions. It wasn't EOL enough to prevent breaking it, so fix it.

Don't even technically need their input, Git and all.

I'll buy this EOL thing if they revert the change that caused this and stop releasing under 6.0. There were at least two more after this


>Uh, several distributions.

That is the kernel bug tracker, not the distributions bug tracker.

>It wasn't EOL enough to prevent breaking it, so fix it.

It wasn't EOL at the time the patch was backported. It's EOL now.

>I'll buy this EOL thing if they revert the change that caused this and stop releasing under 6.0. There were at least two more after this

Not sure what "this" in "two more after this" is, but there have been no 6.0 releases since it was EOLed.


> That is the kernel bug tracker, not the distributions bug tracker.

The point is that the contributor's reasoning for being on the tree is irrelevant. Like you said, they just made it EOL.

Distributions are the continuous/constant answer as to why countless people will be on it. This isn't an ancient release, something from the grave.

Is the expectation, then, that distributions would have to patch out the regression - or take on a more major upgrade (6.1 / 6.2), likely breaking something else?

Neither of these are particularly tenable. I'm glad GKH was willing to accept further changes to make it correct, but reverting is also applicable.

Breaking something, calling it EOL, and not fixing it is closer to dead than end of life.


>Is the expectation, then, that distributions would have to patch out the regression - or take on a more major upgrade (6.1 / 6.2), likely breaking something else?

Correct.

>Neither of these are particularly tenable.

Yes they are.

>Breaking something, calling it EOL, and not fixing it is closer to dead than end of life.

You're awfully confident about how things should work, even though you don't understand how they already work.


Distributions have their own release strategies to pick up what is now in the 6.1/6.2 trees. Some were cut at a bad time and are treating the now-EOL stable as longer-term.

This is really just a pedantic criticism of the handling of 6.0, and of the 'ignorance' (I hate the connotation of the term) about why people don't run the latest.

I'm not asking them to bend over backwards, here.

Things are going more or less the way I want, 6.0 will get fixed [edit: upstream]. Please don't take this the wrong way.


This is soooo BASIC!!

Why aren't there processes in place, either manual or automated, to check for things like this?

Aren't there some tests in place to check for such basic functionality errors?

Doesn't the kernel development process mandate such facilities? I'm sure the NSA, Unit 8200 and GCHQ have tests like this in place but don't share their findings.

Is it a matter of funding or leadership philosophy and priorities?


IIRC the Linux maintainers view themselves as providing a kernel for distros to bundle.

You can get a kernel from Red Hat that has been through Red Hat's release process. Red Hat has their own test suite/labs and will also pay attention to test results from elsewhere - including Fedora, their evergreen distro for putting new software into the wild ahead of its incorporation into Red Hat Enterprise Linux.

Substitute the distro of your choice.


Wondering whether there's a company out there that does kernel testing as a service. Give it your kernel config, some tunings of basic services, possibly your distro, and have an automatic test suite run for your subset: cyclictest, a syzkaller instance, some of your own stress-test apps. Might be useful in a world of Firecracker/microVMs with smaller kernel surfaces?


yeah right? doesn't the kernel have a test suite?


Every kernel subsystem has its own testsuite. Running all of them would require hundreds of different pieces of hardware, so it's not really possible for a single release manager to do so.

For Linus's releases this is easily solved by progressively slowing down the pace of development towards a release, so that cross-subsystem issues, where maintainer A breaks maintainer B's subsystem, become progressively less likely over the two months of the release cycle.

For stable releases this is much harder to do because of the short cycle. The stable branches in the end are a mostly automated collection of patches based on both maintainer input and the output of a machine learning model. The quality of stable branches is generally pretty good, or screwups such as this one would not make headlines; but that's more a result of the discipline of mainline kernel development than a virtue of the stable kernel release process.


> Running all of them would require hundreds of different pieces of hardware, so it's not really possible for a single release manager to do so.

The issue is that this bug is not hardware related. It's a pure software issue.

Hardware bugs are an entirely different kettle of fish.

BTW is that bonzini of GNU Smalltalk fame?


Yes, I agree that _this_ issue could have been found. But the parent was asking more generally whether "the kernel has a test suite", and the tests for both hardware-dependent (drivers, profiling, virtualization, etc.) and hardware-independent (filesystem, networking, etc.) aspects of the kernel are distributed across multiple testsuites.

The stable kernels' pre-release queue is posted periodically to the mailing list, and subsystem maintainers _could_ run it through their tests, but honestly I don't believe that many do. Personally I prefer to err on the other side: unless something was explicitly chosen for stable kernel inclusion and applies perfectly, I ask the stable kernel maintainers not to bother including the commit. This approach also has disadvantages of course, so they still run their machine-learning thingy and I approve/reject each commit that the bot flags for inclusion.

> BTW is that bonzini of GNU Smalltalk fame?

Yes it's me. :) Did we meet?


> Yes it's me. :) Did we meet?

I'm a fan of Smalltalk and I used to follow your development of GNU Smalltalk.

What's happened to it? It seems to have fallen by the wayside.


I got a job and a family. :)


So that might suggest that it's actually better to just track the latest version, rather than worrying about backports?


In my (limited) experience, the only reason to use backports is that you have closed-source kernel modules which you can't update (that of course ends up covering most Android phones and many SoCs).


There's also the official support of VMM things like Firecracker, which officially supports only 5.10 and maybe the latest, and otherwise it's "don't send bugs".


Distros standardize on a version not because it is more stable but because tooling (which might include 3rd-party kernel modules) can then rely on working with that version without a recompile.

If you don't have that constraint, then yeah, there's not much reason.


The vast majority of the time fixes like this that are being backported are straightforward fixes for bugs (security or otherwise) that require very little manual conflict resolution, especially if the fix is just being backported one or two kernel releases. The developer can often just cherry-pick the commit into a few recent release branches and most of the time git will just automatically do the merge correctly, or if there is a manual merge conflict it's something really simple. In fact, if there's a complicated merge conflict often the change won't be backported at all unless the bug is actually serious enough to warrant X hours of someone's time to do it and get code review etc. Most of the time this process works correctly, but obviously there's room for error and mistakes can happen.

There's a tradeoff between the risk of running an older kernel that has known bugs, upgrading to the latest new kernel which has bug fixes but may introduce new bugs, and getting backports for known bugs to your known working kernel. Most of the time the last option is reasonable but it definitely depends on your use case and what you're optimizing for.



