This is nothing new. I followed kernel development pretty closely about a decade ago and I was honestly very disappointed in what I saw. It's all about features, features, features, and there's almost no testing beyond "well, it works on my machine". It's kind of ironic that Linux has become the C++ of kernels, given Linus's feelings about C++: a giant pile of features, many of which nobody really understands, that can have some really weird corner case interactions.
I switched to FreeBSD as a result of my time reading LKML. The BSDs seem, at least to me, to be more designed rather than thrown together; but maybe I just haven't spent enough time watching their sausage get made.
Hopefully now that a senior kernel developer (Chinner) is saying some of these things publicly, things can get better. But a culture change of the magnitude needed will not come quickly.
I don't think it's a surprise that Linux is about adding more features in an organic way, instead of being well thought out like the *BSDs.
However, saying that no testing happens on the Linux kernel is dishonest, to say the least: there are automated test suites maintained by big corporations, like LTP [1] or autotest [2]; thousands of people run different versions of unstable/mainline kernels with different configurations; security researchers run fuzzers and report the issues they find; multiple open source projects run their test suites on current versions of the Linux kernel (which in the end also serves as a test of the kernel itself); etc.
Linux is basically the kind of project that is big enough and impactful enough that naturally gets testing for free from the community.
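To give a flavour of what those LTP tests look like, here is a minimal sketch against LTP's newer C test API (this assumes it is built inside the LTP source tree so that tst_test.h and its runner are available; the check itself is just a placeholder, not a real LTP test case):

    #include "tst_test.h"

    /* Trivial placeholder check: a real LTP test would exercise a specific
     * syscall or kernel interface, including its documented error cases. */
    static void run(void)
    {
        if (getpid() > 0)
            tst_res(TPASS, "getpid() returned a positive pid");
        else
            tst_res(TFAIL, "getpid() returned a non-positive value");
    }

    static struct tst_test test = {
        .test_all = run,
    };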
This has always been my problem with the "many eyes make all bugs shallow" theory. It's a random walk. The most common features and configurations will get tested over and over and over well beyond the point of usefulness. Less common features might not get tested at all. In fact it's worse than a random walk, because usage is even more concentrated in a few areas. Yes, the fuzzers etc. do find some bugs, but relative to the size and importance of the project itself they're smaller than they would be on most other kinds of projects.
There's just no substitute for rigorous unit and/or functional tests, constructed with an eye toward what can happen instead of just what commonly does happen in the most vanilla usage. Unfortunately, UNIX kernel development - not just Linux - has never been strong on that kind of testing. Part of the reason is the inherent difficulty of doing those sorts of tests on such a foundational component. Part of it is ignorance, which I mean as simply "not knowing" and not as an insult. Part of it is macho kernel-hacker culture. Whatever the reasons, it needs to change.
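To make that concrete, here is a rough sketch (plain POSIX C, no framework) of the kind of test I mean: drive an interface into a less-common state - a full pipe with a non-blocking writer - and assert the documented error behaviour, rather than only exercising the vanilla path:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        char buf[4096];
        ssize_t n;

        if (pipe(fds) != 0 || fcntl(fds[1], F_SETFL, O_NONBLOCK) != 0) {
            perror("setup");
            return 1;
        }
        memset(buf, 'x', sizeof(buf));

        /* Fill the pipe until the kernel refuses more data. */
        while ((n = write(fds[1], buf, sizeof(buf))) > 0)
            ;

        /* The documented behaviour for a full non-blocking pipe is EAGAIN. */
        if (n == -1 && errno == EAGAIN) {
            puts("PASS: full pipe reported EAGAIN");
            return 0;
        }
        fprintf(stderr, "FAIL: expected EAGAIN, got %zd (errno %d)\n", n, errno);
        return 1;
    }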
I have been in a few situations where I could have found bugs, because I was exploring deep behavior, but the lack of legible documentation on the expected behavior made me back off. I have no clue what's expected or not; the man pages are just a joke and LKML is mostly unreadable.
> The most common features and configurations will get tested over and over and over well beyond the point of usefulness. Less common features might not get tested at all.
Not that I wouldn't want a full-featured Unix-like kernel with a comprehensive[1] test suite (and I don't know of any major OS that has this: not Windows, macOS, or Linux), but I think this is fine.
Common setups work fine; uncommon configurations may have some problems, with possible workarounds. Most people will not run mainline kernels anyway, so a workaround is acceptable.
[1]: What I mean by comprehensive is basically having tests for every possible configuration; that is basically impossible anyway. Probably the closest thing we can get is a formally proven OS, but I don't think we will ever have a general-purpose OS that is formally proven.
> tests for every possible configuration, that is basically impossible anyway
Agreed. Expecting 100% across the entire kernel would be totally unreasonable. OTOH, coverage could be better on a lot of components considered individually.
Any component as complex as XFS is going to have tons of bugs. I don't mean that as an insult. I was a filesystem developer myself for many years, until quite recently. It's the nature of the beast. The problem is how many of those bugs remain latent. All the common-case bugs are likely to be found and fixed pretty quickly, but that only provides a false sense of security. Without good test coverage, those latent less-common-case bugs start to reappear every time anything changes - as seems to have been the case here. That actually slows the development of new features, so even the MOAR FEECHURS crowd end up getting burned. Good testing is worth it, and users alone don't do good testing.
I think filesystems in the kernel do have automated tests; I know xfstests [1] exists, at least. And they exist exactly because filesystem bugs are generally critical: a filesystem bug generally means that someone will lose data.
[1]: Despite what the name may suggest, xfstests is run on other filesystems too. Here is an example of xfstests ported to ZoL (ZFS on Linux): https://github.com/zfsonlinux/xfstests
Yes, xfstests exists. I've used it myself and it's actually pretty good for a suite that doesn't include error injection. But part of today's story is that xfstests wasn't being updated to cover the last several features that were added to XFS. The result is exactly the kind of brittleness that is characteristic of poorly tested software. Something else changed, and previously latent bugs started popping out of the rotten woodwork.
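For what it's worth, the kernel does have a generic fault-injection framework that can stand in for some of the error injection xfstests lacks. A minimal sketch, assuming a kernel built with CONFIG_FAULT_INJECTION_DEBUG_FS and CONFIG_FAILSLAB, debugfs mounted at /sys/kernel/debug, and root privileges; the values are only illustrative:

    #include <stdio.h>
    #include <stdlib.h>

    /* Write one value to a fault-injection knob under debugfs. */
    static void knob(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); exit(1); }
        fprintf(f, "%s\n", value);
        fclose(f);
    }

    int main(void)
    {
        /* Fail ~10% of slab allocations, at most 200 times, and log each one. */
        knob("/sys/kernel/debug/failslab/probability", "10");
        knob("/sys/kernel/debug/failslab/interval", "1");
        knob("/sys/kernel/debug/failslab/times", "200");
        knob("/sys/kernel/debug/failslab/verbose", "1");

        /* From here you'd run the filesystem workload under test (e.g. an
         * xfstests group) and check that allocation failures unwind cleanly
         * instead of corrupting state or oopsing. */
        return 0;
    }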
What’s crazy for me, after reading all this, is how wonderfully stable my Linux (kernel-level)[0] experience has been. I’ve never used any non-ext2/3/4 filesystems, granted, so I haven’t used this code, but I find it hard to believe that these findings are indicative of the code I have used on a relatively run-of-the-mill amd64 machine. So maybe if you’re like me, using a fairly standard distro with the official kernel on somewhat normal hardware, you would have the benefit of millions others testing the same code.
[0]: I have had more than my fair share of user land problems, but I have come to expect that on any platform.
Yeah I've also had good experiences with Linux reliability.
But that's because I intentionally stay on the "happy path" that's been tested by millions of others. I avoid changing any kernel settings and purposely choose bog-standard hardware (Dell).
When you're on the other side, you're not just maintaining the happy path. You're maintaining every path! And I'm sure it is unbelievably complex and frustrating to work with.
-----
Personally I would like software to move beyond "the happy path works" but that seems beyond the state of the art.
Over time you get trained not to do anything "weird" on your computer, because you know that say opening too many programs at once can cause a lockup. Or you don't want to aggressively move your mouse too much when doing other expensive operations. (This may be in user space or the kernel, but either way you're trained not to do it.)
There is another post that I can't find that is about "changing defaults". I used to be one of those people who tried to configure my system, but I've given up on that. The minute you have a custom configuration, you run into bugs, with both open source and commercial software.
The kernel has thousands of runtime and compile-time options, so I have no doubt that there are thousands upon thousands of bugs available for you to experience if you change them in a way that nobody else does. :)
> Or you don't want to aggressively move your mouse too much when doing other expensive operations. (This may be in user space or the kernel, but either way you're trained not to do it.)
Operant conditioning by software bugs is totally a thing, but for this particular example I was trained into exactly the opposite behaviour. I do move my mouse a lot during very resource-intensive computations, because that lets me gauge the load on my system (is there UI animation lag? is there cursor movement lag?), and in extreme cases, it tells me when it's time to do a hard reboot. I've also learned through experience that screensavers, auto-locking, and even auto-poweroff of the screen can all turn what was a long computation into a forced reboot, so avoiding long inactivity periods is important.
This conditioning comes from growing up with Windows, but I hear people brought up on Linux have their own reasons - apparently it used to be the case (maybe it still is?) that some computations relying on a PRNG would constantly deplete the OS's entropy pool, and so just moving your mouse around would make those computations go faster.
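A tiny illustration of that last point, assuming an older kernel (the blocking /dev/random pool was removed around Linux 5.6): a read like the one below could stall until the entropy estimate recovered, and interrupt-driven input such as mouse movement was one of the things feeding it:

    #include <stdio.h>

    int main(void)
    {
        unsigned char buf[64];
        size_t total = 0, n;
        FILE *f = fopen("/dev/random", "rb");

        if (!f) { perror("/dev/random"); return 1; }
        while (total < sizeof(buf)) {
            /* On an idle pre-5.6 kernel this fread() could block for a long
             * time; wiggling the mouse refills the pool and lets it continue. */
            n = fread(buf + total, 1, sizeof(buf) - total, f);
            if (n == 0) break;
            total += n;
        }
        printf("read %zu bytes from /dev/random\n", total);
        fclose(f);
        return 0;
    }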
Your lens is probably too small; Paul does a great job of quantifying bug probability in more human terms in the presentation above.
I ran all OS dev and support in an environment with 10000 FreeBSD systems and 3000 Linux systems in high scale production. The ratio of kernel panics was similar (although Linux tended to exhibit additional less rigorous failure modes due to more usage of work queues and other reasons). You could expect at least a couple faults per day depending on the quality of hardware and how far off the beaten path your system usage is at this scale. The big benefit I found with FreeBSD is that I could understand and fix the bugs, and I was generally intimidated by doing so on Linux.
How do you define "thoroughly" anyway? Just like a comment in the original post says:
>>But anything less than 100% coverage guarantees that some part of the code is not tested...
> And anything less than 100% race coverage similarly guarantees a hole in your testing. As does anything less than 100% configuration-combination coverage. As does anything less than 100% input coverage. As does anything less than 100% hardware-configuration testing. As does ...
It is combinatorially impossible to "thoroughly" test something as large and complex as the Linux kernel. Other than, e.g., SQLite, I struggle to think of a single substantial piece of software that is truly thoroughly tested.
I don't know why you'd be surprised by this. The top kernel on kernel.org's front page is literally just Linus's personal tree. "Upstream" is context-sensitive but "mainline" has always been Linus's personal tree.
In reality, what matters to most users is what goes into Debian stable or RHEL, and those kernels are far from just "works for me" tested.
Linux has always been treated like it's Linus's personal project. That's why more than a few of us believed that Linus taking a 'break' a few months ago was the beginning of the end. It turns out that not much changed, but the culture is still there.
If FreeBSD had received a fraction of the resources Linux has received over the last decades, it would have its own share of scars left by the growing pains.
For years FreeBSD development was very stagnant while Linux exploded in popularity and support. The comparison is silly.
Having sent the kernel maintainers a patch adding support for hardware they don't have (in my case, it was a network card): they merge the code after just reading it over and checking that it compiles. Keep in mind, however, that they usually have decades of experience in their particular area of the kernel, so they can often tell at a glance when you're doing something wrong or unusual.
In practice, such code dumps usually get rejected out of hand because most unknown first-time contributors are unable to produce a fully functional driver without violating the coding practices of that subsystem and re-inventing the wheel in some fashion. The maintainers always prefer companies to do their driver development in a more open fashion so that there can be two-way feedback throughout the process to ensure the end result is something worthy of merging. Developers who play by the rules and ensure that their driver both fits into existing infrastructure and doesn't break anything else are usually developers who can be trusted to deliver code that works on the new hardware that's not yet public.
From my experience dealing with upstreaming, it has become some sort of priesthood where you are required to go through the proper mantras and speak the proper language. And most of the priests no longer have any real idea of what the real world is doing with the kernel.
I spent a few months upstreaming a subsystem earlier last year, something that had been tested in the field, at customers, and I had to literally gut it to fit the priesthood's way of doing things. Ultimately the result that went into the kernel was technically /inferior/ to the original source, and any rant about /that/ was pointedly ignored.
I'm not even going to mention the device tree bindings, which have become as bad as the high days of XML, where the format took on a life of its own and requires its own maintainers. It's completely bonkers.
I think that since a lot of maintainers became 'professionals' they no longer use Linux. They just juggle patches all day and talk among themselves and their clique. And as long as it fits the big tech companies that pay them, it doesn't matter if it's actually /useful/ to anyone else.
"We trusted people" is a classic denial of responsibility. The people shoving in the new-feature changes helped create the problem. It's great that at least one of them seems to have had a change of heart, but starting by blaming (unspecified) others suggests that the change might not last long. Expect reversion to form in a few months.
There was one bug for a while where IPSec would not handle TCP packets[0]. This was a big one since sending TCP packets over an IPSec tunnel is a somewhat common scenario.
I had to keep an old kernel for quite a while before that one was fixed.
It seems that a lot of the quality control happens with the distributions, not with the upstream software itself. I doubt that SLES would have seen this bug, but because I was running Tumbleweed I have to expect breakage like that.
I did some fuzzing on filesystem tools a few years ago. xfsprogs was... "interesting".
Interesting in that it was seemingly impossible to find anyone to report bugs to. They had a bug tracker, but it didn't work, submitting bugs resulted in an error. I think there was also a mail address that bounced.
I think in the end my bug reports reached no one who could care about them.
There is xfs.org and the XFS mailing list. You can just send an email there. Maybe you were sending HTML emails or something that made the message bounce?
I.e. if you can’t sing, then you can’t point out that the singer is out of tune, therefore they aren’t out of tune, and everything’s fine?
Submitting a request which says “can you clarify if this filesystem behaviour is expected?” then bringing it up two years later saying “this has been unstable for a while” does not seem like “complaining” or “beating a dead horse”.
Users complaining that things are broken < Users complaining that things are broken and then submitting a patch (or at least some attempt to unbreak things)
It's exceedingly rare that a maintainer of anything would scoff at all patches from users to help fix things.