
> Linux is 15+M lines of code....But at only 9,000 lines, Unix v6 was tractable, people continued to study it (and maybe still do), and the famous comment lived on.

> Reading real programs is a part of learning computer science that should be emphasized more. As an undergraduate, we mostly programmed in BCPL, a local Cambridge language that is the fore-runner of C and C++. We had access to the full source code of the compiler and reading it was as valuable as the more theoretical aspects we learned in our compiler-writing course. Today, with open source, it is easy to read the source code for many systems actually in use (Linux, the Apache web server, Hadoop, TensorFlow, and thousands more) but these are hundreds of thousands, if not millions, of lines of code. As I said above, Linux is somewhere between 15M and 20M lines of code, depending on just what you include.

This is an interesting problem. Modern software is so huge and complicated that it's not feasible to go read it just as a learning exercise. Linux devs can probably spend years contributing to the project without having even visited all of its dark corners, much less understanding them.

Not sure what can be done. Though I am tempted to go see what a 9,000 line OS looks like.



>Modern software is so huge and complicated (...) Not sure what can be done.

Nothing. Any sufficiently complex system or body of knowledge is too large for any one person to understand. Does any physicist know ALL physics that was ever done? Does any one engineer know how to build a Ferrari? Let's be honest: we take for granted a lot of stuff that we don't fully understand. Linux is a major piece of infrastructure; it's natural that nobody can fully wrap their head around it.

> Reading real programs is a part of learning computer science that should be emphasized more.

I had never thought of this in an educational sense, but I definitely agree. I sure wish I'd been taught more of that; toy problems get boring quickly.


This. The strength of modern civilization is the ability to abstract. This means that you can let Joe deal with the frobnicator and Jane design the fritzer, and you just need to trust them enough that you can plug the two together and it works.

Joe and Jane don't need to understand what you are doing, and you don't need to understand what they are doing. Of course partial understanding, especially of related technology, is helpful, but the farther away you get, the less you know, and the more you need to rely on others.


Ahh... labor alienation. One of the great drawbacks and geniuses of our primary modes of production in modern society. As we become more and more separated from the things we actually work on, what are the drawbacks? The general trend is that if we want to do more, we create more separation. The social impact, for one, is in general a large extent of estrangement from our human essence. Tangibility is intuitive, abstraction, not so much.


I fear there is another potential pitfall to this. When you've heard "the Linux kernel is 15M LOC" you might conclude "reading other code is unattainable", as quite a few commenters in this thread do. It is not: there's very neat OSS out there that you can pick apart in a spare hour and go "AHA!".

It might even extend to other niches. "Baking your own bread" is not that hard, if you get to it. But because the entire process (fertilising the ground, selecting seeds, growing grain, milling it) is too much for one person, lots of people just give up altogether and conclude that "today it is impossible to make your own food", especially once they learn that bread mostly comes from large, complex industrial operations. Yet "just doing it" is very rewarding, as all the home bakers will tell you.

And it spans across time as well. Sure, no one can make a toaster from zero: mine your own ore, refine your own plastics, etc. (there's a brilliant documentary from someone attempting just this, I forget its name). So people conclude that when the toaster stops working there's nothing one can do, and they turn to AliExpress or Amazon and just buy a new one.

In short: I think labor alienation is toxic, in that it seeds itself into areas that don't need hyper specialisation.


I picked up this book in a bookstore years ago, still have it on my shelf.

The Linux Core Kernel Commentary [1]

It takes only the parts of the kernel you need to understand, leaving out the drivers, device classes, and many of the ancillary pieces. For the parts it chooses, it goes through them line by line, follows the boot process, and explains the concepts of a Unix/SysV-like kernel.

The 15M lines become much less intractable.

[1] https://books.google.com/books/about/Linux_Core_Kernel_Comme...


This comment feels like something written by GPT-3. What are you actually trying to say here?


GP is saying that hyper-specialization is intrinsically fragile.


What is fragile? We have had specialized roles for the entire history of society and it hasn't caused anything to crumble.

I suppose you could say many things have failed due to no one seeing the big picture among all the tiny details, but without people focusing on the details, none of that would have been possible to begin with.


> What is fragile? We have had specialized roles for the entire history of society and it hasn't caused anything to crumble.

I'm not a historian, but I can imagine that a medieval specialist, such as, for example, a blacksmith, could very much fix his own house (or maybe even build a new one), grow food, take care of animals, make wooden tools, make some simple herbal medicine, etc. I'm guessing a lot of what he consumed was produced by himself and his family. Today, for engineers, it's the total opposite: we only consume things produced by others, essentially by strangers whom we've never met.

Regarding fragility specifically, if a major crisis comes (say a war), the desk jockeys, especially the ones who didn't have the wisdom to accumulate resources, will fare much worse than someone who's, say, running a homestead. If they did accumulate resources (provided those don't become worthless through hyperinflation etc.), they will be able to buy the essentials and hope their stash lasts until the end of the crisis.


As things get more complex, more people become specialized, and there are a lot more interfaces between specialties that need to be properly defined so things don't break. More interfaces mean a higher likelihood of the system going wrong. That, and I feel that as long as you have a good big picture, you have higher fault tolerance, because there is a general goal you can identify and strive towards to correct things when something does go wrong. Certainly, in any field that has become sufficiently complex, there is a practical limit to how much a single person can understand of the field as a whole. Even someone who can claim to understand a field as a whole in a general sense, and how its major components interact, will still lose information because of how much they have to abstract.


Reality does not seem to match that, though. Pretty much everything not born out of negative emotions is better than it was 10, 20, 30 years ago. Tons of jobs don't need to be done anymore, and more are becoming automated, which is quite often better than relying on human beings' much greater randomness.


> What is fragile? We have had specialized roles for the entire history of society and it hasn't caused anything to crumble.

You should read The Collapse of Complex Societies by Professor Joseph Tainter.

Tl;dr: a step change in the complexity of ancient societies more often than not led to their downfall. They were unable to train/retrain specialists quickly enough to deal with surprises.


I can't agree with that.

>> The social impact, for one, is in general a large extent of estrangement from our human essence. Tangibility is intuitive, abstraction, not so much.

Specialization does introduce fragility, but that's something you brought to the discussion, not something that was present in the comment. This seems more like the common complaint that working on abstract fuzzy intangible things is damaging to your soul.


Labor alienation is Marx's theory that wage labor alienates the laborer from their humanity, their coworkers, and the product they are ultimately making. It’s a high-minded way of saying “you’re a cog in the machine and that is very bad”.

Using it as a counter argument against specialization isn’t very insightful though. Not even Marx tried to argue that (to Marx specialization was only alienating when you did it for the benefit of the bourgeoisie). Specialization is one of the most fundamental components of a developed economy. If you look at any ancient agrarian society, they had barely any specialization compared to a modern developed economy, and if you track the economic development of any society, you’ll notice specialization ticks upward along with it.


There are concrete measurable benefits to the complex things we build. Improvements to quality of life, increased time for recreation, etc.

On the other hand, you make a fair point - tangible contribution to Things is probably built into us from our origin as tool-makers.

But isn't the gradual separation from direct contact with Made Things inevitable? At some point the body of knowledge required to make a meaningful contribution grows so large that learning it approaches the human lifespan. How can we avoid too much hyper-specialization in that case?


What if computing is less like building tangible products and more like reading and writing?


Since childhood, I have periodically tried to reason about when the latest moment in time was that a single human could have substantially all the knowledge that humans collectively possessed.

As a kid, it was an easier thing to reason about (wrongly). As an adult, I’ve concluded that it was never possible to both have something we’d think was civilization and for one person to know substantially everything humans knew.


You don't need civilization for this to be unfeasible. People tend to underestimate how much nomadic people know about things like animal behavior, the medicinal use of natural substances, or geography, especially with regard to the change of seasons. Arguably, it was never possible to know the entirety of human knowledge.


There are some people with a massive capacity for holding the entire system in their head. John Carmack only reached the point of "too much" during Rage, for example. Pretty impressive!


A good physicist has the ability to grasp a large fraction of all physics that has ever been done. And there is a pretty good directed knowledge graph. The action principle and some ideas from quantum mechanics are enough to anchor almost all other results.


Viewpoints Research Institute claims that they did all of personal computing (including the antialiased graphics needed to render SVGs, TCP/IP, and an office application which could handle both word-processing and PowerPoint-style workloads) in under 20,000 lines of code. (Usually a TCP/IP stack by itself is 10k LOC, so this is pretty impressive.)

The basic idea that Alan Kay has presented in his many similar YouTube presentations (most of which run with that office application) is what we today call domain-specific languages: you write the rules for antialiasing a pixel in whatever the ideal language for those rules would be. You imagine that you have a dream language and "wishful think" the solution; then you go and implement that DSL in another language that you are wishfully conjuring, a language for building languages, and that final one can become self-describing and can compile and optimize itself.

This is also a method of programming advocated in MIT's old SICP lectures, which will also cover the implementation of Lisp in itself, which is sort of a prerequisite for it.
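
A tiny sketch of that "wishful thinking" style in plain C (hypothetical names, nothing to do with the actual Nile/STEPS code): state the top-level rule against functions you wish existed, then grant the wishes below.

    #include <stdio.h>

    /* Step 1: write the antialiasing rule against functions you wish existed. */
    double coverage_of(double px, double py);               /* wished for */
    double blend(double dst, double src, double coverage);  /* wished for */

    double shade_pixel(double dst, double src, double px, double py) {
        return blend(dst, src, coverage_of(px, py));
    }

    /* Step 2: grant the wishes. Coverage of the half-plane x + y < 1 inside
     * the unit pixel whose corner is at (px, py), estimated by 4x4 supersampling. */
    double coverage_of(double px, double py) {
        int hits = 0;
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++) {
                double sx = px + (i + 0.5) / 4.0;
                double sy = py + (j + 0.5) / 4.0;
                if (sx + sy < 1.0) hits++;
            }
        return hits / 16.0;
    }

    double blend(double dst, double src, double coverage) {
        return dst * (1.0 - coverage) + src * coverage;
    }

    int main(void) {
        printf("%.3f\n", shade_pixel(0.0, 1.0, 0.0, 0.25));  /* pixel straddling the edge */
        return 0;
    }

The DSL step is the same move one level up: once the rules read like the math, you write a small translator for the notation the rules are written in.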

That all said, I would love to read the source code someday, but I can't seem to find it anywhere; the software product may be the KSWorld described in [1].

[1] http://www.vpri.org/pdf/tr2013002_KSonward.pdf


Actually, we didn't claim to have successfully done this.

More careful reading of the proposal and reports will easily reveal this.

The 20,000 lines of code was a strawman, but we kept to it, and got quite a bit done (and what we did do is summarized in the reports, and especially the final report).

Basically, what didn't get finished was the "bottom-bottom" by the time the funding ran out.

I made a critical error in the first several years of the project because I was so impressed with the runnable-graphical-math -- called Nile -- by Dan Amelang.

It was so good and ran so close to actual real-time speeds that I wanted to give live demos of this system on a laptop (you can find these in quite a few talks I did on YouTube).

However, this committed Donald Knuth's "prime sin": "premature optimization is the root of all evil".

You can see from the original proposal that the project was about "runnable semantics of personal computing down to the metal in 20,000 lines of code" -- and since it was about semantics, the report mentioned that it might have to be run on a supercomputer to achieve the real-time needs.

This is akin to trying to find "the mathematical content" of a complex system with lots of requirements.

This is quite reasonable if you look at Software Engineering from the perspective of regular engineering, with a CAD<->SIM->FAB central process. We were proposing to do the CAD<->SIM part.

My error was to want more too early, and try to mix in some of the pragmatics while still doing the designs, i.e. to start looking at the FAB part, which should be down the line.

Another really fun sub-part of this project was Ian Piumarta's from-scratch runnable TCP/IP in less than 200 lines of code (which included parsing the original RFC about TCP/IP).

It would be worth taking another shot at this goal, and to stick with the runnable semantics this time around.

And to note that if it wound up taking 100K lines of code instead of 20K, it would still be helpful. One of the reasons we picked 20K as a target was that -- at 50 lines to a page -- 20K lines is a 400 page book. This could likely be made readable ... whereas 5 such books would not be nearly as helpful for understanding.


Ah, I see: you are objecting that you did not complete what you set out to do. I think it is quite noble of you to admit that, but at least the demo presentations were quite impressive, even if they lack a certain totality.

Well anyway, thanks a lot for these presentations. I don't think they're as good as a software course where someone breaks all my preconceptions of what computing is, to leave me truly free, but they are helping to widen my thinking.


Final report (2012, released 2016) of the STEPS project, which has some more info, but sadly no reference to the source code... http://www.vpri.org/pdf/tr2012001_steps.pdf

Did some digging and found this site with some code, but I haven't looked closely enough to determine what it is specifically for... http://tinlizzie.org/dbjr/ And it is not exhaustive of what the PDF describes. Maybe someone is able to contact one of the original group working on this?


What is amusing about the VPRI project is that, even though the number of lines is low, nobody has managed to replicate what they did. Or, more precisely, nobody was able to build something that works from the pieces of code they left behind. Impressive research, but a poor software project; apparently they hadn't heard of SCM.


> Though I am tempted to go see what a 9,000 line OS looks like.

Here you go: http://v6.cuzuco.com/

I believe most of the Linux kernel source is actually drivers, or otherwise code for hardware you don't have, so the amount of code actually relevant to e.g. a typical x86 PC might be an order of magnitude less.


>Not sure what can be done. Though I am tempted to go see what a 9,000 line OS looks like.

I think you are going to like STEPS:

>The overall goal of STEPS is to make a working model of as much personal computing phenomena and user experience as possible in a very small number of lines of code (and using only our code). Our total lines of code target for the entire system -- from user down to the metal is 20,000, which we think will be a very useful model and substantiate one part of our thesis: that systems which use millions to hundreds of millions of lines of code to do comparable things are much larger than they need to be.

http://www.vpri.org/pdf/tr2012001_steps.pdf


Or Oberon, which is a top-to-bottom system that is described in a single book.


> Modern software is so huge and complicated that it's not feasible to go read it just as a learning exercise.

I can't agree. Linux is certainly huge, and yes, its size means it's beyond any newbie's capacity to understand it all.

But nonetheless lots of newbies contribute to it regularly. Most of them build up the understanding they need by reading the kernel code. Obviously, that means you don't have to understand all of it to get a lot of benefit from just reading a small subsection.

So it is indeed possible to go read the Linux kernel just as a learning exercise. The key point is you don't have to read all of it to get a benefit out of it. Reading just a small subset works just as well as reading the entire source of a small project.


Yes, this is what some people seem to be missing here. Much of the Linux code implements drivers or subsystems that you might not be interested in to begin with (such as the network packet filter). Kernel code is highly modular. The code that implements trivial things like context switching is still much more complex than in an '80s Unix, but it is not millions of lines.


I put this up for someone on HN who asked about it:

https://jacquesmattheij.com/task.cc

3,500 lines; microkernels can be even smaller than that.


> "Linux is 15+M lines of code"

But how much of it is the actual OS versus device drivers and other "extensions" of the main kernel? I assume there is a core somewhere that everything plugs into, with more or less the same functionality as the 9,000 lines of the original Unix v6 code, and that it uses only a fraction of all the code in the kernel.


It's almost entirely (80%) drivers. Next heaviest is architecture-specific code, then filesystems, networking, sound, documentation, and then the actual core of the kernel. A minimal build of Linux uses about 300k lines of code.

https://unix.stackexchange.com/questions/223746/why-is-the-l...

So it's still substantially heavier than would be easy to study, but you could probably still fit the core in your head (I've been able to deeply understand systems with similar LOC counts, though there are probably substantial differences in 'density' between codebases).


A profiler will often reveal an interesting cross-section of the code that you can read first. Any way to get a stack trace can serve as a rudimentary profiler: just keep looking at stack traces until you see what comes up most commonly. If you've got good tools, you can get a map of the code you want to read first. Here is an old MySQL executing some queries: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#...

A debugger can also help. You can step through code as it runs. You can also often identify which code is associated with a feature, by putting breakpoints in likely places (dispatch functions!) and triggering the feature.
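
As a rough illustration of the "just keep grabbing stack traces" idea, here is a minimal in-process sampler in C using glibc's execinfo API (the names and the 100 Hz rate are my own choices; this is a sketch, not the tooling behind the linked flame graphs):

    /* cc -O0 -g -rdynamic sampler.c */
    #include <execinfo.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    /* On every SIGPROF tick, dump the current call stack to stderr.
     * Sorting and counting the output afterwards shows what comes up most. */
    static void on_tick(int sig)
    {
        void *frames[64];
        int n = backtrace(frames, 64);
        (void)sig;
        backtrace_symbols_fd(frames, n, STDERR_FILENO);
        write(STDERR_FILENO, "---\n", 4);
    }

    static void start_sampler(void)
    {
        struct sigaction sa;
        struct itimerval tv;
        void *prime[1];

        backtrace(prime, 1);   /* warm up the unwinder outside the handler */

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_tick;
        sa.sa_flags = SA_RESTART;
        sigaction(SIGPROF, &sa, NULL);

        tv.it_interval.tv_sec = 0;
        tv.it_interval.tv_usec = 10000;   /* 100 samples per second of CPU time */
        tv.it_value = tv.it_interval;
        setitimer(ITIMER_PROF, &tv, NULL);
    }

    /* Toy workload so there is something to sample. */
    static double busy(void)
    {
        double x = 0;
        for (long i = 1; i < 200000000L; i++)
            x += 1.0 / (double)i;
        return x;
    }

    int main(void)
    {
        start_sampler();
        printf("%f\n", busy());
        return 0;
    }

Pipe stderr through sort | uniq -c and the hottest frames float to the top; that's the crude version of what the flame-graph tooling automates.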


> > [...] at only 9,000 lines, Unix v6 was tractable, people continued to study it (and maybe still do), and the famous comment lived on.

Wow, Unix is a work of art. I have hobby projects at 2x the SLoC that are literally just small web services :(


I wonder how many lines of code were behind the QNX demo disk?

http://toastytech.com/guis/qnxdemo.html


Kernel + include files: about 10Kloc

File system: about 15Kloc

Network driver + IP stack: about 10 Kloc

Task manager, POSIX-compatible libc odds and ends: maybe another 50Kloc

Graphics driver: 10Kloc

Window manager: 2,500 lines

So not that bad; you could do it by yourself in about two years of hard work, probably less if you use a VM instead of actual hardware, assuming you're a halfway competent programmer. I've done it.


I wonder the same about the GeoWorks DOS version of America Online


> Linux is 15+M lines of code (...) Modern software is so huge and complicated (...) Not sure what can be done.

Perhaps we need something more modular, like a microkernel architecture :)


We do actually. Not that it will happen, but we do. We need a hard real time micro kernel based OS for our personal computers, and not having it is a major blocker for a lot of applications that would then be trivial.


Okay, I'll bite: what would this enable, and why do you think we all need hard real time?


Hard real time implies that you get to choose what runs and when. It allows you to treat user input and output (arguably the most important aspect of using a computer) as a high-priority affair, and to guarantee that the user will never have to wait for their computer to respond to them. If there is one thing that irritates me about modern-day computing, it is that it is in many ways slower (in spite of having 100 times the clock speed of a machine from the early '90s) than the box I had on my desk back then (which was running QNX).

There are exactly zero reasons why this is the case other than a stupid ego war between Andrew Tanenbaum and Linus Torvalds, who was gung-ho on recreating an OS from the '70s.
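
The closest a stock Linux desktop gets to this today is asking the scheduler nicely. A sketch (thread body and priority value are made up for illustration) of pinning an input-handling thread to SCHED_FIFO; unlike on a hard real-time kernel, this is a priority hint with no guarantee behind it:

    /* cc -pthread input_prio.c ; needs CAP_SYS_NICE or an rtprio rlimit */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical input loop: in a real program this would block on the
     * input device and hand events straight to the UI with minimal work. */
    static void *input_thread(void *arg)
    {
        (void)arg;
        for (;;)
            usleep(1000);   /* stand-in for "wait for the next event" */
    }

    int main(void)
    {
        pthread_t tid;
        struct sched_param sp = { .sched_priority = 80 };
        int err;

        pthread_create(&tid, NULL, input_thread, NULL);

        /* Ask for a fixed real-time priority for the input path. */
        err = pthread_setschedparam(tid, SCHED_FIFO, &sp);
        if (err != 0)
            fprintf(stderr, "no RT priority for you: %s\n", strerror(err));

        pause();   /* the rest of the (non-RT) program would run here */
        return 0;
    }

Even with this, the kernel only promises to prefer the thread; a hard real-time kernel promises a bound on how late it can run, which is the property I'm after.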


> If there is one thing that irritates me about modern-day computing, it is that it is in many ways slower (in spite of having 100 times the clock speed of a machine from the early '90s) than the box I had on my desk back then

This irritates me as well, but blaming it on the kernel strikes me as absurd. Blame it on all the layers of abstraction and general lack of caring in applications instead.

It's true that there are cases where the Linux scheduler gets in the way of a good desktop experience, but desktop latency is too high even when the scheduler isn't a bottleneck. And if anything, good scheduling gets harder in a microkernel, not easier, because you have less insight into what is going on at a system level. (Ultimately, the distinction simply shouldn't matter though -- not for desktop experiences, that is.)


The kernel is where that problem originates; if you don't fix it there, the higher layers have no chance of succeeding.


But that's just wrong. There are a lot of reasons for those input delays that have absolutely nothing to do with the kernel. A bloated Electron app will not suddenly be blazing fast just because it runs on a real-time kernel. The true issues with the modern software stack lie in userspace.


You are missing the point entirely, but that's fine; I don't have the time right now to debate this to the point where you will understand. But unless you have used a microkernel-based OS with a desktop interface (and chances are that you haven't), it is probably better to find access to something like that first (say, an older single-disk QNX installation) and play around with it for a bit to understand what I'm getting at.

Yes, there are plenty of problems in user space. But those are not the root cause and without addressing the root cause you will not be able to solve the problem even if you don't have bloated software.


While I agree that hard real time is desirable, it isn't necessary for responsive interfaces. Plenty of older, comprehensive operating systems managed to be responsive without it, BeOS and the Amiga being examples.


Both BeOS and Amiga took a lot of leaves out of the realtime book, and their schedulers were top notch. So 80% of the way there, I would say. But being able to give user processes real-time guarantees makes so many applications easier to write: for instance video players, MIDI sequencers, real-time audio processing and so on. Right now we rely on ridiculously long buffers to ensure that we never run out of data, but on a real-time OS you could do this with minimal latency.
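
To put rough numbers on the buffering trade-off (illustrative sample rate and buffer sizes, not measurements of any particular system):

    #include <stdio.h>

    /* Latency contributed by one audio buffer of N frames at 48 kHz. */
    int main(void)
    {
        const double rate_hz = 48000.0;
        const int frames[] = { 64, 256, 4096 };
        for (int i = 0; i < 3; i++)
            printf("%4d frames -> %6.2f ms of buffering\n",
                   frames[i], 1000.0 * frames[i] / rate_hz);
        return 0;
    }

That prints roughly 1.3 ms, 5.3 ms and 85 ms. A scheduler that can guarantee the audio thread runs within a millisecond or two of its deadline makes the small buffer safe; without such a guarantee you pad with the big one to survive scheduling hiccups.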


'The jury is in, the monolithic kernel has lost'


It's interesting to think about how modern languages help you squeeze more functionality into those 9000 LOC.


> modern languages

Lisps are hardly a modern invention. "Modern" languages do seem to be ever so gradually working their way towards full compile time metaprogramming though, gaining a great deal of expressiveness in the process.


I've found the series of books The Architecture of Open Source Applications very helpful in this regard.


Apparently the MINIX 3 kernel is about 6k lines, and it can run a lot of NetBSD on top. https://en.wikipedia.org/wiki/MINIX_3


There is so much driver and feature spam in the Linux codebase. I really wish a beautiful, minimal microkernel approach would have won out.

Why can't distros be a package of microkernel + drivers?


Why does it matter? If you want to read the scheduling code of Linux, no one forces you to read 5M+ lines of driver code. If Linux were just a microkernel and distros shipped drivers, you'd still have to read roughly the same number of lines to understand Linux scheduling. If you want to read the entire Linux code base, you can still do it; just ignore the driver code if you're not interested in it. Otherwise, in your scenario, you'd be reading driver code in the distros as well.


We literally read basically all of the Linux scheduler/syscall code in my computer engineering classes in the first couple of weeks; the rest of the course was reimplementing a subset of it for a custom (RT)OS on some Cortex-M3 microcontroller dev boards.

The first week or so of the process was learning how to go from a fresh Debian machine (with the expectation we'd only used Windows or Macs before), through "git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git" (not on GitHub at the time), to finding the function called by "man 2 fork" and navigating the source tree / unmasking preprocessor stuff to show the actual implementation. Yes, you can easily make a wrong turn if you're trying to do that on your own. But the actual linux/kernel directory, which has most of the parts you'd want, isn't that much larger, and a lot of the difference is modern requirements like power saving and security.


> Why can't distros be a package of microkernel + drivers?

Performance. Basically, the required number of context switches (since all drivers run in user mode) impacts performance and cache use in a negative way compared to monolithic kernels. Whether this is still a significant issue with modern designs, I don't know, but that was the argument back in the day (i.e. the early '90s when Linux came about).


Because a monolith is easier to get working on a quick timeline, and it will generally outperform code that makes compromises for the sake of its developers' sanity. At least as long as you can keep the insane developers from eating each other's eyes.


I feel obligated to point out that there is a notable lack of eyeless developers in the Linux kernel developer community, even after many, many years of development.


Because nobody is willing to pay a lot extra for software of high quality.


It's not even guaranteed to _be_ of high quality, is the thing. The same arguments play out all over CS: monolith vs microservices, distributed vs. centralized. Pick a poison and do it well - there's no 100% argument that says one side is absolutely better than the other here.


The other comments here have already mentioned the drawbacks of a microkernel, but I do wish Linux was more modular --- for example, like Windows where drivers are separate loadable modules, and Microsoft sure doesn't maintain the majority of them either. The Linux/Unix approach of having one huge kernel binary just doesn't seem all that efficient, especially if it contains drivers which would never be used.


The Linux kernel does use loadable modules, and indeed that is how most drivers are used. My relatively boring laptop setup is using 219 modules at the moment.


Most drivers in Linux are loadable modules; see the output of `lsmod`.

They are built (usually) at the same time as the kernel, yes, due to the lack of ABI stability guarantees, but most drivers you don't use won't take up any RAM.
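
For a sense of how small the per-driver plumbing is, this is roughly the minimal loadable module (the classic hello-world skeleton; it needs a one-line Makefile containing `obj-m += hello.o`, the headers for your running kernel, and `insmod`/`rmmod` to load and unload):

    /* hello.c -- build with: make -C /lib/modules/$(uname -r)/build M=$PWD modules */
    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/module.h>

    MODULE_LICENSE("GPL");
    MODULE_DESCRIPTION("Minimal loadable module skeleton");

    static int __init hello_init(void)
    {
            printk(KERN_INFO "hello: loaded\n");
            return 0;   /* nonzero here would abort the load */
    }

    static void __exit hello_exit(void)
    {
            printk(KERN_INFO "hello: unloaded\n");
    }

    module_init(hello_init);
    module_exit(hello_exit);

Everything a real driver adds on top of this skeleton (device registration, probe/remove callbacks, and so on) lives behind the same init/exit pair, which is why unused drivers cost you nothing at runtime.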



