Decoded: GNU coreutils (2019)

ufo · on March 10, 2021

The biggest takeaway for me is that I learned about the existence of some utilities that I had never known were there, especially "factor" and "tsort".

vram22 · on March 10, 2021

There are many cool / useful less-known utilities in GNU / Linux.

Check man7.org for good, though brief, info on many of them.

I had explored many of them a while ago.

The site is maintained by Michael Kerrisk, author of The Linux Programming Interface, a kind of reference bible for Linux APIs and system calls.

Edit: many of which are used in making such utilities.

MaxBarraclough · on March 10, 2021

Don't forget recutils.

https://www.gnu.org/software/recutils/manual/A-Little-Exampl...

https://en.wikipedia.org/wiki/Recfiles

vram22 · on March 11, 2021

Right, good point. It has also been discussed on HN a few times:

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

rustyminnow · on March 10, 2021

Here's the list of coreutils pages for anybody else interested: https://man7.org/linux/man-pages/dir_by_project.html#coreuti...

rwmj · on March 10, 2021

There's also "moreutils"[1] which is a set of useful additional tools. "errno" is indispensable if you're a Linux programmer.

[1] https://joeyh.name/code/moreutils/

cillian64 · on March 11, 2021

I use sponge from moreutils all the time. You know how you always want to ‘sort <file >file’ or something but that doesn’t work because file gets truncated before it gets read? With sponge you can do ‘sort <file | sponge file’
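For anyone wondering why the buffering matters, here is a minimal C sketch of the idea (not how moreutils actually implements sponge; `soak_and_write` is a made-up name): read all of the input into memory first, and only then open the output file, so the truncation happens after the last byte has been read.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, for illustration only: soak up everything from `in`,
   then write it to `path`. Returns 0 on success, -1 on any failure. */
int soak_and_write(FILE *in, const char *path) {
    char *data = NULL;
    size_t used = 0, cap = 0;
    FILE *out = NULL;
    int rc = -1;

    for (;;) {
        if (used == cap) {                      /* grow the buffer when full */
            size_t new_cap = cap ? cap * 2 : 4096;
            char *grown = realloc(data, new_cap);
            if (grown == NULL)
                goto done;
            data = grown;
            cap = new_cap;
        }
        size_t n = fread(data + used, 1, cap - used, in);
        used += n;
        if (n == 0)                             /* end of input or read error */
            break;
    }
    if (ferror(in))
        goto done;

    out = fopen(path, "wb");                    /* the file is truncated only now */
    if (out == NULL)
        goto done;
    if (fwrite(data, 1, used, out) != used) {
        fclose(out);
        goto done;
    }
    if (fclose(out) != 0)
        goto done;
    rc = 0;

done:
    free(data);
    return rc;
}
```

Because the whole input is held in memory, this sketch is only reasonable for inputs that fit in RAM.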

rank0 · on March 10, 2021

That's pretty neat! But I feel some of those tools can just be implemented with coreutils, pipes, and redirects...

jasomill · on March 11, 2021

While not as fancy as errno(1),

    perl -e '$! = shift and print "$!\n"' ERRNO

works to decode (nonzero) ERRNO if Perl is installed (per strerror(3) in the C runtime library linked to Perl, so YMMV on non-POSIXish platforms).
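The same lookup done directly in C is roughly the following. This is only a sketch: `describe_errno` is a made-up helper name, and the message text comes from whatever C library you link against.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format "<code>: <message>" into buf.
   Returns what snprintf returns (the length the full text needs). */
int describe_errno(int code, char *buf, size_t len) {
    /* strerror(3) is the same lookup the Perl one-liner relies on. */
    return snprintf(buf, len, "%d: %s", code, strerror(code));
}
```

Calling `describe_errno(ENOENT, buf, sizeof buf)` fills `buf` with the numeric value of `ENOENT` followed by the platform's message for it.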

Speaking of portability, Microsoft's err.exe[1] is a conceptually similar tool that references a considerably larger collection of Windows error messages (user and kernel mode Win32 error codes, COM HRESULTs, etc) and is therefore far more useful on Windows platforms than anything naïvely implemented in terms of C runtime errno (e.g., my stupid Perl one-liner).

gautamcgoel · on March 10, 2021

I'm blown away by this project. What a great way to learn about the coreutils, and also see how C is written in the real world! I'm curious how the author made the diagrams explaining each utility - did he use Inkscape?

gbajson · on March 10, 2021

I have just spent 5 minutes trying to find any useful use cases for 'vdir'. Does anyone have any idea why another 'ls' was created?

JonathonW · on March 10, 2021

Someone years ago wanted a shortcut for `ls -lb`, didn't want to change aliases on every machine they use, and had access to the coreutils source?

In GNU coreutils, it's not really "another" ls; it's the same source code with a preprocessor macro set differently (ls, dir, and vdir all share the same source).
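A generic sketch of that technique, with made-up names (`DEFAULT_MODE`, `mode_defaults`, and so on are assumptions for illustration, not the identifiers or file layout coreutils actually uses): one source file, and each binary is compiled with a different default selected at build time. The defaults mirror the documented equivalences, `dir` being like `ls -C -b` and `vdir` like `ls -l -b`.

```c
/* Listing personalities; hypothetical names, not taken from coreutils. */
enum list_mode { MODE_LS, MODE_DIR, MODE_VDIR };

/* Assumed build setup: the same file is compiled three times, e.g. with
   -DDEFAULT_MODE=MODE_VDIR for the vdir binary. */
#ifndef DEFAULT_MODE
#define DEFAULT_MODE MODE_LS
#endif

struct defaults {
    int long_format;   /* behaves like -l */
    int columns;       /* behaves like -C */
    int escape;        /* behaves like -b */
};

/* Defaults for a given personality. */
struct defaults mode_defaults(enum list_mode mode) {
    struct defaults d = {0, 0, 0};
    switch (mode) {
    case MODE_DIR:                 /* dir  ~ ls -C -b */
        d.columns = 1;
        d.escape = 1;
        break;
    case MODE_VDIR:                /* vdir ~ ls -l -b */
        d.long_format = 1;
        d.escape = 1;
        break;
    case MODE_LS:                  /* plain ls: left to run-time logic in this sketch */
        break;
    }
    return d;
}

/* Defaults for whichever binary this translation unit was built as. */
struct defaults program_defaults(void) {
    return mode_defaults(DEFAULT_MODE);
}
```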

tyingq · on March 10, 2021

Really lovely work. I'm curious if the png flowcharts are generated from data, or hand drawn.

Edit: Also, some easter egg looking thing at the bottom right of the page:

  ##*#**##****#*#**/\##*###*****#**#*#*#**#******#**#*#*####*#*##*

Edit: Fixed asterisks, I think.

jedimastert · on March 10, 2021

  ##*#**##****#*#**/\##*###*****#**#*#*#**#******#**#*#*####*#*##*

HN's formatter is having a time with that many asterisks

I don't recognize the format, can someone help me out?

tyingq · on March 10, 2021

It's 64 characters. I'd guess binary if that /\ wasn't in the middle.

bluesign · on March 10, 2021

seems like morse code

First part is maizure

Ps: hn messed up with stars :)

tyingq · on March 10, 2021

Ah, good call. HN ate your asterisks, but yeah, the first bit before the / is "maizure" in morse. Though having no spaces between the letters makes for some ambiguity...hard to decode the rest.

  ## *# ** ##** **# *#* * == maizure

I can find some long words with a dictionary approach in there, like:

  ARRIVE -> .-.-..-......-.
  CLEVER -> -.-..-......-..-.
  DESTINY -> -......-..-.-.--
  FENCED -> ..-..-.-.-..-..
  MEMBER -> --.---.....-.
  MISTER -> --.....-..-.
  (etc)

But there are too many variations in that direction too.
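To get a feel for how bad the ambiguity is, here is a small C sketch that counts the ways an unspaced string can be split into Morse letters. Assumptions: only the letters A to Z of International Morse are considered (no digits or punctuation), `#` or `-` is a dash and `*` or `.` is a dot as in the string above, and `count_decodings` is a made-up name.

```c
#include <stddef.h>
#include <string.h>

/* International Morse code for A..Z. */
static const char *const MORSE[26] = {
    ".-",   "-...", "-.-.", "-..",  ".",    "..-.", "--.",  "....", "..",
    ".---", "-.-",  ".-..", "--",   "-.",   "---",  ".--.", "--.-", ".-.",
    "...",  "-",    "..-",  "...-", ".--",  "-..-", "-.--", "--.."
};

/* Map the thread's notation onto dots and dashes; 0 for anything else. */
static char normalize(char c) {
    if (c == '*' || c == '.') return '.';
    if (c == '#' || c == '-') return '-';
    return 0;
}

/* Number of ways to split s into a sequence of letters A..Z.
   ways[i % 5] holds the count for the first i symbols; five slots are
   enough because no letter code is longer than four symbols.
   The count can overflow for very long inputs. */
unsigned long long count_decodings(const char *s) {
    size_t n = strlen(s);
    unsigned long long ways[5] = {1, 0, 0, 0, 0};

    for (size_t i = 1; i <= n; i++) {
        unsigned long long total = 0;
        for (int k = 0; k < 26; k++) {
            size_t len = strlen(MORSE[k]);
            if (len > i)
                continue;
            size_t start = i - len;
            size_t j;
            for (j = 0; j < len; j++)
                if (normalize(s[start + j]) != MORSE[k][j])
                    break;
            if (j == len)              /* letter k fits at the end of the prefix */
                total += ways[start % 5];
        }
        ways[i % 5] = total;
    }
    return ways[n % 5];
}
```

Even `count_decodings("**")` is already 2 (EE or I), and the count grows quickly with length, which is why a dictionary alone doesn't pin the message down.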

JdeBP · on March 10, 2021

You have certainly got further than people did in https://news.ycombinator.com/item?id=17116855 .

tyingq · on March 10, 2021

update...pretty sure the ending is "MEMBER.FSF.ORG"

Maybe an email address?

JdeBP · on March 11, 2021

Now read http://maizure.org/projects/faq.html . (-:

chias · on March 11, 2021

Nicely done!

bluesign · on March 10, 2021

Yeah millions of combinations, I tried but was not patient enough.

for the curious: https://www.jbowman.com/remorse/

bogomipz · on March 11, 2021

What a fantastic post! If the author is reading - please post more of these Linux "decoded" series. You have a great visual style and the content is wonderfully succinct. I highly recommend reading the "Decoded: The 'top' utility (procps)" post as well:

http://maizure.org/projects/decoded-top-procps/index.html

ojnabieoot · on March 10, 2021

Very nice work and much easier than trawling through the repository.

Some ignorant and probably cliched musing: when I look at small utilities like these I am always struck by a seeming distinction between best practices for little C programs versus best practices for large C applications (the author of the post touches on this as well).

In particular, the explicit flow (including goto) and “pedantic” style is actually quite appropriate for something < 1000 lines and where the expected behavior is extremely well understood. In cases like pwd, mkdir, etc, trying to abstract too much is arguably a mistake for maintainability and understanding.

I say all this as an immutable functional-first dev who hasn’t done much native code :) And I think the various type-safe / memory-safe / etc versions of these tools are worth developing. But there’s something to be said about well-optimized native code that clearly “does what it says on the box” in a way that’s accessible to anyone who understands basic Linux programming - even if they can only contextually read C code.

(My only real gripe is typographic / linting related, mostly due to being a whippersnapper).

kiwidrew · on March 10, 2021

This is in keeping with the style of the original Unix utilities.

Having a handful of global variables reduces the amount of stuff being passed around from function to function; utilities don't need to worry too much about free()ing dynamic allocations, since that gets cleaned up on exit anyways; none of the code has to be re-entrant, because each invocation of the utility is running in its own process.

setpatchaddress · on March 10, 2021

Could not disagree more about goto. Small programs always turn into larger ones. And if you're not using practices appropriate for larger programs from the beginning, what you have at the end is spaghetti code.

I'm not criticizing it in context -- a lot of this code dates back to the mid 80's if I'm not mistaken. But always write new code using scalable idioms.

monocasa · on March 10, 2021

I'd like to see the use of goto.

There are two 'allowed' uses in C that are common and represent good code even today: goto error cleanup stubs, and goto in virtual machine dispatch loops.

The size of the codebase doesn't really matter for those cases; they're largely considered the idiomatic way to go about the problems they're trying to solve.
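For readers who haven't seen the error-cleanup idiom, a minimal sketch follows. `count_bytes` is a made-up example function, not code from coreutils or the kernel. Each failure jumps to the label that releases exactly what has been acquired so far, and the success path falls through the same labels.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical example: count the bytes in a file.
   Returns the size, or -1 on any failure. */
long count_bytes(const char *path) {
    long result = -1;
    long total = 0;
    FILE *fp = NULL;
    char *buf = NULL;
    size_t n;

    fp = fopen(path, "rb");
    if (fp == NULL)
        goto out;                  /* nothing acquired yet */

    buf = malloc(4096);
    if (buf == NULL)
        goto close_file;           /* only the file needs releasing */

    while ((n = fread(buf, 1, 4096, fp)) > 0)
        total += (long)n;
    if (ferror(fp))
        goto free_buf;             /* release both, result stays -1 */

    result = total;

free_buf:
    free(buf);
close_file:
    fclose(fp);
out:
    return result;
}
```

The cleanup code exists once, in reverse order of acquisition, instead of being repeated in every error branch.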

not2b · on March 10, 2021

The error cleanup role is handled in a number of other languages (Ada, VHDL, Perl) by letting the programmer name a block and having a statement that terminates that block or (for a loop) goes to the next iteration, even if this terminates multiple loops. The effect is similar to the C goto way of doing that, but it's more controlled and easy for compilers to deal with.

monocasa · on March 10, 2021

Oh, for sure, other languages have different idiomatic constructs that don't require such a heavy hammer as goto to achieve a similar effect.

Even in C, if you're writing Microsoft-only code, SEH is probably a better mechanism than goto error.

I'd argue that the defer statement in Go (and the surprising side effects of it, like that it's function-scoped rather than block-scoped as you might otherwise expect) ultimately comes from trying to wrap this idiom in a construct that's better supported by the language.

My point though is that in relatively standard, portable C, there are valid, idiomatic use cases of goto, and it's not quite so easy to say 'eww goto' in those very specific circumstances.

robocat · on March 10, 2021

I skimmed some Linux code the other day and noticed that goto is used for more than those two situations. Maybe just cruft...?

Search for retry: or handle_itb: in https://github.com/torvalds/linux/blob/master/fs/ext4/resize...

Or fixleft: or copy: in https://github.com/torvalds/linux/blob/d158fc7f36a25e19791d2...

monocasa · on March 10, 2021

Yeah, the retry piece is a bit more controversial. Some people think that it's cleaner for code that's probably already nesting loops, but I tend to break it apart in different ways. That one I generally don't push too hard in review, but require more tests to shore up confidence.

And frankly, the fixleft style code you see just isn't modern idiomatic C, IMO. In a code review I'd have them either write a block comment explaining why it needs to be weird, plus a lot of test cases for when someone inevitably tries to rewrite it, or just rewrite it in the first place.

Some of the areas of the Linux kernel aren't exactly known for being the best written C (unfortunate as that is) and you're seeing some of that.

overboard2 · on March 10, 2021

If this program has remained small for 40 years, then maybe not all small programs turn into larger ones.

ojnabieoot · on March 10, 2021

I agree with you in general. But I think in this specific case it’s a bit more complicated: the downsides aren’t as bad as they normally would be, and the use of primitive flow constructs arguably has an advantage in this domain:

POSIX and similarly stuffy requirements (even if “soft”) mean that this code is fairly static. While there is some bloat in the pragmas, etc., these applications are necessarily slow to change and I think it’s reasonable to say that they won’t suffer from feature bloat anytime soon. So the normal software risk considerations are a bit different here. Further, any changes to the code will be fiercely reviewed, and the individual programs are small enough that increases in complexity will be quickly spotted. Relatedly, these programs are small enough that, if a refactor to more structured code were necessary, the work would be quite feasible. So while the risks of goto are real in any C program, in practice I think they’re quite minimal here.

And I do think you’re missing an advantage. These are core userspace functions that perform safety- and security-critical kernel interactions. So I definitively agree there is a strong argument to use safe code, modern abstractions, and so on. This is especially true for modern PCs that really can afford to spend a few extra cycles creating a folder.

But a modern code construct, correctly applied, is only as safe as the compiler. This is not guaranteed! A common “gotcha” with buggy C compilers is inappropriately pruning instructions because the compiler optimizes away a loop or else statement. It is hardly a frequent issue but similar bugs have shown up in recent gcc/clang releases. And in particular core developers who are working on operating systems are more likely to be using shaky C compilers.

Using gotos and ugly global state has the distinct advantage that generated assembly tends to have fewer “surprises.” If there is a bug in the compiler it will be less well-hidden; if there is a bug in the program then there is less mental work between analyzing the C and analyzing the disassembly for debugging.

Again, in general I think you’re correct and that my argument is ultimately more of a judgment call.

EDIT: I didn’t really want to address any structural advantages of goto for, e.g. exception handling via breaking loops earlier, etc. I am not a domain expert enough to comment appropriately but it does seem there are cases where properly abstracted cleanup code in C is more spaghettified than a goto: https://lkml.org/lkml/2003/1/12/203

not2b · on March 10, 2021

If the flow graph doesn't have a clean nested structure, this impedes compiler optimization. It can be possible to normalize it, but this may require the compiler to clone the code. Compilers are pretty good these days; if you've experienced a C compiler "inappropriately" optimizing something away the most likely cause these days is not a compiler bug, but a software developer who doesn't understand rules related to aliasing or undefined behavior.

I do agree that the specific use of goto to jump cleanly out of several loops is appropriate: the problem is that C lacks clean constructs for exiting named blocks. That would be preferable to general goto and doesn't harm optimization: the flow graph is still easy to analyze, convert to SSA form and the like.
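A minimal sketch of that multi-loop exit in C (`find_cell` is a made-up example): since `break` only leaves the innermost loop, a single forward goto stands in for the named-block exit that C lacks.

```c
#include <stddef.h>

/* Hypothetical example: find the first cell equal to target in a
   rows x cols grid stored row by row. Returns 1 and sets *row and *col
   when found, 0 otherwise. */
int find_cell(const int *grid, size_t rows, size_t cols, int target,
              size_t *row, size_t *col) {
    size_t r, c;
    int found = 0;

    for (r = 0; r < rows; r++) {
        for (c = 0; c < cols; c++) {
            if (grid[r * cols + c] == target) {
                found = 1;
                goto done;         /* leaves both loops at once */
            }
        }
    }

done:
    if (found) {
        *row = r;
        *col = c;
    }
    return found;
}
```

The jump only goes forward to a single label just past the loops, so the control flow stays as simple as a labeled break would be.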

dang · on March 10, 2021

Discussed at the time:

Decoded: GNU Coreutils - https://news.ycombinator.com/item?id=20328650 - July 2019 (55 comments)

mraza007 · on March 10, 2021

OMG, I’m so surprised. Yesterday I was going to post a question on HN asking how I could learn about GNU Coreutils, and today I wake up and see this.

What a coincidence!!!

Truly an amazing resource on GNU coreutils

psychoslave · on March 10, 2021

"How is GNU `yes` so fast?" was already discussed on this topic: https://news.ycombinator.com/item?id=14542938

mshockwave · on March 10, 2021

Very interesting way to visualize some of the most important cornerstones in *nix systems