Guys! The point of this article is not to prescribe the only method of displaying human-readable file sizes. Obviously one could use `ls -lh`; the author clearly demonstrates that he is willing and able to read man pages to find answers.
Rather, this is a pretty interesting look into what it actually entails to make what ought to be a very simple and straightforward change.
It turns out that these simple changes are hard! Not just in identifying the piece of code to modify, but that man pages are often incomplete or unclear. It also illustrates the complexities behind making software portable - in this case, using the nation-neutral place separator. It also reminds us that solving what is on the surface a simple problem lets one uncover all sorts of interesting and messy details underneath - including more problems to solve!
These are steps that he'd have to take no matter what the code or feature. This article is not "complexity for complexity's sake", it's illustrating the complexity of making changes to any piece of code - and that it is surprisingly difficult for something that one would think is very easy!
On the other hand, you could consider it a cautionary tale about not reinventing wheels, because a problem that may seem trivial at first often turns out to be far more complex than expected.
This is why you try to re-use work when possible, rather than endlessly reinventing things, because while sure, adding a comma to the printf string is easy enough, your assumptions (English locale, compiler not trying to be clever) are going to quickly become visible as things fall apart because your assumptions aren't in line with the system's assumptions.
What this story really demonstrates is that without a clear understanding of how a system is designed and the basic assumptions it makes, just "hacking on the code" is just as likely to break things as it is to fix them.
You assume that systems tend to be well-designed, that the basic assumptions they make are justifiable, and that the systems faithfully implement that design and those assumptions.
It sounds like the author found a bunch of bugs in the process of making a simple code change. That happens pretty frequently, and doesn't mean that the author should let the priests of the cathedral deal with this UNIX thing that is too complicated for the laity to hack on. It just means there is no priesthood.
It is not very easy because it is an unusual request.
But I still wonder if this is easier than ls -l | sed -e :a -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/;ta' ??
The investigation would be more interesting if it were more complete, i.e. if the actual result were a change in the locale handling which could be applicable to other tools that print numbers besides ls (it is in the author's TODO).
I mean, will it work with bc?
At the moment it's no better than a shell alias piping the output to sed, but it is more complex - you have to recompile a binary for every OS you use.
It's a 10 seconds trick that did the job and did not require recompiling.
EDIT: seeing how it has been downvoted: IMHO hacking is all about time and effectiveness. If you believe fixing ls print formats is such a crucial problem that it requires more than seconds of your time, we have different values.
Feel free to support the argument by showing your skills and improving the example.
You are entirely missing the point of this whole thing.
It is exactly about showing how hard it is to get a trivial change right in every detail. Your 10 second hack is what is wrong with 10 second hacks in general and even with most real solutions that are not carefully thought out.
It's not only that the devil is in the details it is all details. And you need to get all of them right, not just the current subset of the problem that you happen to be working on.
While I think that is the point, I don't know if it was the author's originally intended point. You see, 'hacking it' would imply just getting it to work; I expected to see something, well, hackish, like replacing the format specifier with %s and wrapping the number in a call to something which reads an environment variable to figure out how to print the number. That would be a hack. What Greg is doing is engineering a change to ls(1) which allows for pretty-printing the numbers. So to my way of thinking the title is wrong: it sets the expectation of a hack and leads to an engineering exercise.
I think that's reflective of how experience levels change your perspective on how to approach a problem.
Someone with command line experience that is mostly a user of system utilities would think passing the output of ls through a filter is the way to go whereas someone with C experience that already has insight in how Unix is architected would do something more along the lines of what the author is doing. That's pretty much how unix got built to begin with, and probably both parties would qualify their solution as a 'hack'.
"Hacker" has many meanings. A quick read of the Wikipedia entry for the term shows you that. RFC 1392 defines a hacker as "A person who delights in having an intimate understanding of the internal workings of a system, computers and computer networks in particular." The only way to tell which definition is in use is via context, and for this case the meaning is unambiguously aligned with the RFC 1392 definition.
(Note that this RFC comes from the days when 'hacker' was becoming a widespread term for someone who breaks into computer systems; the RFC attempts to distinguish between a 'hacker' and a 'cracker.')
It's a disgusting hack and it works very well. I didn't find anything useful in the man page so I wrote this instead. Looks like there is something in the man page but I think my hack was probably faster.
Your solution is the better one, but that does not take anything away from memset's argument, though... This is not about the solution. This is about the problem of having all these yaks to shave just to add a comma to some program's output.
> It turns out that these simple changes are hard! Not just in identifying the piece of code to modify, but that man pages are often incomplete or unclear.
This is a great shame. I like OpenBSD's approach to man pages - incorrect documentation is a bug and can be as severe as a bug in code; correct documentation is important.
Fixing up man pages is something that non-technical volunteers could help with, except when it's hard to grok what the code actually does vs what it should do.
> Fixing up man pages is something that non-technical volunteers could help with
Another problem with that is that most tools used in this process are made with technical users in mind. A lot of people can expand documentation, but sending manpages patches in a bug tracker is a technical step.
Mr. Lehey managed to improve the system in such a way that it will make subsequent changes easier for him and others, independently of whether the specific change to `ls` is ever adopted. It's “five whys” applied to “why is this hard” and “how can I make it easier”. It's more effort, but with a greater chance that much of it will survive the current context and requirements.
Some people improve the area they travel through, others leave debris, and many are noops who make no difference to those who come after. If there aren't enough entropy fighters like Mr. Lehey working on a system, it turns to kipple.
Most annoying is that gcc warns about perfectly valid and logical code. That causes people to ignore warnings, and before you know it, you have a piece of software that has more warnings than lines of code.
Alternatively, when you cleverly figure out how to work around the warning, like the author does, you now prevent that rule from triggering even when it's right. Clearly a better unit test is needed.
The printf ' format specifier is not Standard C. It's in neither the C99 Standard nor the new C11 Standard. So it's not actually valid, and it's a coincidence if it happens to work with your Standard Library.
Consider that the compiler generating that warning knows only Standard C, and in fact you could be pairing it with any C library, including those that are strictly conforming and don't support the ' extension.
> Alternatively, when you cleverly figure out how to work around the warning,
Or just read the docs (it should be "%'*jd "). Then no warnings. (IIRC ' is in C99 and -std=gnu99 targets c99 + gnu extensions.)
The same story with the rest. Two ways of doing things — learn & think and just do it right or twiddle until it seems to likely maybe work (possibly). The article is about the latter. Plus "blame the compiler".
That warning seems like a nasty hack anyway, if the compiler can't inline a local when running safety checks.
It is super scary that the compiler appears to be using a different constant from printf for its format checker; that shows it probably isn't using a pattern supplied by printf.
Yes to both points. I haven't read the source code, but this feels like a case of code-oriented-programming instead of data-oriented-programming. In other words, they write printf twice: once in the C library, and once in the warning system. A more careful programmer might write both in terms of a ruleset that's declared a single time.
Compiler and standard library are two separate codebases, in fact gcc gets routinely used with standard libraries other than GNU's. Reimplementing printf parsing probably is the cleanest solution.
No reason that glibc can't include a validate_format_string routine to be run at compile-time by gcc. There are already so many conditional compilation sections in both codebases that another #ifdef GNU in gcc isn't going to hurt anyone :)
For me, -h makes it more difficult to quickly compare the sizes of files in one listing at a glance.
This is something that I have to do often enough that it has prevented me from adding -h to my ls alias.
I'd have to use it for a bit to be sure, but the post's suggestion seems a pretty good 'best of both worlds' solution to me.
I run into the situation more often with 'du' (when trying to find which subdirectory tree has excess junk in it), to the point that, while 'du -h' is human readable, it's not particularly sortable so:
du -hs $( du -s * | sort -k1nr,1 -k2 | head )
.... which will return the human-readable output, based on numerically sorting the full numeric output. Eyeball comparisons are easier as you're aware that results are already sorted by size.
A reasonably recent sort from GNU coreutils has a -h (and equivalent long option --human-numeric-sort) which properly sorts the output of du -h, meaning you can do:
du -sh * | sort -h
I guess people can be very different in this respect. I really need the full number (or at least all of the numbers in the same unit) to be displayed. I don't find the 'human' format helpful at all when looking at ls output.
I use "du <directory> -h -d 1|sort -h" for this because large files may be nested in directories and ls doesn't display the size of a directory's contents (as far as I know). The output is sorted. Note that the "-d" flag for "du" doesn't work with all versions, but there are similar flags on all systems.
My thoughts exactly. This is just complexity for complexity's sake. Useful as an exercise, but the -h flag already does this in an even more readable manner.
gobble.wa@gmail.com made a similar post to the freebsd-questions mailing list a month ago. In his case the question was how to print an md5sum along with the file names in a given directory. I saved it because I thought it was a clever hack.
A lot of times I catch myself in the mindset of taking a step back and saying "here are the set of tools I have at hand to accomplish a task" without realizing that I should simultaneously be taking a step "in"--so to speak--and acknowledging that the tools I have to work with are not immutable tools cast of iron; they are malleable and can be re-tooled to suit my purposes.. and that sometimes going that route can be the simplest--and in fact "best"--solution.
(that said, I do see the utility, since it gives a more obvious visual cue as to the order of size differences... but if you're doing anything with the sizes programmatically, you have to remove the commas afterwards... Short version: if you're going to do this, make it a unique flag, or a new flag modifier to the -l flag... don't overload the -l flag without recourse...)
FYI, there is no need to change GNU ls to get that behavior. You can make it use your locale's separator with either the --block-size="'1" option or by setting the LS_BLOCK_SIZE envvar to that same string:
$ LC_ALL=en_US.UTF8 ls -og --block-size="'1" .
-rw-------. 1 5,145,416 Oct 5 16:44 A
-rw-------. 1 5,137,692 Oct 4 14:37 B
-rw-------. 1 5,147,168 Oct 8 07:52 C
This feature is documented in the "Block size" section of the coreutils manual; i.e., you can type this to see it:
info coreutils "Block size"
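If you want that behavior by default, a sketch (assuming GNU coreutils and an installed en_US locale; the alias name is made up) might look like:

```shell
# Ask GNU ls for locale-based digit grouping in long listings.
# The leading "'" in the block-size string is what requests grouping.
export LS_BLOCK_SIZE="'1"
alias lsn='LC_ALL=en_US.UTF8 ls -og'
```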
Now let's consider software lifecycle in a large context: longevity of forks.
If he doesn't send the changes off to upstream, and make a case good enough for them to be approved, then all this dooms him to maintaining his fork on all the platforms where he wants it until he gets sick of it or convinces someone else to do it for him.
Man, such fragile stuff. Why not code a function yourself that turns a number into a string representing it decimally with the commas every three digits. I normally like and use good library functions and standards, but if they're that fragile and depend on your environment then no thanks.
Yes, in my country we use spaces to separate thousands and "," is used for what the decimal point does in the US. In my handwriting I use that notation.
However, from a computer, I (and I'm certain I'm not the only one) actually expect it to output the US notation.
I'd much rather have a computer always output the same format (and that happens to be the US format), than try to be smart with locales, when the end result is that some things will do this, others that. Makes stuff harder to use, and when programming, harder to parse.
I've once had to touch Excel on a Windows machine configured for a non-US language, and it refused to import a CSV file that had commas, even though CSV means comma separated values. It required semicolons due to the locale settings of Windows. This stuff should not happen. A CSV is meant for computers, and to be interchangeable, not to use different types of commas and refuse to work with other types depending on user locale settings...
Of course, when publishing or printing, that's a whole different matter, and there it better get the locale of your country perfect. But this here was about output in the console, which is often meant as input of other scripts etc...
For such large numbers, would it make more sense to use groups of 6 instead of 3? This would allow you to easily identify the megabyte position with the next separator at the terabyte position.
Human-readable form in bytes? I thought if you had a file greater than a megabyte then it shows in MB? What if you want to read it in bytes with the thousands separator?
Is this front-page worthy material? I mean, it's good you took the time to add a ', but doing the same with sed would have been faster.
Alternatively, do you know about ls -lhrS? It will print sizes in human format and reverse-sort the files by size - i.e. the biggest will be at the end of the list
Yes, it is front-page worthy, because the details give us insight into how one might make any sort of change to a codebase like this. "If you want to be a hacker, these are specific steps I took to hack something up." In a domain that most of us are not very familiar with!
More to the point, it also shows that an "easy" change isn't so easy. In the course of solving this "easy" change, the author added six new items to his personal to-do list.
As edw519 has said, a one line code change takes six days to implement.
Editing ls/print.c sourcecode and recompiling is not hacking in my definition.
I usually consider the portability of the solution. I have Linux i386 and x64 machines, my ARM n900, an OS X laptop, etc.
Recompiling ls (or, heavens forbid, cross compiling!) for each machine may be a bright idea.
Adding a line to your profile that will take advantage of the existing tools like sed is closer to hacking in my definition, because it tries to think about the bigger problem - but still I wouldn't dare calling the following "hacking":
Sure, this is fair. I tend to use "hacking" to mean "tinkering", and it could be said that I have a fairly loose usage of the word.
To me, this article gets to the crux of what I find particularly delightful about hacking (tinkering?): unraveling layers of complexity underneath. I feel like I have a little better understanding of what's happening when I punch in `ls`, and I think that particular delight and knowledge is the kind of thing that appeals to tinkerers (hackers? :p) like myself. So - in my view - entirely appropriate for this crowd!
But you have a fair point; there's nothing particularly out-of-the-ordinary of this code or process, and in that sense, isn't newsworthy to hackers.
(I did not downvote you incidentally; I think it's interesting to get a sense of peoples' different thresholds for what constitutes "hacker." I play the saxophone, and an instructor once told me that people always came up to him and said "I want to be a musician. How can I do that?" Well it turns out that the moment you play "hot cross buns" on your instrument, you are indeed a musician. Perhaps not a skilled one, but you have in fact made music. I think of hacking in a similar way, and freely admit that it is a loose use of the word!)
I can totally agree with calling that tinkering - and also that it is interesting, if only (due to my diverging opinion on the merits of the approach) as a warning tale about how far one should go to try and fix a problem.
But, just like you, I consider that not newsworthy to hackers, yet at the moment it is the #1 item on HN and it kinda makes me sad especially because of the threshold - the idea that some people do consider that hacking - here of all places - is chilling :-/
Worse - #2 item is "more people should write". I beg to differ- more people should code, so that fixing a printf and recompiling ls wouldn't be newsworthy.
I'm sorry if it was interpreted as being rude - that was not the point - I just wanted to present alternative approaches to the problem, because reconsidering the problem is sometimes the right thing to do, especially when it escalates quickly in complexity.
I think you're complaining too much about this momentarily being the top story on HN: (it feels to me as if) usually the top story is gossip about how much money some startup raised or an Apple vs Android piece. I find this to be an improvement.
It fits under my definition of 'hacking.' Consider that the GNU tools already implement non-portable extensions; this would be yet another one, were it added. Also, portability is not an essential requirement for "hacking." I do small programming exercises sometimes in order to understand a facet of how things work. Consider this as an exercise in how to use locale-dependent specifiers.
As I pointed out elsewhere, your sed code doesn't work because it changes too many numbers - including filenames and dates - in the output. Also, if the exercise is to understand localization then your sed code isn't appropriate because it hard-codes "," when some locales use a "." as the thousands separator.
And for completeness, your alias can't then mix and match other flags, like "ls -lRt". It's a single command with strange side-effects if you use it incorrectly:
% lll -art
sed: illegal option -- r
usage: sed script [-Ealn] [-i extension] [file ...]
sed [-Ealn] [-i extension] [-e script] ... [-f script_file] ... [file ...]
Hmmm - the sed code was written to give a working 1-line example with the answer - just an example, or else some people might consider that it could be high-end wizardry worthy of recompilation.
Instead of complaining about an obvious flaw in the example, maybe you could be more constructive and fix it.
Hints:
- if you want to pass other flags, make it a function and use "$@"
- if you want to respect the filenames and dates, either fix the regex or write it in perl.
Shouldn't take you long - back from 1999 in perl FAQ:
You did not give a `working 1 line example with the answer.' It doesn't do the right thing, and I pointed out two failure cases.
My constructive criticism is that your approach is wrong, should not be done, and cannot be easily fixed. One should never attempt to process the general output of ls. It's doable - I lived through the years of processing the "list"/"ls" output from random ftp servers - but it's nasty. Sure, you can add '$@' but then you have to worry about, say, "-i", which shows the inode number as a new leading column or "-n" which shows user/group ids instead of names. How does your alias/perl script figure out which column is the one which needs the commas?
You'll either end up with a very fragile system (producing the erroneous output as your 1-liner does) or you'll end up trying to understand most of the ls command-line arguments and/or heuristics to guess based on the output. The well-known "BUGS" section of the Unix man page says "To maintain backward compatibility, the relationships between the many options are quite complex." You're in for a long slog if you go this route.
Yes, if you want a one-off solution for a specific set of outputs then your approach would work. That would also be boring and trivial. The linked-to article, on the other hand, was interesting.
Anyone who has ever had to work on a large system will instantly recognize what he's done here, and admire his approach, and light a candle in support of ever getting this patched all the way through the various broken projects. Remember, he's got to get a patch approved for gcc to get this to work, and only once that is done, in production, and generally available will he be able to submit a patch for ls. This is the reality of disparate projects with many maintainers and schedules. This may seem to be a trivial example, but it is entirely indicative of the process.
Yes, because it's interesting, and because for the OP in particular they've spent a little bit of effort up front to save themselves time in the long run, and learn something too.