Hacking ls -l (lemis.com)
199 points by drp4929 on Oct 13, 2012 | 88 comments



Guys! The point of this article is not to prescribe the only method of displaying human-readable file sizes. Obviously one could use `ls -lh`; the author clearly demonstrates that he is willing and able to read man pages to find answers.

Rather, this is a pretty interesting look into what it actually entails to make what ought to be a very simple and straightforward change.

It turns out that these simple changes are hard! Not just in identifying the piece of code to modify, but that man pages are often incomplete or unclear. It also illustrates the complexities behind making software portable - in this case, using the nation-neutral place separator. It also reminds us that solving what is on the surface a simple problem lets one uncover all sorts of interesting and messy details underneath - including more problems to solve!

These are steps that he'd have to take no matter what the code or feature. This article is not "complexity for complexity's sake", it's illustrating the complexity of making changes to any piece of code - and that it is surprisingly difficult for something that one would think is very easy!


On the other hand, you could consider it a cautionary tale about not reinventing wheels, because a problem that may seem trivial at first often turns out to be far more complex than expected.

This is why you try to re-use work when possible, rather than endlessly reinventing things, because while sure, adding a comma to the printf string is easy enough, your assumptions (English locale, compiler not trying to be clever) are going to quickly become visible as things fall apart because your assumptions aren't in line with the system's assumptions.

What this story really demonstrates is that without a clear understanding of how a system is designed and the basic assumptions it makes, just "hacking on the code" is just as likely to break things as it is to fix them.


You assume that systems tend to be well-designed and the basic assumptions it makes are justifiable, and that the systems faithfully implement that design and those assumptions.

It sounds like the author found a bunch of bugs in the process of making a simple code change. That happens pretty frequently, and doesn't mean that the author should let the priests of the cathedral deal with this UNIX thing that is too complicated for the laity to hack on. It just means there is no priesthood.


If your definition of "breaking" things is "they don't work on the first try", everything I've ever tried to do was broken.


It is not very easy because it is an unusual request.

But I still wonder if this is easier than ls -l | sed -e :a -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/;ta' ??

The investigations would be interesting if they were more complete, i.e. if the actual result was a change in the locale which could be applicable to other tools printing numbers besides ls (in the author's TODO).

I mean, will it work with bc?

At the moment it's not better than a shell script alias piping the output to sed, but it is more complex - you have to recompile a binary for every OS you use.


When I use your sed transformation on one of my directories, I see entries like:

    drwxr-xr-x  1,653 dalke  admin       56,202 Mar  5  2,012 pubchem
    -rw-r--r--     1 dalke  staff       59,252 Nov 16  2,011 pubchem_10,000.fps.0.9.cluster
I don't want to see the year written "2,012", and the file name is "pubchem_10000", not "pubchem_10,000".


It's a 10-second trick that did the job and did not require recompiling.

EDIT: seeing how it has been downvoted, IMHO hacking is all about time and effectiveness. If you believe fixing ls print formats is such a crucial problem that it requires more than seconds of your time, we have different values.

Feel free to support the argument by showing your skills and improving the example.


You are entirely missing the point of this whole thing.

It is exactly about showing how hard it is to get a trivial change right in every detail. Your 10 second hack is what is wrong with 10 second hacks in general and even with most real solutions that are not carefully thought out.

It's not only that the devil is in the details; it is all details. And you need to get all of them right, not just the current subset of the problem that you happen to be working on.


While I think that is the point, I don't know if it was the author's originally intended point. You see, 'hacking it' would imply just getting it to work. I expected to see something, well, hackish, like replacing the format specifier with %s and then wrapping the number in a call to a function that reads an environment variable to figure out how to print the number. That would be a hack; what Greg is doing is engineering a change to ls(1) which allows for pretty-printing the numbers. So to my way of thinking the title is wrong: it sets the expectation of a hack and leads to an engineering exercise.


I think that's reflective of how experience levels change your perspective on how to approach a problem.

Someone with command line experience that is mostly a user of system utilities would think passing the output of ls through a filter is the way to go whereas someone with C experience that already has insight in how Unix is architected would do something more along the lines of what the author is doing. That's pretty much how unix got built to begin with, and probably both parties would qualify their solution as a 'hack'.

It's all perspective.


"Hacker" has many meanings. A quick read of the Wikipedia entry for the term shows you that. RFC 1392 defines a hacker as "A person who delights in having an intimate understanding of the internal workings of a system, computers and computer networks in particular." The only way to tell which definition is in use is via context, and for this case the meaning is unambiguously aligned with the RFC 1392 definition.

(Note that this RFC comes from the days when 'hacker' was becoming a widespread term for someone who breaks into computer systems; the RFC attempts to distinguish between a 'hacker' and a 'cracker.')


The easiest and naive way to fix that is to buffer the whole thing. Don't use this to do `ls-comma -lR ~`, but only on small directory hierarchies.

https://github.com/samsonjs/bin/blob/master/ls-comma

It's a disgusting hack and it works very well. I didn't find anything useful in the man page so I wrote this instead. Looks like there is something in the man page but I think my hack was probably faster.


this should fix that. ls -l | sed -e :a -e 's/\s\(.*[0-9]\)\([0-9]\{3\}\) /\1,\2 /;ta'

correction: sorry it will only fix the modifications to the filename - the year is still broken.


The following modifies only the file size:

  ls -l | perl -pe 'while(s/^((\S+\s+){4})(\d+)(\d{3})([^\d].*)?$/$1$3,$4$5/){}'


Yes. So long as you only want to modify ls -l. And don't care about column alignment. The original author wants output to look like:

    -rw-r--r--  1 root  wheel  16,596,907,252 24 Dec  2009 boskoop.disk0.bz2 
    -rw-r--r--  1 grog  wheel   4,173,914,809 20 Jul  2006 boskopp.tar.gz 
With your perl one-liner on my directory I get mis-aligned columns:

    -rw-r--r--     1 dalke  staff  3,236,397,056 Sep 13  2011 pubchem.fps
    -rw-r--r--     1 dalke  staff   712,181,172 Sep 13  2011 pubchem.fps.gz
That's ugly. It should be:

    -rw-r--r--     1 dalke  staff  3,236,397,056 Sep 13  2011 pubchem.fps
    -rw-r--r--     1 dalke  staff    712,181,172 Sep 13  2011 pubchem.fps.gz
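
One way to get that alignment from a filter is to buffer the whole listing and pad each size by however many separators the widest entry gained. A rough sketch, not a robust tool: `ls_commas` is a made-up name, it assumes the size is always field 5 (so no -i, -s, -g or -o), it hard-codes "," rather than asking the locale, and it holds all of the output in memory.

```sh
# Hypothetical filter: group the digits of field 5 of `ls -l` output and
# keep that column right-aligned. Lines without a numeric field 5
# (for example "total 8") pass through unchanged.
ls_commas() {
    awk '
    # Insert a comma before every trailing group of three digits.
    function group(n,    out) {
        out = ""
        while (length(n) > 3) {
            out = "," substr(n, length(n) - 2) out
            n = substr(n, 1, length(n) - 3)
        }
        return n out
    }
    BEGIN {
        # Matches everything from the start of the line to the end of field 5.
        five = "^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+"
    }
    {
        head[NR] = $0; size[NR] = ""; tail[NR] = ""; extra[NR] = 0
        if (NF >= 5 && $5 ~ /^[0-9]+$/ && match($0, five)) {
            head[NR] = substr($0, 1, RLENGTH - length($5))   # text before the size
            size[NR] = group($5)
            tail[NR] = substr($0, RLENGTH + 1)               # date and file name, untouched
            extra[NR] = length(size[NR]) - length($5)        # separators added
            if (extra[NR] > widest) widest = extra[NR]
        }
    }
    END {
        for (i = 1; i <= NR; i++) {
            pad = ""
            if (size[i] != "")
                for (j = extra[i]; j < widest; j++) pad = pad " "
            print head[i] pad size[i] tail[i]
        }
    }'
}
```

Used as `ls -l | ls_commas`. Since only field 5 is rewritten, the year and any digits in the file name are left alone.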


Your solution is the better one, but that does not take anything away from memset's argument, though... This is not about the solution. This is about the problem of having all these yaks to shave just to add a comma to some program's output.


> It turns out that these simple changes are hard! Not just in identifying the piece of code to modify, but that man pages are often incomplete or unclear.

This is a great shame. I like OpenBSD's approach to man pages - incorrect documentation is a bug and can be as severe as a bug in code; correct documentation is important.

Fixing up man pages is something that non-technical volunteers could help with, except when it's hard to grok what the code actually does vs what it should do.


> Fixing up man pages is something that non-technical volunteers could help with

Another problem with that is that most tools used in this process are made with technical users in mind. A lot of people can expand documentation, but sending man page patches to a bug tracker is a technical step.


If one was so non-technical as to be incapable of using diff and a bug tracker, perhaps they shouldn't be writing manpages.


Sure, they probably shouldn't be creating man pages. But they could be great at copy editing and polishing existing man pages.

Or they could be expert translators.

So it's a shame that man pages are not as good as they could be.

There are other forms of documentation, but it'd be nice if man pages were the best they could possibly be.


I enable this for GNU ls like:

    alias ls="BLOCK_SIZE=\'1 ls --color=auto"
The above is a bit hacky and not very UNIXy as it's lumping more logic into ls, rather than splitting out into functional units.

Number formatting being a very common requirement, I've proposed a design for a new numfmt GNU coreutil

http://lists.gnu.org/archive/html/coreutils/2012-02/msg00085...

which would be used like:

    ls -l | numfmt --field=5 --format=%'d


I really don't understand why block size is 512 by default. It should really be 1 by default.

Except for someone with an ancient hard disk who thinks in blocks instead of (mega, giga, etc...)bytes, who ever needs or wants that?


512 byte blocks for ls is specified by POSIX. That would probably be a painful change now.


Why not

    alias ls="ls --block-size=\'1 --color=auto"


Mr. Lehey managed to improve the system in such a way that it will make subsequent changes easier for him and others, regardless of whether the specific change to `ls` is ever adopted. It's “five whys” applied to “why is this hard” and “how can I make it easier”. It's more effort, with a greater chance that much of it will survive the current context and requirements.

Some people improve the area they travel through, others leave debris, and many are noops who make no difference to those who come after. If there aren't enough entropy fighters like Mr. Lehey working a system, it turns to kipple.


In case anyone wonders, I recently looked into how locales work with respect to LANG, LC_ALL and LC_*: http://c.i3wm.org/6799926
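
The short version of the precedence, as a sketch: LC_ALL wins if it is non-empty, then the category's own variable, then LANG. The helper name `effective_locale` is made up, and the final fallback to "C" is an assumption; POSIX only says the default is implementation-defined.

```sh
# Hypothetical helper: print the locale name that applies to one category,
# following the order LC_ALL > LC_<category> > LANG > default.
# Empty variables count as unset.
effective_locale() {
    category=$1                          # e.g. LC_NUMERIC
    case $category in
        ''|*[!A-Z_]*) echo "bad category: $category" >&2; return 2 ;;
    esac
    eval "specific=\${$category:-}"      # value of the category's own variable
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$specific" ]; then
        printf '%s\n' "$specific"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf '%s\n' C                  # assumed default, see note above
    fi
}
```

For the thread's topic the interesting call is `effective_locale LC_NUMERIC`, since that category decides the thousands separator. It only reports which name would be used, not whether that locale is actually installed.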

By the way, by looking at http://www.lemis.com/grog/index.php you can see that the author uses FreeBSD, just in case you were wondering about /usr/src


Most annoying is that gcc warns about perfectly valid and logical code. That causes people to ignore warnings, and before you know it, you have a piece of software that has more warnings than lines of code.

Alternatively, when you cleverly figure out how to work around the warning, like the author does, you now prevent that rule from triggering even when it's right. Clearly a better unit test is needed.


The printf ' format specifier is not Standard C. It's in neither the C99 Standard nor the new C11 Standard. So it's not actually valid, and it's a coincidence if it happens to work with your Standard Library.

Consider that the compiler generating that warning knows only Standard C, and in fact you could be pairing it with any C library, including those that are strictly conforming and don't support the ' extension.


You're arguing that the GNU C Library is incompatible with the GNU Compiler Collection?

(In the example, he's compiling ls with the -std=gnu99 flag, which means he's targeting GNU, not C99 or C90 or C11 or POSIX or any other standard.)


The printf(3) manpage now says: "Note that many versions of gcc(1) cannot parse this option and will issue a warning."

FWIW, GCC 4.7.2 parses the format string without warnings.


> Alternatively, when you cleverly figure out how to work around the warning,

Or just read the docs (it should be "%'*jd "). Then no warnings. (IIRC ' is in C99 and -std=gnu99 targets c99 + gnu extensions.)

The same story with the rest. Two ways of doing things — learn & think and just do it right or twiddle until it seems to likely maybe work (possibly). The article is about the latter. Plus "blame the compiler".


That warning seems like a nasty hack anyway, if the compiler can't inline a local when running safety checks.

It is super scary that the compiler appears to be using a different constant from printf for its format checker, that shows it probably isn't using a pattern supplied by printf.


Yes to both points. I haven't read the source code, but this feels like a case of code-oriented-programming instead of data-oriented-programming. In other words, they write printf twice: once in the C library, and once in the warning system. A more careful programmer might write both in terms of a ruleset that's declared a single time.

(Or they did that and it's just a bug somewhere.)


Compiler and standard library are two separate codebases, in fact gcc gets routinely used with standard libraries other than GNU's. Reimplementing printf parsing probably is the cleanest solution.


No reason that glibc can't include a validate_format_string routine to be run at compile-time by gcc. There are already so many conditional compilation sections in both codebases that another #ifdef GNU in gcc isn't going to hurt anyone :)


Incidentally, I completed ls's set of -a-z options recently.

http://joeyh.name/~joey/blog/entry/ls:_the_missing_options/

(Well, actually, I never got around to writing -z, but it's clear what it should do, and any ls hackers are encouraged to finish that up.)


Your link is broken, FYI



While I appreciate the story, what's wrong with `ls -lh`?


For me, -h makes it more difficult to quickly compare the sizes of files in one list by glance. This is something that I have to do often enough that it has prevented me from adding -h to my ls alias. I'd have to use it for a bit to be sure, but the post's suggestion seems a pretty good 'best of both worlds' solution to me.


This.

I run into the situation more often with 'du' (when trying to find which subdirectory tree has excess junk in it), to the point that, while 'du -h' is human readable, it's not particularly sortable so:

    du -hs $( du -s * | sort -k1nr,1 -k2 | head )
.... which will return the human-readable output, based on numerically sorting the full numeric output. Eyeball comparisons are easier as you're aware that results are already sorted by size.


A reasonably recent sort from GNU coreutils has a -h (and equivalent long option --human-numeric-sort) which properly sorts the output of du -h, meaning you can do:

du -h | sort -h

And get properly size-sorted output.


TIL! Thanks.


I guess people can be very different in this respect. I really need the full number (or at least all of the numbers in the same unit) to be displayed. I don't find the 'human' format helpful at all when looking at ls output.


It just depends what info you desire. I typically use "ls -lh" to quickly find large files to clean up to recover disk space.


I use "du <directory> -h -d 1|sort -h" for this because large files may be nested in directories and ls doesn't display the size of content directories (as far as I know). The output is sorted. Note that the "-d" flag for "du" doesn't work with all versions, but there were similar flags on all systems.


I don't usually have access to a version of "sort" that supports the "-h" flag.
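
A possible workaround on such systems is to sort on plain kilobyte counts first and make the numbers human-readable afterwards. A minimal sketch: `humanize_kb` is a made-up name, and it assumes input lines of the form size-in-KB, a tab, then the name, which is what `du -sk` prints.

```sh
# Hypothetical filter: turn "KB<TAB>name" lines into "1.5M<TAB>name" style.
# Run it after `sort -n`, so the sort still sees plain numbers.
humanize_kb() {
    # LC_ALL=C keeps the decimal point a "." whatever the user's locale is.
    LC_ALL=C awk -F '\t' '
    BEGIN { split("K M G T P", units, " ") }
    {
        size = $1; unit = 1
        while (size >= 1024 && unit < 5) { size /= 1024; unit++ }
        # Everything after the first tab is the name, printed unchanged.
        printf "%.1f%s\t%s\n", size, units[unit], substr($0, length($1) + 2)
    }'
}
```

Something like `du -sk -- * | sort -n | humanize_kb` then gives size-sorted, readable output without needing `sort -h`.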


My thoughts exactly. This is just complexity for complexity's sake. Useful as an exercise, but the -h flag already does this in an even more readable manner.


This works with sorting, and it's easier to pick big files out at a glance. I would use this as often as -h


`ls -lhS` works just fine to sort with human-readable sizes, at least on Fedora.


You could also use 'ls -lh | sort -h'.


I believe that's only supported in newer versions of sort.


gobble.wa@gmail.com made a similar post to the freebsd-questions mailing list a month ago. In his case the question was how to print an md5sum along with the file names in a given directory. I saved it because I thought it was a clever hack.

http://lists.freebsd.org/pipermail/freebsd-questions/2012-Se...

A lot of times I catch myself in the mindset of taking a step back and saying "here are the set of tools I have at hand to accomplish a task" without realizing that I should simultaneously be taking a step "in"--so to speak--and acknowledging that the tools I have to work with are not immutable tools cast of iron; they are malleable and can be re-tooled to suit my purposes.. and that sometimes going that route can be the simplest--and in fact "best"--solution.


Here's a wrapper I wrote for ls a while ago that allows you to spell "--color" as "--colour":

http://ubuntuforums.org/showthread.php?t=684239


Or, you could use 'ls -h'...

(that said, I do see the utility, since it gives a more obvious visual cue as to the order of size differences... but if you're doing anything with the sizes programmatically, you have to remove the commas afterwards... Short version: if you're going to do this, make it a unique flag, or a new flag modifier to the -l flag... don't overload the -l flag without recourse...)


Ideally a parser should respect locale, and use a sane format (not commas) for multiple numbers in a list.

Even better if the _ separator used by programming languages were a supported locale LC=C_FOR_HUMANS :-)


FYI, there is no need to change GNU ls to get that behavior. You can make it use your locale's separator with either the --block-size="'1" option or by setting the LS_BLOCK_SIZE envvar to that same string:

    $ LC_ALL=en_US.UTF8 ls -og --block-size="'1" .
    -rw-------. 1 5,145,416 Oct  5 16:44 A
    -rw-------. 1 5,137,692 Oct  4 14:37 B
    -rw-------. 1 5,147,168 Oct  8 07:52 C
This feature is documented in the "Block size" section of the coreutils manual: i.e., you can type this to see it:

    info coreutils 'block size'


Now let's consider software lifecycle in a large context: longevity of forks.

If he doesn't send the changes off to upstream, and make a case good enough for them to be approved, then all this dooms him to maintaining his fork on all the platforms where he wants it until he gets sick of it or convinces someone else to do it for him.


FYI, Greg Lehey is a longtime FreeBSD committer.


Man, such fragile stuff. Why not code a function yourself that turns a number into a string representing it decimally with the commas every three digits. I normally like and use good library functions and standards, but if they're that fragile and depend on your environment then no thanks.


Because that would not be correct in Germany or other locales which use . for digit group separation and , for the decimal separator.
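
To be fair, the grouping itself is only a few lines; what the code can't know is which separator to use. A sketch with the separator (and the group width) passed in explicitly: `group_digits` is a made-up name, and it assumes a non-negative integer with no sign or decimal part.

```sh
# Hypothetical helper: group_digits NUMBER [SEPARATOR] [GROUP_WIDTH]
# SEPARATOR defaults to "," and GROUP_WIDTH to 3; neither comes from the locale.
group_digits() {
    number=$1
    sep=${2-,}
    width=${3:-3}
    printf '%s\n' "$number" | awk -v sep="$sep" -v width="$width" '{
        n = $0; out = ""
        # Peel groups off the right-hand end until one short group is left.
        while (length(n) > width) {
            out = sep substr(n, length(n) - width + 1) out
            n = substr(n, 1, length(n) - width)
        }
        print n out
    }'
}
```

`group_digits 16596907252` prints 16,596,907,252 and `group_digits 16596907252 .` prints 16.596.907.252. The caller has to pick between them, which is exactly the decision the locale machinery exists to make.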


not everyone uses commas to separate their number groupings. his solution will work for any locale.


Yes, in my country we use spaces to separate thousands and "," is used for what the decimal point does in the US. In my handwriting I use that notation.

However, from a computer, I (and I'm certain I'm not the only one) actually expect it to output the US notation.

I'd much rather have a computer always output the same format (and that happens to be the US format), than try to be smart with locales, when the end result is that some things will do this, others that. Makes stuff harder to use, and when programming, harder to parse.

I've once had to touch Excel on a Windows machine configured for a non-US language, and it refused to import a CSV file that had commas, even though CSV means comma separated values. It required semicolons due to the locale settings of Windows. This stuff should not happen. A CSV is meant for computers, and to be interchangeable, not to use different types of commas and refuse to work with other types depending on user locale settings...

Of course, when publishing or printing, that's a whole different matter, and there it better get the locale of your country perfect. But this here was about output in the console, which is often meant as input of other scripts etc...


For such large numbers, would it make more sense to use groups of 6 instead of 3? This would allow you to easily identify the megabyte position with the next separator at the terabyte position.


Man, this sounds like every change I try to make to "legacy" code. There's so much debt and smell. I find it very, very hard to leave alone.


Surely the appropriate option character for this new, human-readable output is "-h".

Makes you wonder whether anyone ever considered the problem before...


Human readable form in bytes? I thought if you had a file greater than a megabyte then it shows in MB? What if you want to read it in bytes with the thousands separator?


I always use one hack for ls: alias lsd="ls -ltrF | grep ^d"

This way, I quickly run lsd to only look for directories.


That's funny, I also have an alias for lsd which does this. But you can do it without grep: lsd='ls -d */'


Yours doesn't show dot directories, and the parent's does.


But lsd makes you see things that aren't even there.


-h


Not for files greater than a MB.


Mountain, meet molehill.


Is this front-page-worthy material? I mean, it's good you took the time to add a ', but doing the same with sed would have been faster.

Alternatively, do you know about ls -lhrS? It will print size in human formats and reverse sort the files by size - ie the bigger will be at the end of the list


Yes, it is front-page worthy, because the details give us insight into how one might make any sort of change to a codebase like this. "If you want to be a hacker, these are specific steps I took to hack something up." In a domain that most of us are not very familiar with!


More to the point, it also shows that an "easy" change isn't so easy. In the course of solving this "easy" change, the author added six new items to his personal to-do list.

As edw519 has said, a one line code change takes six days to implement.


Editing ls/print.c sourcecode and recompiling is not hacking in my definition.

I usually consider the portability of the solution. I have linux i386 and x64 machines, my arm n900, an osx laptop, etc.

Recompiling ls (or, heavens forbid, cross compiling!) for each machine may be a bright idea.

Adding a line to your profile that will take advantage of the existing tools like sed is closer to hacking in my definition, because it tries to think about the bigger problem - but still I wouldn't dare calling the following "hacking":

echo "alias lll=\"ls -l | sed -e :a -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/;ta'\"">> ~/.bashrc


Sure, this is fair. I tend to use "hacking" to mean "tinkering", and it could be said that I have a fairly loose usage of the word.

To me, this article gets to the crux of what I find particularly delightful about hacking (tinkering?): unraveling layers of complexity underneath. I feel like I have a little better understanding of what's happening when I punch in `ls`, and I think that particular delight and knowledge is the kind of thing that appeals to tinkerers (hackers? :p) like myself. So - in my view - entirely appropriate for this crowd!

But you have a fair point; there's nothing particularly out-of-the-ordinary about this code or process, and in that sense it isn't newsworthy to hackers.

(I did not downvote you incidentally; I think it's interesting to get a sense of peoples' different thresholds for what constitutes "hacker." I play the saxophone, and an instructor once told me that people always came up to him and said "I want to be a musician. How can I do that?" Well it turns out that the moment you play "hot cross buns" on your instrument, you are indeed a musician. Perhaps not a skilled one, but you have in fact made music. I think of hacking in a similar way, and freely admit that it is a loose use of the word!)


I can totally agree with calling that tinkering - and also that it is interesting, if only (due to my diverging opinion on the merits of the approach) as a warning tale about how far one should go to try and fix a problem.

But, just like you, I consider that not newsworthy to hackers, yet at the moment it is the #1 item on HN and it kinda makes me sad especially because of the threshold - the idea that some people do consider that hacking - here of all places - is chilling :-/

Worse - #2 item is "more people should write". I beg to differ- more people should code, so that fixing a printf and recompiling ls wouldn't be newsworthy.

I'm sorry if it was interpreted as being rude - that was not the point - I just wanted to present alternative approaches to the problem, because reconsidering the problem is sometimes the right thing to do, especially when it escalates quickly in complexity.


I think you're complaining too much about this momentarily being the top story on HN: (it feels to me as if) usually the top story is gossip about how much money some startup raised or an Apple vs Android piece. I find this to be an improvement.


In what way is this not hacking?


It's definitely under my definition of 'hacking.' Consider that the GNU tools already implement non-portable extensions; this would be yet another one, were it added. Also, portability is not an essential requirement for "hacking." I do small programming exercises sometimes in order to understand a facet of how things work. Consider this as an exercise in how to use locale-dependent specifiers.

As I pointed out elsewhere, your sed code doesn't work because it changes too many numbers - including filenames and dates - in the output. Also, if the exercise is to understand localization then your sed code isn't appropriate because it hard-codes "," when some locales use a "." as the thousands separator.

And for completeness, your alias can't then mix and match other flags, like "ls -lRt". It's a single command with strange side-effects if you use it incorrectly:

    % lll -art
    sed: illegal option -- r
    usage: sed script [-Ealn] [-i extension] [file ...]
           sed [-Ealn] [-i extension] [-e script] ... [-f script_file] ... [file ...]


Hmmm - the sed code was written to give a working 1 line example with the answer - just an example, or else some people might consider that it could be high-end wizardry worthy of recompilation.

Instead of complaining about an obvious flaw in the example, maybe you could be more constructive and fix it.

Hints:

- if you want to pass other flags, make it a function and use $@

- if you want to respect the filenames and dates, either fix the regex or write it in perl.
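
For the first hint, the shape is roughly this. It is only a sketch of the `$@` part: it reuses the same sed expression unchanged, so the year and file name problem from the second hint is still there.

```sh
# Sketch of the first hint: a function instead of an alias, so extra flags
# and paths go to ls rather than to sed.
lll() {
    ls -l "$@" | sed -e :a -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/;ta'
}
```

With that, `lll -art` hands `-art` to ls instead of failing inside sed.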

Shouldn't take you long - back from 1999 in perl FAQ:

http://www.perlmonks.org/?node_id=653

But now it qualifies as hacking. I guess that's due to inflation.

This is so not hacker news.

Ask a perl golfer to make you a one-liner you can copy-paste in your profile if you absolutely need some working code.


You did not give a `working 1 line example with the answer.' It doesn't do the right thing, and I pointed out two failure cases.

My constructive criticism is that your approach is wrong, should not be done, and cannot be easily fixed. One should never attempt to process the general output of ls. It's doable - I lived through the years of processing the "list"/"ls" output from random ftp servers - but it's nasty. Sure, you can add '$@' but then you have to worry about, say, "-i", which shows the inode number as a new leading column or "-n" which shows user/group ids instead of names. How does your alias/perl script figure out which column is the one which needs the commas?

You'll either end up with a very fragile system (producing the erroneous output as your 1-liner does) or you'll end up trying to understand most of the ls command-line arguments and/or heuristics to guess based on the output. The well-known "BUGS" section of the Unix man page says "To maintain backward compatibility, the relationships between the many options are quite complex." You're in for a long slog if you go this route.

Yes, if you want a one-off solution for a specific set of outputs then your approach would work. That would also be boring and trivial. The linked-to article, on the other hand, was interesting.


Anyone who has ever had to work on a large system will instantly recognize what he's done here, and admire his approach, and light a candle in support of ever getting this patched all the way through the various broken projects. Remember, he's got to get a patch approved for gcc to get this to work, and only once that is done, in production, and generally available will he be able to submit a patch for ls. This is the reality of disparate projects with many maintainers and schedules. This may seem to be a trivial example, but it is entirely indicative of the process.


Last sentence in the post says it all. Hacking is more than just coding and the steps he followed are illustrative in general.


Yes, because it's interesting, and because for the OP in particular they've spent a little bit of effort up front to save themselves time in the long run, and learn something too.



