Why not parse `ls` and what to do instead (unix.stackexchange.com)
170 points by nomilk 3 months ago | 205 comments



I think that when someone uses ls instead of a glob it means they most probably don't understand shell. I don't see any advantage to parsing ls output when globs are available. Shell is finicky enough without inviting more trouble. Same with word splitting: it's one of the reasons to use shell functions, because then you have "$@", which makes sense; any other way of doing it is something I can't comprehend.

Maybe I also don't understand shell, but as it was said before: when in doubt switch to a better defined language. Thank heavens for awk.
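For what it's worth, a minimal sketch of the glob-plus-"$@" pattern described above (process_images is just a placeholder name):

  process_images() {
    for f in "$@"; do
      printf 'handling %s\n' "$f"   # each filename stays one word, spaces and all
    done
  }
  process_images ./*.jpg   # the glob expands to a list of arguments, never a parsed string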


  > I think that when someone uses ls instead of a glob it means they most probably don't understand shell.
In 25 years of using Bash, I've picked up the knowledge that I shouldn't parse the output of ls. I suppose that it has something to do with spaces, newlines, and non-printing characters in file names. I really don't know.

But I do know that when I'm scripting, I'm generally wrapping what I do by hand, in a file. I'm codifying my decisions with ifs and such, but I'm using the same tools that I use by hand. And ls is the only tool that I use to list files by hand - so I find it natural that people would (naively) pick ls as the tool to do that in scripts.


What I don't understand is why, in the two most popular Unix flavors, we still haven't got something like JSON list output, or some other format that is parseable.

Is it really that difficult to add --json as a flag?


The soul of UNIX is to create a confused and in some cases unparseable text-based format from what was already structured data.


I think it boils down to the dependencies needed to parse the JSON, coupled with the fact that glob syntax already covers iterating over the files regardless of the characters used in the filename.

There are other tools than `ls` whose sole purpose is to list files; some have "improved" features compared to ls, etc.

Similarly, from above (and well said btw, even what is not quoted): > I do know that when I'm scripting, I'm generally wrapping what I do by hand, in a file. I'm codifying my decisions with ifs and such, but I'm using the same tools that I use by hand

A lot of us do similar and we know/expect the ins and outs, and all it takes to break our scripts is some edge case we never thought of. We are fortunate that our keyboard layouts are basically ASCII; other languages are less fortunate. Now introduce open source, community-driven software where an ls-escaped bash script deletes somebody's home directory (because ls was parsed and an edge case of some user's files caused some obscure "fun" times). An edge case is still painful.

And finally, sometimes it's better to elevate said bash script to Python (or awk), etc. It just depends on the situation and the level of complexity of the logic.


Globs don't help much when I want file attributes and sizes. Yeah I can pipe to something that can do it, but it would be nice to just get filenames, attrs, size, dates in a json array as output.

Look, ls is one of the most basic and natural Unix commands. Make it modern and useful.

Bash gibberish is fun for gatekeeping scripting neckbeards, but it's not what a proper OS should have.


The thing with the coreutils is they provide basic core functionality; you don't need bash on your system - `ls` is not bash (and then you still end up with busybox, where JSON still would not be part of ls). Add more utilities to your system to do more complex logic; I've used apps similar to this in the past: https://github.com/kellyjonbrazil/jc

There's also using zero-terminated lines in ls with `--zero`, then piping that to a number of apps which support the same (read, xargs, etc.)
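For example (a sketch assuming a GNU coreutils 9.x ls that has --zero):

  ls --zero | while IFS= read -r -d '' f; do
    printf 'got: %s\n' "$f"   # NUL delimiters survive newlines and spaces in names
  done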

Might also check out PowerShell on Linux, which may suit your needs; instead of string manipulation, everything is a class object.


It's not in any of the major distros, but shout out to csv-nix-tools for a valiant effort in this space: https://github.com/mslusarz/csv-nix-tools


exactly—well said


People new to *nix make the mistake of thinking this stuff is well designed, makes sense and that things work well together.

They learn. We all do.


Coincidentally, I discovered the Unix Haters Handbook today:

https://web.mit.edu/~simsong/www/ugh.pdf


"The Macintosh on which I type this has 64MB: Unix was not designed for the Mac. What kind of challenge is there when you have that much RAM?"

Love it.


I don't understand what they mean in that quote. Neither Unix nor the Mac were designed for that much RAM.


Judging from the context, the user interface was fine in the days of limited resources (a 16 kiloword PDP-11 was cited) but then modern computers have the resources for better user interfaces.

They clearly didn't realize that even more modern Unix kernels would require hundreds of megabytes just to boot.


What kernel takes 200 MB+ to boot?


OT ... I worked with Simson briefly ages ago. Smart dude. This book happened later and I've never seen it before. Small world I guess.


People new to *nix don't realize that it's a 55 year old design that keeps accumulating cruft.


Of course, but the same (with a bit lower number of years) can be said about Windows, or HTTP, or the web with its HTML+JS+CSS unholy trinity, or email, or anything old and important really. It's scary how much of our modern infrastructure hinges on hacks made tens of years ago.


One of the original demos showing off PowerShell was well structured output from its version of ls.

That was 17 years ago!


People new to the internet think alike. Still, not a day passes that we aren't once again reminded how fragile yet amazing all this information theory stuff is.


I went through a phase when I really enjoyed writing shell scripts like

  ls *.jpg | awk '{print "resize 200x200 " $1 " thumbnails/" $1}' | bash
because I never got to the point where I could remember the strange punctuation the shell requires for loops without looking up the bash info pages, whereas I've thoroughly internalized awk syntax.

Word is you should never write something like that because you'll never get the escaping right and somebody could craft inputs that would cause arbitrary code execution. I mean, they try to scare you into using xargs, but I find xargs so foreign I have to read the whole man page every time I want to do something with it.


Better is something like

  find . -maxdepth 1 -name "*.jpg" -exec resize 200x200 "{}" "thumbnails/{}" \;
which works for spaces and probably quotes in filenames; I am not sure about other special characters.


It's tough to be portable and have a one-liner. See https://stackoverflow.com/questions/45181115/portable-way-to...

I switched the command to a GraphicsMagick-based resize since that's the tool these days; default quality is 75% (for JPEG), but it's included as a commonly desired customization. ,, is from a different comment in this thread; it seems more self-documenting than the single , I'd traditionally use.

  find . -maxdepth 1 -name "*.jpg" -print0 |\
  xargs -0P $(nproc --all) -I,, gm convert -resize '200x200^>' -quality 75 ,, "thumbnails/,,"


I encourage you to give it a try again. Almost every use of xargs that I ever did looked like this:

ls *.jpg | xargs -i,, resize 200x200 ,, thumbnails/,,

I just always define the placeholder to ,, (you can pick something else but ,, is nice and unique) and write commands like you do.


I'm more likely to write that like:

  for i in *.jpg; resize 200x200 "$i" "thumbnails/$i"; end


Does that not fail when you hit the maximum command line length? Doesn't the entirety of the directory get splatted? Isn't this the whole reason xargs exists?


No, it does not fail. The maximum command line length exists in the operating system, not the shell; you can't launch a program with too many arguments, and you can't launch a program whose argument strings are too long.

But when you execute a for loop in bash/sh, the 'for' command is not a program that is launched; it's a keyword that's interpreted, and the glob is also interpreted.

Thus, no, that does not fail when you hit the maximum command line length (the POSIX minimum is 4096 bytes; most systems allow far more). It'll fail at other limits, but those limits exist in bash and are much larger. If you want to move to a stream-processing approach to avoid any limits, that is possible, though it's probably also a sign you should not be using the shell.


That's right. I tested this just now in a directory with 1,000,000 files:

  $ for i in *; do echo $i; done | wc -l
  1000000
I'm a little bummed that it failed in fish shell, but wouldn't begrudge the author if they replied "don't do that".


The for loop only runs resize once per file. So no, the entire directory does not get splatted. It is unlikely you'd hit maximum command length.

At least on mac, the max command length is 1048576 bytes, while the maximum path length in the home directory is 1024 bytes. There might be some unix variant where the max path length is close enough to the max command length to cause an overflow, but I doubt that is the case for common ones.

xargs exists in an attempt to be able to parse command output. You could for instance have awk output xargs formatted file names to build up a single command invocation from arbitrary records read by awk. Note that xargs still has to obey the command line length limit though, because the command line needs to get passed to the program. Thus, in a situation where this for loop overflows the command line, it would cause xargs to also fail. Thus I would always use globbing if I have the choice.

EDIT: If you mean that the directory is splatted in the for loop, then in a theoretical sense it is. However, since "for" is a shell builtin, it does not have to care about command line length limits to my knowledge.


Yes, this is an issue, absolutely.

I've seen some image directories with more than a million files in them.


This shouldn't overrun the command line length for resize, since resize only gets fed one filename at a time. I do think that the for loop would need to hold all the filenames in a naive shell implementation. (I would assume most shells are naive in this respect) The for loop's length limit is probably the amount of ram available though. I find it improbable that one could overflow ram with purely pathnames on a PC, since a million files times 100 chars per file is still less than a gig of ram. If that was an issue though, one would indeed have to use "find" with "-exec" instead to make sure that one was never holding all file names in memory at the same time.


Exactly, there are so many limits in the shell that I don’t want to be bothered to think about. When I get serious I just write Python.


I just use find. It's a little longer but gives me the full paths and is more consistent. It also works well if you need to recurse. Something like `find . -type f | while read -r filepath; do whatever "${filepath}"; done`


I love this example, because it highlights how absolutely cursed shell is if you ever want to do anything correctly or robustly.

In your example, newlines and spaces in your filenames will ruin things. Better is

    find … -print0 | while read -r -d $'\0'; do …; done
This works in most cases, but it can still run into problems. Let's say you want to modify a variable inside the loop (this is a toy example, please don't nit that there are easier ways of doing this specific task).

    declare -a list=()

    find … -print0 | while read -r -d $'\0' filename; do
        list+=("${filename}")
    done
The variable `list` isn't updated at the end of the loop, because the loop is done in a subshell and the subshell doesn't propagate its environment changes back into the outer shell. So we have to avoid the subshell by reading in from process substitution instead.

    declare -a list=()

    while read -r -d $'\0' filename; do
        list+=("${filename}")
    done < <(find … -print0)
Even this isn't perfect. If the command inside the process substitution exits with an error, that error will be swallowed and your script won't exit even with `set -o errexit` or `shopt -s inherit_errexit` (both of which you should always use). The script will continue on as if the command inside the subshell succeeded, just with no output. What you have to do is read it into a variable first, and then use that variable as standard input.

    files="$(find … -print0)"
    declare -a list=()

    while read -r -d $'\0' filename; do
        list+=("${filename}")
    done <<< "${files}"
I think there's an alternative to this that lets you keep the original pipe version when `shopt -s lastpipe` is set, but I couldn't get it to work with a little experimentation.

Also be aware that in all of these, standard input inside the loop is redirected. So if you want to prompt a user for input, you need to explicitly read from `/dev/tty`.

My point with all this isn't that you should use the above example every single time, but that all of the (mis)features of shell compose extremely badly. Even piping to a loop causes weird changes in the environment that you now have to work around with other approaches. I wouldn't be surprised if there's something still terribly broken about that last example.


You have really proven your point even more than you meant to. Unfortunately none of these examples are robust.

The "-r" flag allows backslash escaping record terminators. The "find" command doesn't do such escaping itself, so that flag will cause files with backslashes at the end to concatenate themselves with the next file.

Furthermore, if IFS='' is not placed before each instance of read, or set somewhere earlier in the program, then leading and trailing whitespace in a filename will be stripped.

EDIT: I proved your point even more. The "-r" flag does the opposite of what I thought it did, and disables record continuation. So the correct way to use read would be with IFS='' and the -r flag.
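Putting that correction together with the grandparent's process-substitution version, the loop would presumably look like this (same errexit caveats apply):

    declare -a list=()

    while IFS='' read -r -d $'\0' filename; do
        list+=("${filename}")
    done < <(find … -print0)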


Love it. And I wouldn’t be surprised in the least if even this fell apart in some scenarios too.


Wow, you people are really young.

http://www.etalabs.net/sh_tricks.html


Is there a reason to prefer `while read; ...;done` over find's -exec or piping into xargs?


Both `find -exec` and xargs expect an executable command whereas `while read; ...; done` executes inline shell code.

Of course you can pass `sh -c '...'` (or Bash or $SHELL) to `find -exec` or xargs but then you easily get into quoting hell for anything non-trivial, especially if you need to share state from the parent process to the (grand) child process.

You can actually get `find -exec` and xargs to execute a function defined in the parent shell script (the one that's running the `find -exec` or xargs child process) using `export -f` but to me this feels like a somewhat obscure use case versus just using an inline while loop.
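A sketch of that `export -f` route, for the curious (handle_one is a placeholder function):

  handle_one() { printf 'processing %s\n' "$1"; }
  export -f handle_one
  find . -type f -exec bash -c 'handle_one "$1"' _ {} \;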


I will sometimes use the "| while read" syntax with find. One reason for doing so is that the "-exec" option to find uses {} to represent the found path, and it can only be used ONCE. Sometimes I need to use the found path more than once in what I'm executing, and capturing it via a read into a reusable variable is the easiest option for that. I'd say I use "-exec" and "| while read" about equally, actually. And I admittedly almost NEVER use xargs.
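For example, something along these lines, where the path is needed twice (hypothetical backup task):

  find . -name '*.conf' -print0 | while IFS= read -r -d '' path; do
    cp -- "$path" "$path.bak" && printf 'backed up %s\n' "$path"
  done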


This will fail for files with newlines.


How common are they?


This whole post is about uncommon things that can break naive file parsing.


When you don't want to waste your time and sanity and happiness being in doubt and then throwing away all you've done and switching to a new language in mid stream, just don't even start using a terribly crippled shell scripting language in the first place, also and especially including awk.

The tired old "stick to bash because it's already installed everywhere" argument is just as weak and misleading and pernicious as the "stick to Internet Explorer because it's already installed everywhere" argument.

It's not like it isn't trivial to install Python on any system you'll encounter, unless you're programming an Analytical Engine or Jacquard Loom with punched cards.


In most places where I run shell scripts, there is no Python. There could be if I really wanted it but it's generally unnecessary waste.

On top of it, shell is better than Python for many things, not to mention faster.

It's also, as you mentioned, ubiquitous.

In the end, choose the tool that makes more sense. For me, a lot of the time, that's a shell script. Other times it may be Python, or Go, or Ruby, or any of the other tools in the box.


A waste of what, disk space? I'd much rather waste a few megabytes of disk space than hours or days of my time, which is much more precious. And what are you doing on those servers, anyway? Installing huge amounts of software, I bet. So install a little more!

For decades, on most Windows computers where I've run web browsers, there's always been Internet Explorer. So do you still always use IE because installing Chrome is "wasteful"? It's a hell of a lot bigger and more wasteful than Python. As I already said, that is a weak and misleading and pernicious argument.

So what exactly is bash better than Python at, besides just starting up, which only matters if you write millions of little bash and awk and sed and find and tr and jq and curl scripts that all call each other, because none of them are powerful or integrated enough to solve the problem on their own?

Bash forces you to represent everything as strings, parsing and serializing and re-parsing them again and again. Even something as simple as manipulating json requires forking off a ridiculous number of processes, and parsing and serializing the JSON again and again, instead of simply keeping and manipulating it as efficient native data structures.

It makes absolutely no sense to choose a tool that you know is going to hit the wall soon, so you have to throw out everything you've done and rewrite it in another language. And you don't seem to realize that when you're duct-taping together all these other half-assed languages with their quirky non-standard incompatible byzantine flourishes of command line parameters and weak antique domain specific languages, like find, awk, sed, jq, curl, etc, you're ping-ponging between many different inadequate half-assed languages, and paying the price for starting up and shutting down each of their interpreters many times over, and serializing and deserializing and escaping and unescaping their command line parameters, stdin, and stdout, which totally blows away bash's quick start-up advantage.

You're arguing for learning and cobbling together a dozen or so different half-assed languages and flimsy tools, none of which you can also use to do general purpose programming, user interfaces, machine learning, web servers and clients, etc.

Why learn the quirks and limitations of all those shitty complex tools, and pay the cognitive price and resource overhead of stringing them all together, when you can simply learn one tool that can do all of that much more efficiently in one process, without any quirks and limitations and duct tape, and is much easier to debug and maintain?


> For decades, on most Windows computers I run web browsers, there's always Internet Explorer. As I already said, that is a weak and misleading and pernicious argument.

On its own, I agree. But you glossed over everything else I said, so I'm not going to entertain your weak argument.

You seem to ignore that different users, different use cases, different environments, etc. all need to be taken into account when choosing a tool.

Like I said, for most of my use cases where I use shell scripting, it's the best tool for the job. If you don't believe me, or think you know better about my circumstances than I do, all the power to you.


> You seem to ignore that different users, different use cases, different environments, etc. all need to be taken into account when choosing a tool.

I have worked on projects that are extremely sensitive to extra dependencies and projects that aren't.

Sometimes I am in an underground bunker and each dependency goes through an 18 month Department of Defense vetting process, and "Just install python" is equivalent to "just don't do the project". Other times I have worked on projects where tech debt was an afterthought because we didn't know if the code would still be around in a week and re-writing was a real option, so bringing in a dependency for a single command was worthwhile if we could solve the problem now.

There is appetite for risk, desire for control, need for flexibility, and many other factors just as you stated that DonHopkins is ignoring or unaware of.


Plus jq and curl might not even be installed. And I never got warm with jq, so if I need to parse json from shell I reach for... python. Really.


Alternatively, maybe you can get warmer with JMESPath, which has jp as its command line interface https://github.com/jmespath/jp .

The good thing about the JMESPath syntax is that it is the standard one when processing JSON in software like Ansible, Grafana, perhaps some more.


I'm an avid jq user. There are certainly situations where it's better to use python because it's just more sane and easier to read/write, but jq does a few things extremely well, namely, compressing json, and converting json files consisting of big-ass arrays into line delimited json files.
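For reference, those two jobs are one-liners in jq (the file names here are just placeholders):

  jq -c . big.json        # compact/minify the whole document
  jq -c '.[]' array.json  # explode one big array into line-delimited JSON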


One advantage: `ls -i` gives you the file's inode in a POSIX portable way. If you glob and then look it up individually for each file, you'll need to be aware of which tool (and whether it's GNU or BSD in origin) you use on which platform.

In general yes globbing is better for iterating through files. But parsing `ls` doesn't necessarily mean the author doesn't know shell. It might mean they know it well enough to use the tools that are made available to them.
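A sketch of that portability point ($file is a placeholder):

  ls -i -- "$file"         # POSIX: prints the inode number, then the name
  stat -c '%i' -- "$file"  # GNU coreutils stat
  stat -f '%i' -- "$file"  # BSD/macOS stat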


Commands can have a maximum number of arguments. Try globbing on a directory with millions of files.


Usually the pattern is "for f in [glob]", which doesn't have that issue. Running "ls" on a directory is little more than "for f in *; do echo "$f"; done", so there's little advantage to using "ls".

Also: "find -exec {} \+" will take ARG_MAX into account, and may be much faster depending on what you're doing.


Sane people will just use find and/or xargs.


Weird thing to call sane when its the shell that is insane, or more likely an instrument of torture.


It's not the case very often these days, but it used to be quite simple to blow up your script by globbing in a directory with a lot of files, and you can still hit the limit if you pass a glob to some command, because it can blow up trying to execve(). Here's more detail on the issue and some workarounds: https://unix.stackexchange.com/questions/120642/what-defines...


Sometimes I want all filenames from a subdirectory, without the subdirectory name.

I can do (ignoring parsing issues):

    for name in $(cd subdir; ls); do echo "$name"; done
This isn't easy to do with globbing (as far as I know)


One alternative:

  for name in subdir/*; do basename "$name"; done


Also since subdir is hardcoded, you can reliably type it a second time to chop off however much of the start you want:

  for name in subdir/subsubdir/*; do
    echo "${name#subdir/}"  # subsubdir/foo
  done


Note this string replacement is not anchored (right?) which can end up biting you badly (depending on circumstances of course).


It's anchored on the left. ${name#subdir/} will turn 'subdir/abc' into 'abc', but will not touch foo/subdir/bar. I don't think bash even has syntax to replace in the middle of an expansion, I always pull out sed for that.


Thanks for clarifying, I learned something new today!

Edit: It turns out that Bash does substitutions in the middle of strings using the ${string/substring/replacement} and ${string//substring/replacement} syntax, for more details see https://tldp.org/LDP/abs/html/string-manipulation.html
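A quick illustration of both forms:

  $ x=foo/subdir/bar
  $ echo "${x/subdir/XX}"   # replace first match
  foo/XX/bar
  $ echo "${x//o/0}"        # replace all matches
  f00/subdir/bar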


This is really easy to do with a shell pattern.

  $ x=/some/really/long/path/to/my/file.txt
  $ echo "${x##*/}"
  file.txt


I'd really like it if the "find" command made this easier, so that if I write

    find some/dir/here -name '*.gz'
then I could get the filenames without the "some/dir/here" prefix.

It would also be nice if "find" (and "stat") could output the full info for a file in JSON format so I could use "jq" to filter and extract the needed info safely instead of having to split whitespace-separated columns.


Why would you do this work when stat (and GNU find) can `printf` the exact needed information without any parsing?
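A sketch of what that looks like (format strings shown for both GNU and BSD, since the flag spellings differ):

  # GNU find: size and name per entry, NUL-terminated so odd filenames survive
  find . -maxdepth 1 -type f -printf '%s %p\0'

  # GNU coreutils stat
  stat -c '%s %n' -- *.jpg
  # BSD/macOS stat
  stat -f '%z %N' -- *.jpg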


If I need filesize and filename then I still need to parse a filename that might contain all kinds of weird ascii control characters or weird unicode.

JSON makes that a lot less fragile.


I don't get it; I need a concrete example.


It's at least pretty easy to shorten the filenames:

  cd some/dir/here
  find . -name '*.gz'
  cd - # changes back to previous directory


What about:

  find . -name '*.hs' -exec basename {} \;


You could get mixed up here because find is recursive by default and basename won't show that files might be in different subdirectories.


If you are gonna do a subshell (cd subdir; ls) you can wrap the whole loop:

  (cd subdir
  for name in *; do
    echo "$name"
  done)
But I prefer:

  for name in subdir/*; do
    name="${name#*/}"
    echo "$name"
  done


What to do instead: Use Nushell.

I finally started really using my shell after switching to it. I casually write multiple scripts and small functions per day to automate my stuff. I'm writing scripts in nu that I'd otherwise write in Python. All because the data needs no parsing. I'm not even annotating my data with types, even though Nushell supports it, because it turns out structured data with inferred types is more than you need day-to-day. I'm not even talking about all the other nice features other shells simply don't have. See this custom command definition:

  # A greeting command that can greet the caller
  def greet [
    name: string      # The name of the person to greet
    --age (-a): int   # The age of the person
  ] {
    [$name $age]
  }
Here's the auto-generated output when you run `help greet`:

  A greeting command that can greet the caller

  Usage:
    > greet <name> {flags}

  Parameters:
    <name> The name of the person to greet

  Flags:
    -h, --help: Display this help message
    -a, --age <integer>: The age of the person
It's one of those pieces of software that only empowers you, immediately, without a single downside. Except the time spent learning it, but that was about a week for me. Bash or fish is still there if I ever need to paste some shell commands.


Parsing, or the lack thereof, is not the point. The point is that standard shells already provide all the tools you need for dealing with lists of files. Want to do something for every file? Write this:

    shopt -s nullglob
    for f in *; do
       …
    done
But never this:

    for f in $(ls); do
       …
    done
They look similar, but the latter runs ls to turn the list of files into a string, then has the shell parse the string back into a list. Even if the parsing was done correctly (and it isn’t), this is still extra work. Looping over the glob avoids the extra work.


I have to say this is very unintuitive. In Nushell, you'd do:

  ls | each { ... }
More examples I don't need to explain, which would be far harder in stringly typed shells:

  ls | where type == file and size <= 5MiB | sort-by size | reverse | first 10

  ps | where cpu > 10 and mem > 1GB | kill $in.pid
It's immediately obvious what you need to do when you can easily visualize your data:

  > ls
  ╭────┬───────────────────────┬──────┬───────────┬─────────────╮
  │ #  │         name          │ type │   size    │  modified   │
  ├────┼───────────────────────┼──────┼───────────┼─────────────┤
  │  0 │ 404.html              │ file │     429 B │ 3 days ago  │
  │  1 │ CONTRIBUTING.md       │ file │     955 B │ 8 mins ago  │
  │  2 │ Gemfile               │ file │   1.1 KiB │ 3 days ago  │
  │  3 │ Gemfile.lock          │ file │   6.9 KiB │ 3 days ago  │
  │  4 │ LICENSE               │ file │   1.1 KiB │ 3 days ago  │
  │  5 │ README.md             │ file │     213 B │ 3 days ago  │
  ...


I didn’t say that nushell is bad, I said that it’s not relevant to the discussion. nushell provides typed data in pipelines, which is cool. But standard shells already have typed data for this particular use case, thus parsing untyped data is unnecessary. Of course it would be nice if that typed data could be used in a pipeline, but everything had to start somewhere.


Who are you to decide what's relevant to the discussion? It's very clearly on topic. I had never heard of nushell and I'm glad it was mentioned


How do I replace:

    for f in $(cd subdir; ls); do
       ...
    done

?


Either

  for f in subdir/*; do
    ...
  done
or

  ( 
  cd subdir || exit 1
  for f in *; do
    ...
  done
  )
work fine. However, I must insist against using `for` loops in favor of `find`.


Posts like these are like the main character threads on twitter where someone says, "men don't do x" or "women aren't like y." It just feels like people outside of you who have no understanding of your context seem intent on making up rules for how you should code things.

Perhaps it would help to translate this into something more like, "what pitfalls do you run into if you parse `ls`" but it's hard to get past the initial language.


When we say "don't do X" we mean "the obvious way is wrong". If you have enough knowledge to ignore the advice, you likely are already aware of the problems with the obvious solution.

I'm pretty sure you can come up with scenarios where parsing the output of "ls" is indeed the simplest solution, but that kind of article is supposed to discourage people who don't know better from going "oh, I know, I'll just parse the output of ls". As a general advice, people should indeed be pointed towards "man find" or "man 3 opendir".


I think there's a middle point where you want to do something that's complex enough that a glob won't cut it but simple enough that switching languages is not worth it.

I think the example of "exclude these two types of files" is a good case. I often have to write stuff like `ls P* | grep -Ev "wav|draft"` which doesn't solve a problem I don't have (such as filenames with newlines in them) but does solve the one I do (keeping a subset of files that would be tricky to glob properly).

In my experience 95% of those scripts are going to be discarded in a week, and bringing Python into it means I need to deal with `os.path` and `subprocess.run`. My rule of thumb: if it's not going to be version controlled then Bash is fine.


You might enjoy a variety of `find` based commands, e.g. `find -maxdepth 1 -iregex ".*\.(wav|draft)" | xargs echo "found file:"`

This uses regex to match files ending in .wav or .draft (which is what I interpreted you to want). Xargs then processes the file. You could use flags to have xargs pass the file names in a specific place in the command, which can even be a one liner shell call or some script.

So the "find <regex> - xarg <command>" pattern is almost fully generally applicable to any problem where you want to execute a oneliner on a number of files with regular names. (I think gnu find has no extended regex, which is just as well- thats not a "regular expression" at that point)


> You might enjoy a variety of `find` based commands, e.g. `find -maxdepth 1 -iregex ".*\.(wav|draft)" | xargs echo "found file:"`

Find can even execute commands itself without using `xargs`:

     find -maxdepth 1 -iregex '.*\.\(wav\|draft\)' -exec echo "found file:" {} \;


Definitely do it this way if you want to stick to the pre-filtered version (I recommend the cousin comment, filter inside the loop). GP's version is buggy in the same way as the post misunderstands, particularly with files that somehow got newlines in the filename (xargs is newline-delimited by default).

If for some reason you do need the "find | xargs" combo (maybe for concurrency), you can get it to work with "find -print0" and "xargs -0". Nulls can't be in filenames so a null-delimited list should work.
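For example, keeping the parent's echo placeholder (a sketch; add -P for parallelism if that's the motivation):

  find . -maxdepth 1 -iname '*.wav' -print0 | xargs -0 -I{} echo "found file:" {}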


As an addendum, note that `-print0` and `-0` for find and xargs respectively are now in the latest POSIX standard, so their use is compliant.


The latest standard I know of is SuS 2018, which I have the docs for, and does not include either switch. I searched around a bit and it doesn't seem like there is a new one. Are you referring to some draft? I sure wish this was true.

That being said, I would interpret "-exec printf '%s\0' {} +" as a POSIX-compliant way for find to output null-delimited files. I say this since the docs for printf's octal escape allow zero digits. However, most POSIX tools operate on "text" input "files", which are defined as not containing null characters, so I don't think outputting nulls could easily be used in a POSIX-compliant way. In practice, I would expect many implementations to not handle nulls well, because C uses null to mean end of string, so lots of C library calls for dealing with strings will not correctly deal with null characters.


The 2024 standard is out, but behind a paywall; I don't know when they will update the Open Group website.


Geez. What's the point of writing standards if they won't let people read them (without having to pay for the privilege)?


>GP's version is buggy in the same way as the post misunderstands, particularly with files that somehow got newlines in the filename

I understand this caveat, but I never had a file with newline that I cared about. Everyone keeps repeating this gotcha but I literally don't care. When I do "ls | grep [.]png\$ | xargs -i,, rm ,," (yes, stupid example) there is 0% chance that a png file with a newline in the name found itself in my Downloads folder. Or my project's source code. Or my photo library. It just won't happen, and the bash oneliner only needs to run once. In my 20 years of using xargs I didn't have to use -0 even once.


See the other response I got, I misremembered (and waaay too late to edit) - it's whitespace, not newlines. I'm sure you've had files with spaces in the name.


>xargs is newline-delimited by default

Even worse, it is whitespace delimited (with its own rules for escaping with quotes and backslashes)


It is not, but (for reasons unknown to me) it doesn't quote parameters in the default mode. Consider:

touch "a b" ls | xargs rm # this won't work, rm gets two parameters ls | xargs -i,, rm ,, # this will work


https://pubs.opengroup.org/onlinepubs/9699919799/utilities/x...

>[..] arguments in the standard input are separated by unquoted <blank> characters [..]

As for -i, it is documented to be the same as -I, which, among other things, makes it so that "unquoted blanks do not terminate input items; instead the separator is the newline character."


Yeah, I misremembered. Here's an example, using "-n 1" so each split "thing" is passed to separate processes:

  $ printf "one two three\nfour five\n" | xargs -n 1 echo
  one
  two
  three
  four
  five


in zsh you can use:

  P*~*wav*~*draft*
This looks a bit obscure due to lack of spaces, but it's simpler than it seems; the pattern is:

  [glob] ~ [exclude glob]

  P* ~ *wav* ~ *draft*
The ~ being the negate operator, which can be added more than once.

It's essentially the same as "grep -Ev" or "find -iregex".

It's a lot less typing than find, and also something you're likely to use interactively once you're used to it, so it feels very natural.


It's not necessary to bring Python into it, Bash can handle filenames with weird characters properly if you know how to use it.

E.g. instead of `ls | grep -Ev 'wav|draft'`, you'd have to do something like

    for filename in *; do
        if grep -E 'wav|draft' >/dev/null <<< "$filename"
        then : # ...
        fi
    done
Of course, it's more convoluted, but when you're writing scripts that might be used for a long time and by many people, it helps to know that it is possible to write robust things. Tools like shellcheck certainly help.
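A variant of the same loop that skips forking grep for every file, using bash's built-in pattern matching (a sketch, same placeholder body):

    for filename in *; do
        if [[ "$filename" == *wav* || "$filename" == *draft* ]]
        then : # ...
        fi
    done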


At that point I think you need to ask yourself why you're using Bash to begin with. If it's just meant to be a quick script that's run occasionally then this is good but probably overkill. If it's going into prod to be run regularly as part of business critical, then it should be in a language that has a less convoluted way to _ls a directory_. There's an inflection point somewhere in there, where it is depends on you.


Am I monitoring the execution?

Yes: bash is probably fine.

No: real programming language time.


The above is perfectly fine for small directories, but in general the preferred way to loop over files is with find:

  find . ! -name . -prune                   \
           -exec grep -qE 'wav|draft' {} \; \
           -exec "${action}"             \; ;
Edit: I missed the herestring in the original code, so the above is wrong as mentioned in the comments; if your find has regex, you can use it to save one grep:

  find . ! -name . -prune              \
           -regex '.*wav.*\|.*draft.*' \
           -exec "${action}"        \; ;
Otherwise you can call sh to printf the filename into a grep.

However, the point of my post is that find can perform seek, filter and execute, and should be used for all three unless it is really impossible (which is unlikely).


Your example is grepping the file contents, where GP is using grep to select the filenames.


D'oh!


grep -q and you won’t need the redirect of stdout.


Before you write anything, you need to think about the cost of it breaking and the chance of it breaking, and Bash scripts in VC tend to maximize both. I like that heuristic a lot.


The title omits the final '?' which is important, because the rant and its replies didn't settle the matter.

Shellcheck's page on parsing ls links to the article the author is nitpicking on, but it also links to the answer to "what to do instead": use find(1), unless you really can't. https://mywiki.wooledge.org/BashFAQ/020


I guess this is for shell scripts that need to work with "unsafe" filenames?

I've been using Linux since 1999 and I never came across a filename with newlines. On the other hand, pretty much all the "ls parsing" I've done was on the command line, piping it to other stuff, on files I was 100.1% sure would be fine.


When teaching beginners shell, it's natural to teach `ls` for listing directory contents. It's also natural to extend from `ls` to `ls | ...` for processing lists of files.

The important point to get across is that pipes let us build bigger commands from the commands we already know. If needed, you can back up later to teach patterns like `find [...] -exec`, `find [...] -print0 | xargs -0 [...]`, `find [...] | while read -r file; do [...] done` and so on.

There are all kinds of prerequisites to creating files with unusual names. Those barriers tend to mean beginners won't run into file name processing edge cases for a while. The exception will be files they download from the Internet. But the complexity there will usually be quote and non-ASCII Unicode characters, not newlines or other control codes.

In teaching, the one filename complexity I would try to get ahead of, preventively, is spaces. There was a time, way back when, when newbies seemed content to stick with short, simple filenames. These days, the people I've helped tend to be used to using spaces in file names in Finder and Explorer for office or school work.


I wrote a pipe-objects-instead-of-strings shell: https://marceltheshell.org.

Not piping strings avoids this issue completely. Marcel’s ls produces a stream of File objects, which can be processed without worrying about whitespace, EOL, etc.

In general, this approach avoids parsing the output of any command. You always get a stream of Python values.


> In general, this approach avoids parsing the output of any command.

Somewhere, there has to be validation phases. Just because you have objects, doesn't mean they are well formed.

https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

It turns out proper validation is way harder than parsing. There is a reason text based interfaces and formats are so pervasive.


As long as Python does the right thing with globs, there is really no room for marcel to get it wrong. Not sure what additional validation you are thinking of.


Suppose you store the result somewhere and then want to use it afterwards.

How do you check if they are still valid files?


The File object encapsulates a path. If you use it, e.g. to read contents, and the file doesn't exist, then it will fail with an appropriate error message.

Files come and go. References to them go stale. Every user and tool deals with this. This isn't a "validation" issue.


Not sure how portable it is, but gnu ls has a flag to solve this problem trivially:

  --zero    end each output line with NUL, not newline
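For example, reading the NUL-delimited output straight into an array (a sketch assuming coreutils 9.x and bash >= 4.4 for mapfile -d):

  mapfile -d '' files < <(ls --zero)
  printf '%d entries\n' "${#files[@]}"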


Not available in macOS `ls` unfortunately.


Why do you want to put LF bytes into filenames?

Using magic, I've renamed any files you have to remove control characters in the name and made it impossible to make any new ones. (You can thank me later.)

What can't you do now?


Preventing certain characters in filenames would solve a lot of issues, from security issues to wasted time all around.

But for whatever reason, when it is suggested, you get many people chiming in that "filenames should be dumb bytes, anything allowed except / !"


>But for whatever reason, when it is suggested, you get many people chiming in that "filenames should be dumb bytes, anything allowed except / !"

I guess the issue is not what filenames should be, but what filenames are. In general, when interacting with files, you have to expect everything but `/` and nullbyte. Even if you forbid it on your machine, someone may mount a NFS drive and open you to the world of weird filenames. And you never know who uses your code.

And the unicode itself is weird anyway - for example you may have normalised and denormalised names which may be the same or different string depending on how you look at them[1]. And I hope you are not planning to restrict filenames to some anglocentric [a-z0-9_- ]*, because the world is much larger and you can't pretend unicode doesn't exist.

[1] https://eclecticlight.co/2017/04/06/apfs-is-currently-unusab... and many other cases


I've put a file in my home directory named the entirety of your comment, newline included. Unfortunately I had to trim "except / !" to bring it to 255 characters.

Now at least when some tool or pipeline blows up horribly, it'll be hilarious.


Latest POSIX "encourages" implementations to forbid using newlines in pathnames.

https://austingroupbugs.net/view.php?id=251


Or use PowerShell where LS returns a bunch of objects, and say goodbye to string parsing forever.


nushell is the superior structured data shell and it's cross-platform. https://www.nushell.sh/


I've only used Powershell a little bit on Linux and Mac but it seems reasonably cross-platform.

On the surface, it looks like I'd be giving up the decently sized ecosystem of Powershell libraries for a new ecosystem without much support?

I'm interested in knowing what Nushell does differently since I'm wanting to find a better shell.


I'm probably not the best person to ask, since the last time I touched Powershell, it was Windows only, but I'd say nushell is likely a lot more platform-agnostic, has sane syntax and follows a functional paradigm. Plugins are written in Rust. It's probably not worth it if all you do is Windows sysadmin work, as you'd have to serialize and deserialize data when interacting with Powershell from nu.


Last I looked, PowerShell's startup time on Linux was disappointing. Understandable to an extent, given it was bootstrapping a bunch of .NET stuff that would already be there on Windows. But slow enough that I couldn't use or recommend it to my team.


Wait until you realize that "giving up the decently sized ecosystem of Powershell libraries" is a net positive ;-)


Nushell is way less powerful.



Borkdude has a wonderful Clojure/Babashka solution in this space: https://github.com/babashka/fs


What to do instead: use pwsh to completely obviate all these issues.


Alternative shells or higher languages don't solve _all_ the issues.

I won't install a new shell to generate a file list on my CI server. I won't install a new shell on remote machines. Ever.

These structured shells also require commands to be aware of them, either via some plugin that structures their raw I/O output or some convention. They solve _some_ command output structuring but not _all_ the general problem.

So, the answer is good. It promotes the idea that one should be careful when machine parsing output meant for humans.


> I won't install a new shell to generate a file list on my CI server. I won't install a new shell on remote machines. Ever.

Uh... that's on you? Why do you intentionally hinder yourself?

> These structured shells also require commands to be aware of them, either via some plugin that structures their raw I/O output or some convention. They solve _some_ command output structuring but not _all_ the general problem.

Okay. It doesn't solve literally every single problem, that is true. It's still miles ahead. And when interfacing with non-pwsh commands, you just fall back to text parsing/output.


> Uh... that's on you? Why do you intentionally hinder yourself?

Hinder myself? An ephemeral cloud machine would not keep my custom shell anyway. By having to install it _every single time I connect_, I just lose precious time.

I want to be familiar with tools that are _already_ installed everywhere.

The shell is supposed to be a bottom feeder, lowest common denominator, barely usable tool. That way, it can build soon and get stable real fast. That (unintentional) strategy placed it as a core infrastructural piece... everywhere.

Of course, there's scripting and using it on the terminal. But we're talking about scripting, right? Parsing ls and stuff. I want the fast, lean, simple `dash` to parse my fast, lean simple scripts. pwsh is fine for the terminal leather seats.


Ephemeral cloud machines are created from images. Build your own image with the tools you need.


Isn't it ironic that PowerShell from Microsoft is so vastly superior to bash (not because it's great or even better than Python, but because bash is such a terribly low bar to beat) that it totally undermines the "Unix Philosophy"?

Who would have thought that little old Microsoft, purveyors of MSDOS CMD.EXE, would have leapfrogged Unix and come out with something so important and fundamental as a shell that was superior to all of Unix's "standard" sh/csh/bash/whatever shells in so many ways, all of which historically used to be and ridiculously still are touted by Unix Supremacists as one of its greatest strengths?

You see, Microsoft is willing to look at the flaws in their own software, and the virtues of their competitors' software, then admit that they made mistakes, and their competitors did something right, and finally fix their own shit, unlike so many fanatical monolinguistic Unix evangelists.

They did the exact same thing to Java and JavaScript, leaving Visual Basic and CMD.EXE behind in the dustbin of history -- just like Unix should leave bash behind -- resulting in great cross platform languages like C# and TypeScript.

Edit: that reinforces my point that taking so long to get there is a hell of a lot better than taking MUCH LONGER to NOT get there.

Maybe bash's legacy inertia is a problem, not a virtue. It certainly isn't getting a JSON parser in the foreseeable future. The ironic point is that even Microsoft's PowerShell has much less legacy inertia, and is therefore so much better, in such a shorter amount of time.


  > Isn't it ironic that Powershell from Microsoft is so much vastly superior than bash
I agree that powershell is now better than bash. But it took SO LONG to get there. Moreover, bash has had a 12 year head-start (ok, 30 if you count earlier unix shells). Bash has legacy inertia. Even though you can now supposedly run powershell in linux, I don't know anyone who does. Does anybody?

That said, I think powershell is great for utility-knife uses on windows machines.


> Even though you can now supposedly run powershell in linux, I don't know anyone who does. Does anybody?

I do. I replaced all of the automation scripts on my rpi with pwsh scripts, and I'm not regretting it. Not having to deal with decades of cruft in argument parsing and string handling, learning little DSLs for every command, etc. is so worth it.


At this point, all PowerShell has accomplished is creating a separate ecosystem. The designers set out to make a "better" shell and yet refused to ever learn the things they were allegedly "improving".

Basic features are still lacking from PowerShell that have been in UNIX shells since the very beginning: https://github.com/PowerShell/PowerShell/issues/3316

But hey, that's a fixable problem, right? No, because PowerShell is so suffused with arrogance about its superiority that anything, no matter how simple it was to do in a UNIX shell, has to be cross-examined, re-imagined, and bent over the wheel of PowerShell's superiority, before ultimately getting ignored or rejected anyway.

PowerShell is a language unto itself. It is not a replacement for bash/zsh/etc because nobody who knows the latter well can easily migrate to the former, and that's by design.


Some very strong sentiments about a shell...


I want there to be something better than the UNIX shells, at least when it comes to error handling and data parsing. PowerShell was supposed to be that tool, but it seems to have lost sight of that goal somewhere along the way.


Or, once it's API-stable, use nushell.


Python has had an API-stable module for listing directories for decades, you know.


If you're going to skip using the standard shell that is installed everywhere by default, then you should go ahead and use a full language with easily distributed binaries.


Do you recommend it? I feel like I'd get RSI from pressing shift when using it. https://learn.microsoft.com/en-us/powershell/module/microsof...


It's case insensitive for what it's worth. My main problem is trying to figure out which utilities they've bundled into which command.


Yes, I absolutely recommend it. I use it every day.

Commands and flags are case-insensitive.


Powershell is mostly case-insensitive, and most of the core cmdlets have short aliases. Try `Get-Alias` (or `gal`) to learn more.


Many people turn to globbing to save them, which is usually better, but has some problems in case of no matches. But, for Bash, you can do this to fix it:

  shopt -s failglob
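The failure mode being fixed, for anyone who hasn't hit it: with default settings an unmatched glob is passed through literally (a sketch in an empty directory):

  $ for f in *.jpg; do echo "$f"; done
  *.jpg                   # the literal pattern, not a file
  $ shopt -s nullglob     # expand unmatched globs to nothing instead
  $ shopt -s failglob     # or: treat an unmatched glob as an error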


I don't know, this seems like a lot of words to avoid coming to the conclusion that there are many ways to skin a directory.

Most of the time it's fine to just suck in ls and split it on \n and iterate away, which I do a lot because it's just a nice and simple way forward when names are well-formed. Sometimes it's nicer to figure out a 'find at-place thing -exec do-the-stuff {} \;'. And sometimes one needs some other tool that scours the file system directly and doesn't choke on absolutely bizarre file names and gives a representation that doesn't explode in the subsequent context, whatever that may be, which is quite rare.

A more common issue than file names consisting of line breaks is unclean encodings, non-UTF-8 text that seeps in from lesser operating systems. Renaming makes the problem go away, so one should absolutely do that and then crude techniques are likely very viable again.


Today I learned how neat find is:

  find ~/Music -iname 'p*' -not -iname '*age*' -not -iname '*etto*'
  find ~/Music -iname 'p*' -not -iregex '.*\(age\|etto\).*'
  find ~/Music -regextype posix-extended -iname 'p*' -not -iregex '.*(age|etto).*'
Not that I'm likely to ever use any of that in anger, but it's good to know if ever I do wind up needing it.


I wonder if anyone has implemented a kernel module or something to limit filenames to a sane set. Just ensuring that they are valid UTF-8 and do not contain any non-printable characters would be a huge improvement. Sure, some niche applications might break, so it's not something that can be made the default, but I still think it would help on systems I control.


These sorts of pedantic exchanges are so pointless to me. We are programmers. We can control what characters are used in filenames. Then you can use the simplest tool for the job and move on with your life to focus on the stuff that actually matters. Fix the root cause instead of creating workarounds for the symptom.


I feel like Unix utilities should provide a standardized way to generate machine-readable output, perhaps using JSON.


The same information is already available in a machine–readable format. Just call readdir. You don’t need to run ls, have ls call readdir and convert the output into JSON, and then finally parse the JSON back into a data structure. You can just call readdir!


I know, but it would be so great if __every__ Unix utility just had the same type of output. By the way, ls does more than just readdir.


Can you call readdir() from a shell easily?

WRT format, I'd prefer csv.


Here is a trivial program to dump dents to stdout, suitable for shell pipelines. Example usage: `./getdents64 . | xargs -0 printf "%q\n"`

    #define _GNU_SOURCE
    #include <dirent.h>
    #include <fcntl.h>
    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BUF_SIZE 32768

    struct linux_dirent64 {
      ino64_t d_ino;           /* 64-bit inode number */
      off64_t d_off;           /* Not an offset; see getdents() */
      unsigned short d_reclen; /* Size of this dirent */
      unsigned char d_type;    /* File type */
      char d_name[];           /* Filename (null-terminated) */
    };

    int writeall(char *buf, size_t len) {
      ssize_t wres = 0;
      wres = write(1, buf, len);
      if (wres == -1) {
        perror("write");
        return -1;
      }
      if (((size_t)wres) < len) {
        return writeall(buf + wres, len - wres);
      }
      return 0;
    }

    int main(int argc, char **argv) {
      if (argc != 2) {
        return EXIT_FAILURE;
      }
      int fd = open(argv[1], O_DIRECTORY | O_RDONLY);
      if (fd == -1) {
        perror("open");
        return EXIT_FAILURE;
      }
      void *buf = malloc(BUF_SIZE);
      ssize_t res = 0;
      do {
        res = getdents64(fd, buf, BUF_SIZE);
        if (res == -1) {
          perror("getdents64");
          return EXIT_FAILURE;
        }
        void *it = buf;
        while (it < (buf + res)) {
          struct linux_dirent64 *elem = it;
          it += elem->d_reclen;
          size_t len = strlen(elem->d_name);
          if (writeall(elem->d_name, len + 1) == -1) {
            return EXIT_FAILURE;
          }
        }
      } while (res > 0);
      return EXIT_SUCCESS;
    }


You’re still doing unnecessary work. You’re turning a list of files into a string, then parsing the string back into words.

Your shell already provides a nice abstraction over calling readdir directly. A glob gives you a list, with no intermediate stage as a string that needs to be parsed. You can iterate directly over that list.

Every language provides either direct access to the C library, so that you can call readdir, or it provides some abstraction over it to make the process less annoying. In Common Lisp the function `directory` takes a pathname and returns a list of pathnames for the files in the named directory. In Rust there is the `std::fs::read_dir` that gives you an iterator that yields `io::Result<std::fs::DirEntry>`, allowing easy handling of io errors and also neatly avoiding an extra allocation. Raku has a function `dir` that returns a similar iterator, but with the added feature that it can match the names against a regex for you and only yield the matches. You can fill in more examples from your favorite languages if you want.


There is a glob() function you can use in POSIX C also to get an array of strings.

The getdents system call being used in the above program is the basis for implementing readdir.

It doesn't return a string, but rather a buffer of multiple directory entries.

The program isn't parsing a giant string; it is parsing out the directory entry structures, which are variable length and have a length field so the next one can be found.

The program writes each name including the null terminator, so that the output is suitable for utilities which understand that.


The problem is the phrase “suitable for shell pipelines”. If you are in a shell, you should not be doing anything like this. You should use a glob directly in the shell. You should not be calling an external program, having that program print out something, and then parsing it. Just use a glob right there in your shell script. If you do anything else, you are doing it wrong.

Do I really have to say this again?


Wow, these replies. I was being a little sarcastic as there is no 'readdir' shell command. That is all.


Certainly. Just do `for f in *`. See how easy that is?


`find` is also an option, or shell globs.


Right, globs are syntactic sugar on top of readdir. Definitely use them when you are in a shell. But in general the solution is to call readdir, or some language facility built directly on top of it. Calling ls and asking it for JSON is the stupid way to do things.


Just curious, how would you approach getting output from utilities like "df", "mount" and "parted"?


Generally speaking, can't you limit/define the output of those commands and parse them that way? Like df --portability, --total, or --output.
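For example, with GNU df you can ask for exactly the columns you want (a sketch; --output field names as in recent coreutils):

  # one filesystem per line, fixed columns, no human-oriented wrapping
  df --output=source,fstype,size,used,avail,target /home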

And/or use their return codes to verify that something worked or didn't

Or hope your higher level programming language contains built-ins for file system manipulations


How is that any easier than just giving a standardized --json flag?


It doesn't require trying to organize a small revolution across dozens of GNU tools, many authors, and numerous distros...?

I'd love to see standard JSON output across these tools. I just don't see a realistic way to get that to happen in my lifetime.

Maybe a unified parsing layer is more realistic, like an open source command output to JSON framework that would automatically identify the command variant you're running based on its version and your shell settings, parse the output for you, and format it in a standard JSON schema? Even that would be a huge undertaking though.

There are a lot, LOT of command variants out there. It's one thing to tweak the output to make it parseable for your one-off script on your specific machine. Not so easy to make it reusable across the entire *nix world.


With regards to parted, if you only want to query for information, there is "partx" whose output was purposefully designed to be parsed. I have good experiences with it.
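For instance, something along these lines (a sketch; options as documented for util-linux partx, device name illustrative):

  # one partition per line, in KEY="value" pairs meant for scripts
  partx --pairs /dev/sda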


That doesn't solve the problem that bash is completely useless for manipulating JSON.

It certainly would make writing Python scripts that need to interact with other programs easier. But Python doesn't desperately NEED to interact with so many other programs for simple tasks like enumerating files, making HTTP requests, or parsing JSON, the way bash does.


Bash is useless at JSON now. There's nothing stopping Bash from introducing native JSON parsing.


Then you have to install the new version of bash on every system where you depend on JSON parsing, negating the argument that bash is installed everywhere.

If bash was ever actually going to get json parsing in reality, it should have done that two decades ago like all the other scripting languages, since JSON is 23 years old. So don't hold your breath.



The bash code that creates the C file which gets the list of null-terminated files in a directory, compiles it, and runs it is easier to write and understand. Bash is a lousy language to do anything in; Python is almost always available, and if not, then a C compiler is.
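A rough sketch of that pattern, reusing the getdents64 program posted upthread (file and output paths here are illustrative):

  # build the lister once, then consume its NUL-delimited output safely
  cc -O2 -o /tmp/list0 getdents64.c
  /tmp/list0 . | xargs -0 printf '%q\n'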


Recent discussion about the original "don't parse" page being referenced:

https://news.ycombinator.com/item?id=40692698 (10 days ago, 83 comments)


Files and directories, once a reference to them is obtained, should not be identified by their path. This causes all kinds of problems, like the reference breaking when the user moves or renames things, and issues like the ones described in the article, where some "edge case" (and I'm using that term very loosely, because it includes common situations like a space in a file name) causes problems down the line.

You might say that people don't move or rename things while files are open, but they absolutely do, and it absolutely breaks things. Even something as simple as starting to copy a directory in Explorer to a different drive, and then moving it while the copy is ongoing, doesn't work. That's pathetic! There is no technical reason this should not be possible.

And who can forget the case where an Apple installer deleted people's hard disk contents when they had two drives, one with a space character, and another one whose name was the string before the first drive's space character?

Files and directories need to have a unique ID, and references to files need to be that ID, not their path, in almost all cases. MFS got that right in 1984, it's insane that we have failed to properly replicate this simple concept ever since, and actually gone backwards in systems like Mac OS X, which used to work correctly, and now no longer consistently do.


IDs don't really solve many problems. The issues with scripts removing all your files were either caused by the absurd bash spaces and quotes rules, or by bash silently ignoring nonexistent variables. Those scripts would still need paths, since the ID of ~/.steam will be different for everyone. Scripts that need to work on more than one system, and human-authored config files, would still have paths. There are cases where you want to depend on the path, not the identity of the folder, and potentially swap the folder with something else without editing configuration.

Explorer needs to support local drives, with a lot of filesystems, including possibly third-party ones, but also network drives, FTP, WebDAV, and a bunch of other niche things. Not all of them have IDs and might not be possible to be extended. The cost is massive, solving it everywhere is impossible, and the benefit seems negligible to me (even though I fairly recently managed to eject a disk image (vhdx) in the middle of copying files onto it…)


Earlier versions of Mac OS had APIs to retrieve the IDs of directories and files relevant for things like installing applications (such as the System directory). It effectively never used paths to identify any files; if users opened a file, they'd use the system file picker, which would provide the application a file ID, not a path.

Similarly, things like config files would be identified by their name, not their path, because the directory containing configs was a directory the system knew about. As a result, no application needed to know the path to its own config files.

This meant there was no action that the system prevented you from doing to an open file, other than actually deleting that file. There was also no way for an installer to accidentally break your system because its code didn't take your drive, file, or directory names into account.

And, of course, there are file systems that don't use paths at all, like HashFS, a bunch of modern document management systems, or the Newton's Soup.

I get your point about interoperability with existing file systems, but I think it's perfectly acceptable to offer better solutions where possible, and fall back to paths for situations where that is not possible.


This is a problem I faced recently on Linux. You can use ip addr to see the list of your IPv6 addresses and their types (temporary or not, etc). But doing it programmatically from a non-C codebase is way more involved.
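For what it's worth, recent iproute2 releases can emit JSON themselves, which helps from any language (a sketch; JSON field names may vary by version):

  # list IPv6 addresses as JSON and let jq do the parsing
  ip -json -6 addr show | jq -r '.[].addr_info[].local'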


Most of the time I avoid parsing ls, but I haven't found a reliable way to do this one:

  latest="$(ls -1 $pattern | sort --reverse --version-sort | head -1)"
Anyone got a better solution?


This one's a hard one. Since "--version-sort" isn't standard anyway, let's assume we can use flags which are common to BSD and GNU. Furthermore, let's assume bash or zsh so we can use "read -d ''".

In that case, how about:

  IFS='' read -r -d '' latest < <(find $pattern -prune -print0 | sort -z --reverse --version-sort)


This should work with any arbitrary filename:

    latest=$(printf '%s\0' <glob> | sort -zrV | head -zn1)
or with long args:

    latest=$(printf '%s\0' <glob> | sort --zero-terminated --reverse --version-sort | head --zero-terminated --lines 1)


What unix is this on? Neither the mac nor gnu manpages have a -z or --zero-terminated option for head.



Yay! Glad to see zero termination flags in more places.

EDIT: The Linux manpages I read were from die.net, which were apparently from 2010; I guess I'll have to avoid them in the future. I checked the FreeBSD, OpenBSD, and macOS man pages to make sure, and unfortunately none of them support the -z flag yet.


I just solve this by not having files like that on my computer. No spaces. No null chars.


Microsoft famously named the "Documents and Settings" folder that way to defeat developers like you :).

Anyway, check it: you might find you have files with spaces after all. For me it's:

  * /boot/System Volume Information
  * /proc/irq/126/PCIe PME
  * /sys/bus/platform/drivers/int3403 thermal
  * /etc/NetworkManager/system-connections/Hotel Xxx.nmconnection
  * /home/xxx/.cache/chromium/Default/Code Cache
  * and many others.


This is great. I `del mydocu~1` to kingdom come! Thank you. I meant in my home dir. I would never dare to presume anything about the rest. But it looks like Google demands that I be good at my job too.


I searched through the page and have not found `find ... -printf "%M %n %u %g %s ...\0"` mentioned. This way you get ls(1)-like output, yet machine-parseable.
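Something along these lines, assuming GNU find and bash (the format specifiers and fields are just an example):

  find . -maxdepth 1 -printf '%M %n %u %g %s %T@ %p\0' |
    while IFS= read -r -d '' entry; do
      printf '%q\n' "$entry"   # each record is one NUL-terminated line of metadata + name
    done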


Now of course, scripts and pre-commit hooks enforcing simple rules, so that file names may only use a subset of Unicode, are a thing and do help.

Do you really think that, say, all music streaming services store their songs under names allowing Unicode HANGUL fillers and control characters that change the direction of text?

Or... maybe, just maybe, Unicode characters belong in metadata, and a strict rule of "only visible ASCII chars are allowed and nothing else or you're fired" does make sense.

I'm not saying you always have control over every single filename you'll ever encounter. But when you do have power over that and can enforce saner rules, sometimes it's a good idea to use it.

You'll thank me later.


[flagged]


I do not think parsing JSON requires a full-blown JavaScript engine.


Neither does a graphical front end but that’s never stopped them.


Just out of curiosity, what would be your proposal to use instead?


>Some front end clown is about to suggest all tools should output json by default aren’t they

This unironically sounds good (and, in case this matters, I'm not a front end "clown", but a reverse engineer who mostly uses C and Python). Unified, formatted output from command-line tools is something that is severely missing from the unix ecosystem.


JSON is maybe a bit heavy, but using a machine-readable format such as TSV or CSV (including configuring your terminal emulator to display it properly) would be a big step up from the status quo.


What do you suggest? Have 100 different ways to parse output? Think about the resulting code bloat.

And no, you don't need V8 to parse JSON.


Do you really think outputting a stream of JSON, as opposed to plain text, would add any measurable overhead to all command line tools?

Honestly, I'd love this. Output one JSON object per file; bash already has hash tables and lists, so it has all the types we need for JSON already.
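For instance, a quick sketch of holding per-file metadata in bash's own types (assumes GNU stat):

  declare -A size                      # associative array: filename -> bytes
  for f in *; do
    size["$f"]=$(stat -c %s -- "$f")   # GNU stat; BSD/macOS would be: stat -f %z
  done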


Ever heard of jq?
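It covers a lot. For example, a hedged sketch of building a JSON array of filenames with stock tools (names containing newlines excepted, which is of course the thread's whole point):

  # turn a glob into a JSON array of strings, one line per name
  printf '%s\n' * | jq -nR '[inputs]'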


Sadly, JSON does not handle non-UTF-8 strings.


Is that a problem for this application, though? Don’t most people encode their file names in utf8, or is that an ASCII-centric falsehood?


If ls returned JSON then it would have to decide what to do with non-UTF-8 filenames (or even users and groups; I do not know what the rules are there); it could return either "filename.txt" or {"encoding":"base64", "data":"<base64 blob>"} to sidestep the problem, but it is not a very elegant solution.


An extension of this could be to also input everything in JSON: {"command":"ls", "parameters":["-l"]}, etc.


Didn't Microsoft try to define something like that with Powershell, with parameters being objects (though not JSON)?


But hey, at least it's not YAML!


The funny thing is that so, so many bits of info come in formats that are very much like, but not quite, YAML.

e.g. /proc/cpuinfo


Of course, with some k8s yaml we can run all our cli tools in separate containers each with their own userland.



