I think that when someone uses ls instead of a glob it means they most probably don't understand shell. I don't see any advantage of parsing ls output when glob is available. Shell is finicky enough to not invite more trouble. Same with word splitting, one of the reasons to use shell functions, because then you have "$@" which makes sense and any other way to do it is something I can't comprehend.
Maybe I also don't understand shell, but as it was said before: when in doubt switch to a better defined language. Thank heavens for awk.
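A minimal sketch of why "$@" is the sane way to forward arguments. The function names here are made up for illustration; the only point is the difference between the quoted and unquoted forms.

```sh
#!/bin/sh
# count_args prints how many arguments it received.
count_args() {
    echo "$#"
}

# Hypothetical wrappers: one forwards "$@" quoted, the other unquoted.
forward_quoted()   { count_args "$@"; }
forward_unquoted() { count_args $@; }

quoted=$(forward_quoted "my file.txt" "notes.md")
unquoted=$(forward_unquoted "my file.txt" "notes.md")

# The quoted form keeps "my file.txt" as one argument; the unquoted form splits it.
echo "quoted: $quoted, unquoted: $unquoted"
```

With the quoted form the wrapper sees exactly what the caller passed; the unquoted form re-splits on whitespace (and would also expand glob characters).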
> I think that when someone uses ls instead of a glob it means they most probably don't understand shell.
In 25 years of using Bash, I've picked up the knowledge that I shouldn't parse the output of ls. I suppose that it has something to do with spaces, newlines, and non-printing characters in file names. I really don't know.
But I do know that when I'm scripting, I'm generally wrapping what I do by hand, in a file. I'm codifying my decisions with ifs and such, but I'm using the same tools that I use by hand. And ls is the only tool that I use to list files by hand - so I find it natural that people would (naively) pick ls as the tool to do that in scripts.
What I don't understand is why in the two most popular Unix flavors we have not got something like a json list output or something else that is parsable.
Is it really that difficult to add --json as a flag?
i think it boils down to the dependencies then needed to parse the json, coupled with the fact that glob syntax already covers iterating over the files regardless of the characters used in the filename.
there are other tools than `ls` whose sole purpose is to list files; some have "improved" features over ls, etc.
similarly, from above (and well said btw, even what is not quoted):
> I do know that when I'm scripting, I'm generally wrapping what I do by hand, in a file. I'm codifying my decisions with ifs and such, but I'm using the same tools that I use by hand
a lot of us do similar and we know/expect the ins and outs; and all it takes to break our scripts is some edge case we never thought of. we are fortunate that our keyboard layouts are basically ascii; other languages are less fortunate. now introduce open source, community-driven software where bash code that parses ls output deletes somebody's home directory (because an edge case in some user's file names causes some obscure "fun" times). an edge case is still painful..
and finally, sometimes it's better to elevate said bash script to python (or awk), etc. it just depends on the situation and the complexity of the logic
Globs don't help much when I want file attributes and sizes. Yeah I can pipe to something that can do it, but it would be nice to just get filenames, attrs, size, dates in a json array as output.
Look ls is one of the most basic and natural Unix commands. Make it modern and useful.
Bash gibberish is fun for gatekeeping scripting neckbeards, but it's not what a proper OS should have.
the thing with the coreutils is they provide basic core functionality; you don't need bash on your system - `ls` is not bash (and then you still end up with busybox, where json still would not be part of ls). add more utilities to your system to do more complex logic; i've used apps similar to this in the past: https://github.com/kellyjonbrazil/jc
there's also zero-terminated output from ls with `--zero`, which can then be piped to a number of apps that support the same (read, xargs, etc.)
you might also check out powershell on linux, which may suit your needs: instead of string manipulation, everything is an object
Judging from the context, the user interface was fine in the days of limited resources (a 16 kiloword PDP-11 was cited) but then modern computers have the resources for better user interfaces.
They clearly didn't realize that even more modern Unix kernels would require hundreds of megabytes just to boot.
Of course, but the same (with a bit lower number of years) can be said about Windows, or HTTP, or the web with its HTML+JS+CSS unholy trinity, or email, or anything old and important really. It's scary how much of our modern infrastructure hinges on hacks made tens of years ago.
People new to the internet think alike. Still, not a day passes without us being reminded how fragile yet amazing all this information theory stuff is.
because I never got to the point where I could remember the strange punctuation that the shell requires for loops without looking up the info pages for bash whereas I've thoroughly internalized awk syntax.
Word is you should never write something like that because you'll never get the escaping right and somebody could craft inputs that would cause arbitrary code execution. I mean, they try to scare you into using xargs, but I find xargs so foreign I have to read the whole man page every time I want to do something with it.
I switched the command to a graphics magick based resize since that's the tool these days, default quality is 75% (for JPEG), but is included as a commonly desired customization. ,, is from a different comment in this thread; it seems better self-documenting than the single , I'd traditionally use.
Does that not fail when you hit the maximum command line length? Doesn't the entirety of the directory get splatted? Isn't this the whole reason xargs exists?
No, it does not fail. Maximum command line length exists in the operating system, not the shell; you can't launch a program with too many arguments (argc), and you can't launch a program with an argv string that's too long.
But when you execute a for loop in bash/sh, the 'for' command is not a program that is launched; it's a keyword that's interpreted, and the glob is also interpreted.
Thus, no, that does not fail when you hit the maximum command line length (POSIX only guarantees 4096 bytes, though most modern *nix systems allow far more). It'll fail at other limits, but those limits exist in bash and are much larger. If you want to move to a stream-processing approach to avoid any limits, then that is possible, while probably also being a sign you should not use the shell.
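A rough sketch of the distinction, assuming a throwaway directory created with mktemp. It does not actually exceed the limit; it only shows that the loop expands the glob inside the shell, with no exec involved, while `getconf ARG_MAX` reports the limit that applies when a program is launched.

```sh
#!/bin/sh
# The exec-time limit on arguments plus environment, as reported by the OS.
limit=$(getconf ARG_MAX)
echo "ARG_MAX: $limit bytes"

# Create a scratch directory with a couple of thousand empty files.
tmp=$(mktemp -d)
i=0
while [ "$i" -lt 2000 ]; do
    : > "$tmp/file_$i"
    i=$((i + 1))
done

# The glob is expanded by the shell itself; "for" never calls execve(),
# so ARG_MAX does not apply to this list.
count=0
for f in "$tmp"/*; do
    count=$((count + 1))
done
echo "looped over $count files"

rm -rf "$tmp"
```

The limit would only come into play if the expanded glob were handed to an external command in one go, e.g. `some_command "$tmp"/*`.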
The for loop only runs resize once per file. So no, the entire directory does not get splatted. It is unlikely you'd hit maximum command length.
At least on mac, the max command length is 1048576 bytes, while the maximum path length in the home directory is 1024 bytes. There might be some unix variant where the max path length is close enough to the max command length to cause an overflow, but I doubt that is the case for common ones.
xargs exists in an attempt to be able to parse command output. You could for instance have awk output xargs formatted file names to build up a single command invocation from arbitrary records read by awk. Note that xargs still has to obey the command line length limit though, because the command line needs to get passed to the program. Thus, in a situation where this for loop overflows the command line, it would cause xargs to also fail. Thus I would always use globbing if I have the choice.
EDIT: If you mean that the directory is splatted in the for loop, then in a theoretical sense it is. However, since "for" is a shell builtin, it does not have to care about command line length limits to my knowledge.
This shouldn't overrun the command line length for resize, since resize only gets fed one filename at a time. I do think that the for loop would need to hold all the filenames in a naive shell implementation. (I would assume most shells are naive in this respect) The for loop's length limit is probably the amount of ram available though. I find it improbable that one could overflow ram with purely pathnames on a PC, since a million files times 100 chars per file is still less than a gig of ram. If that was an issue though, one would indeed have to use "find" with "-exec" instead to make sure that one was never holding all file names in memory at the same time.
I just use find. it's a little longer but gives me the full paths and is more consistent. also works well if you need to recurse. something like `find . -type f | while read -r filepath; do whatever "${filepath}"; done`
I love this example, because it highlights how absolutely cursed shell is if you ever want to do anything correctly or robustly.
In your example, newlines and spaces in your filenames will ruin things. Better is
find … -print0 | while read -r -d $'\0'; do …; done
This works in most cases, but it can still run into problems. Let's say you want to modify a variable inside the loop (this is a toy example, please don't nit that there are easier ways of doing this specific task).
declare -a list=()
find … -print0 | while read -r -d $'\0' filename; do
list+=("${filename}")
done
The variable `list` isn't updated at the end of the loop, because the loop is done in a subshell and the subshell doesn't propagate its environment changes back into the outer shell. So we have to avoid the subshell by reading in from process substitution instead.
declare -a list=()
while read -r -d $'\0' filename; do
list+=("${filename}")
done < <(find … -print0)
Even this isn't perfect. If the command inside the process substitution exits with an error, that error will be swallowed and your script won't exit even with `set -o errexit` or `shopt -s inherit_errexit` (both of which you should always use). The script will continue on as if the command inside the subshell succeeded, just with no output. What you have to do is read it into a variable first, and then use that variable as standard input.
files="$(find … -print0)"
declare -a list=()
while read -r -d $'\0' filename; do
list+=("${filename}")
done <<< "${files}"
I think there's an alternative to this that lets you keep the original pipe version when `shopt -s lastpipe` is set, but I couldn't get it to work with a little experimentation.
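For what it's worth, here is a sketch of the `lastpipe` variant. Bash documents that `lastpipe` only takes effect when job control is not active, which is the default in scripts but not at an interactive prompt; that may be why a quick experiment did not work. The file names and temp directory are made up for the demo.

```bash
#!/usr/bin/env bash
set +m              # make sure job control is off; lastpipe is ignored otherwise
shopt -s lastpipe   # run the last command of a pipeline in the current shell

tmp=$(mktemp -d)
touch "$tmp/a b.txt" "$tmp/c.txt"

declare -a list=()
# The while loop is the last pipeline element, so with lastpipe it runs in
# this shell and the appends to "list" survive the loop.
find "$tmp" -type f -print0 | while IFS= read -r -d '' filename; do
    list+=("$filename")
done

echo "collected ${#list[@]} files"
rm -rf "$tmp"
```

This keeps the pipe form, though a failure of find is still only visible with `set -o pipefail`.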
Also be aware that in all of these, standard input inside the loop is redirected. So if you want to prompt a user for input, you need to explicitly read from `/dev/tty`.
My point with all this isn't that you should use the above example every single time, but that all of the (mis)features of shell compose extremely badly. Even piping to a loop causes weird changes in the environment that you now have to work around with other approaches. I wouldn't be surprised if there's something still terribly broken about that last example.
You have really proven your point even more than you meant to. Unfortunately none of these examples are robust.
The "-r" flag allows backslash escaping record terminators. The "find" command doesn't do such escaping itself, so that flag will cause files with backslashes at the end to concatenate themselves with the next file.
Furthermore, if IFS='' is not placed before each instance of read, or set somewhere earlier in the program, then leading and trailing whitespace will be stripped from each filename.
EDIT: I proved your point even more. The "-r" flag does the opposite of what I thought it did, and disables record continuation. So the correct way to use read would be with IFS='' and the -r flag.
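Putting the two corrections together, a sketch of the read loop with both `IFS=` and `-r`. The awkward file names are invented for the demo; the loop stores only the base name so the result is easy to inspect.

```bash
#!/usr/bin/env bash
tmp=$(mktemp -d)
# One name with leading spaces, one containing a backslash, one ordinary.
touch "$tmp/  leading spaces" "$tmp/back\\slash" "$tmp/plain"

declare -a list=()
# IFS= keeps leading/trailing whitespace, -r keeps backslashes literal,
# -d '' makes read stop at the NUL bytes emitted by -print0.
while IFS= read -r -d '' filename; do
    list+=("${filename##*/}")
done < <(find "$tmp" -type f -print0)

# %q prints each name in a shell-quoted form so odd characters are visible.
printf '%q\n' "${list[@]}"
rm -rf "$tmp"
```

Dropping `IFS=` loses the leading spaces of the first name, and dropping `-r` eats the backslash in the second.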
Both `find -exec` and xargs expect an executable command whereas `while read; ...; done` executes inline shell code.
Of course you can pass `sh -c '...'` (or Bash or $SHELL) to `find -exec` or xargs but then you easily get into quoting hell for anything non-trivial, especially if you need to share state from the parent process to the (grand) child process.
You can actually get `find -exec` and xargs to execute a function defined in the parent shell script (the one that's running the `find -exec` or xargs child process) using `export -f` but to me this feels like a somewhat obscure use case versus just using an inline while loop.
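A small sketch of the `export -f` approach, assuming bash is the child shell (exported functions are a bash feature). The function name `describe` and the file names are hypothetical.

```bash
#!/usr/bin/env bash
# A function defined in the parent script.
describe() {
    printf 'file: %s\n' "$1"
}
# Export it so that child bash processes can call it.
export -f describe

tmp=$(mktemp -d)
touch "$tmp/one" "$tmp/two words"

# find starts a new bash per file; the exported function is visible there.
# The path arrives as $1, so no quoting of {} inside the script is needed.
output=$(find "$tmp" -type f -exec bash -c 'describe "${1##*/}"' _ {} \; | sort)
printf '%s\n' "$output"

rm -rf "$tmp"
```

Passing the path as a positional argument rather than embedding `{}` in the script text is what keeps this out of quoting hell.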
I will sometimes use the "| while read" syntax with find. One reason for doing so is that the "-exec" option to find uses {} to represent the found path, and it can only be used ONCE. Sometimes I need to use the found path more than once in what I'm executing, and capturing it via a read into a reusable variable is the easiest option for that.
I'd say I use "-exec" and "| while read" about equally, actually. And I admittedly almost NEVER use xargs.
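One possible way to reuse the found path several times while staying with `-exec`: hand it to a small inline `sh -c` script as a positional parameter. This is only a sketch; the file name and the backup step are invented for illustration.

```sh
#!/bin/sh
tmp=$(mktemp -d)
printf 'data\n' > "$tmp/report one.txt"

# {} appears once on the find command line; inside the inline script the
# path is "$1" and can be referenced as often as needed.
find "$tmp" -type f -name '*.txt' -exec sh -c '
    cp "$1" "$1.bak"
    printf "backed up %s\n" "${1##*/}"
' _ {} \;

if [ -f "$tmp/report one.txt.bak" ]; then
    backup_created=yes
else
    backup_created=no
fi
echo "backup created: $backup_created"

rm -rf "$tmp"
```

The `_` fills `$0` of the inline script so that the path lands in `$1`.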
When you don't want to waste your time and sanity and happiness being in doubt and then throwing away all you've done and switching to a new language in mid stream, just don't even start using a terribly crippled shell scripting language in the first place, also and especially including awk.
The tired old "stick to bash because it's already installed everywhere" argument is just as weak and misleading and pernicious as the "stick to Internet Explorer because it's already installed everywhere" argument.
It's not like it isn't trivial to install Python on any system you'll encounter, unless you're programming an Analytical Engine or Jacquard Loom with punched cards.
In most places where I run shell scripts, there is no Python. There could be if I really wanted it but it's generally unnecessary waste.
On top of it, shell is better than Python for many things, not to mention faster.
It's also, as you mentioned, ubiquitous.
In the end, choose the tool that makes more sense. For me, a lot of the time, that's a shell script. Other times it may be Python, or Go, or Ruby, or any of the other tools in the box.
A waste of what, disk space? I'd much rather waste a few megabytes of disk space than hours or days of my time, which is much more precious. And what are you doing on those servers, anyway? Installing huge amounts of software, I bet. So install a little more!
For decades, on most Windows computers I run web browsers, there's always Internet Explorer. So do you still always use IE because installing Chrome is "wasteful"? It's a hell of a lot bigger and more wasteful than Python. As I already said, that is a weak and misleading and pernicious argument.
So what exactly is bash better than Python at, besides just starting up, which only matters if you write millions of little bash and awk and sed and find and tr and jq and curl scripts that all call each other, because none of them are powerful or integrated enough to solve the problem on their own.
Bash forces you to represent everything as strings, parsing and serializing and re-parsing them again and again. Even something as simple as manipulating json requires forking off a ridiculous number of processes, and parsing and serializing the JSON again and again, instead of simply keeping and manipulating it as efficient native data structures.
It makes absolutely no sense to choose a tool that you know is going to hit the wall soon, so you have to throw out everything you've done and rewrite it in another language. And you don't seem to realize that when you're duct-taping together all these other half-assed languages with their quirky non-standard incompatible byzantine flourishes of command line parameters and weak antique domain specific languages, like find, awk, sed, jq, curl, etc, you're ping-ponging between many different inadequate half-assed languages, and paying the price for starting up and shutting down each of their interpreters many times over, and serializing and deserializing and escaping and unescaping their command line parameters, stdin, and stdout, which totally blows away bash's quick start-up advantage.
You're arguing for learning and cobbling together a dozen or so different half-assed languages and flimsy tools, none of which you can also use to do general purpose programming, user interfaces, machine learning, web servers and clients, etc.
Why learn the quirks and limitations of all those shitty complex tools, and pay the cognitive price and resource overhead of stringing them all together, when you can simply learn one tool that can do all of that much more efficiently in one process, without any quirks and limitations and duct tape, and is much easier to debug and maintain?
> For decades, on most Windows computers I run web browsers, there's always Internet Explorer. As I already said, that is a weak and misleading and pernicious argument.
On its own, I agree. But you glossed over everything else I said, so I'm not going to entertain your weak argument.
You seem to ignore that different users, different use cases, different environments, etc. all need to be taken into account when choosing a tool.
Like I said, for most of my use cases where I use shell scripting, it's the best tool for the job. If you don't believe me, or think you know better about my circumstances than I do, all the power to you.
> You seem to ignore that different users, different use cases, different environments, etc. all need to be taken into account when choosing a tool.
I have worked on projects that are extremely sensitive to extra dependencies and projects that aren't.
Sometimes I am in an underground bunker and each dependency goes through an 18 month Department of Defense vetting process, and "Just install python" is equivalent to "just don't do the project". Other times I have worked on projects where tech debt was an afterthought because we didn't know if the code would still be around in a week and re-writing was a real option, so bringing in a dependency for a single command was worthwhile if we could solve the problem now.
There is appetite for risk, desire for control, need for flexibility, and many other factors just as you stated that DonHopkins is ignoring or unaware of.
I'm an avid jq user. There are certainly situations where it's better to use python because it's just more sane and easier to read/write, but jq does a few things extremely well, namely, compressing json, and converting json files consisting of big-ass arrays into line delimited json files.
One advantage: `ls -i` gives you the file's inode in a POSIX portable way. If you glob and then look it up individually for each file, you'll need to be aware of which tool (and whether it's GNU or BSD in origin) you use on which platform.
In general yes globbing is better for iterating through files. But parsing `ls` doesn't necessarily mean the author doesn't know shell. It might mean they know it well enough to use the tools that are made available to them.
Usually the pattern is "for f in [glob]", which doesn't have that issue. Running "ls" on a directory is little more than `for f in *; do echo "$f"; done`, so there's little advantage to using "ls".
Also: "find -exec cmd {} +" will take ARG_MAX into account, and may be much faster depending on what you're doing.
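A quick sketch of the difference between the `\;` and `+` terminators, counting how many times the command is started. The fifty scratch files are an arbitrary choice, small enough that they fit into a single batch on any common system.

```sh
#!/bin/sh
tmp=$(mktemp -d)
i=0
while [ "$i" -lt 50 ]; do
    : > "$tmp/f$i"
    i=$((i + 1))
done

# With \; the command is started once per file.
per_file=$(find "$tmp" -type f -exec sh -c 'echo run' _ {} \; | wc -l | tr -d ' ')

# With + find packs as many paths as fit under the limit into each invocation.
batched=$(find "$tmp" -type f -exec sh -c 'echo run' _ {} + | wc -l | tr -d ' ')

echo "per-file runs: $per_file, batched runs: $batched"
rm -rf "$tmp"
```

With very large directories the `+` form may start the command more than once, but each batch stays within the limit.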
It's not the case very often these days, but it used to be quite simple to blow up your script by globbing in a directory with a lot of files, and you can still hit the limit if you pass a glob to some command, because it can blow up trying to execve(). Here are more details of the issue and some workarounds: https://unix.stackexchange.com/questions/120642/what-defines...
It's anchored on the left. ${name#subdir/} will turn 'subdir/abc' into 'abc', but will not touch foo/subdir/bar. I don't think bash even has syntax to replace in the middle of an expansion, I always pull out sed for that.
Thanks for clarifying, I learned something new today!
Edit: It turns out that Bash does substitutions in the middle of strings using the ${string/substring/replacement} and ${string//substring/replacement} syntax, for more details see https://tldp.org/LDP/abs/html/string-manipulation.html
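A short sketch of the expansions discussed here, using the same sample strings plus one invented value (`a-b-c`) for the global replacement.

```bash
#!/usr/bin/env bash
name='subdir/abc'
other='foo/subdir/bar'
dashed='a-b-c'

# ${var#pattern} removes a match only at the start of the value.
stripped=${name#subdir/}
untouched=${other#subdir/}

# ${var/pattern/replacement} replaces the first match anywhere in the value;
# ${var//pattern/replacement} replaces every match.
first_replaced=${other/subdir/x}
all_replaced=${dashed//-/_}

printf '%s\n' "$stripped" "$untouched" "$first_replaced" "$all_replaced"
```

The `#` form is POSIX; the `/` and `//` forms are bash extensions (also found in some other shells).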
I'd really like if the "find" command supported this much easier, so if I write
find some/dir/here -name '*.gz'
then I could get the filenames without the "some/dir/here" prefix.
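One way to approximate this today, sketched with a scratch directory standing in for some/dir/here: pass the starting directory to a small inline script and strip it with parameter expansion. GNU find can also do this directly with `-printf '%P\n'`, but that is not portable.

```sh
#!/bin/sh
root=$(mktemp -d)
dir="$root/some/dir/here"
mkdir -p "$dir/sub"
: > "$dir/a.gz"
: > "$dir/sub/b.gz"

# $1 inside the inline script is the starting directory; the remaining
# arguments are the paths found. ${p#"$base"/} strips the prefix.
relative=$(find "$dir" -name '*.gz' -exec sh -c '
    base=$1
    shift
    for p in "$@"; do
        printf "%s\n" "${p#"$base"/}"
    done
' _ "$dir" {} + | sort)

printf '%s\n' "$relative"
rm -rf "$root"
```

The output is newline-delimited for readability only; for arbitrary names a NUL-delimited format would be safer.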
It would also be nice if "find" (and "stat") could output the full info for a file in JSON format so I could use "jq" to filter and extract the needed info safely instead of having to split whitespace separated columns.
I finally started really using my shell after switching to it. I casually write multiple scripts and small functions per day to automate my stuff. I'm writing scripts I'd otherwise write in python in nu. All because the data needs no parsing. I'm not even annotating my data with types even though Nushell supports it because it turns out structured data with inferred types is more than you need day-to-day. I'm not even talking about all the other nice features other shells simply don't have. See this custom command definition:
# A greeting command that can greet the caller
def greet [
  name: string      # The name of the person to greet
  --age (-a): int   # The age of the person
] {
  [$name $age]
}
Here's the auto-generated output when you run `help greet`:
A greeting command that can greet the caller
Usage:
> greet <name> {flags}
Parameters:
<name> The name of the person to greet
Flags:
-h, --help: Display this help message
-a, --age <integer>: The age of the person
It's one of those pieces of software that only empowers you, immediately, without a single downside. Except the time spent learning it, but that was about a week for me. Bash or fish is there if I ever need it to paste some shell commands.
Parsing, or the lack thereof, is not the point. The point is that standard shells already provide all the tools you need for dealing with lists of files. Want to do something for every file? Write this:
shopt -s nullglob
for f in *; do
…
done
But never this:
for f in $(ls); do
…
done
They look similar, but the latter runs ls to turn the list of files into a string, then has the shell parse the string back into a list. Even if the parsing was done correctly (and it isn’t), this is still extra work. Looping over the glob avoids the extra work.
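A small sketch of how the two loops diverge as soon as a name contains a space. The scratch directory and file names are invented for the demo.

```sh
#!/bin/sh
tmp=$(mktemp -d)
touch "$tmp/plain.txt" "$tmp/two words.txt"

# Glob: the shell produces one word per file, whatever the name contains.
glob_count=0
for f in "$tmp"/*; do
    glob_count=$((glob_count + 1))
done

# $(ls): the output is one string that the shell splits on whitespace again.
ls_count=0
for f in $(ls "$tmp"); do
    ls_count=$((ls_count + 1))
done

echo "glob saw $glob_count files, ls loop saw $ls_count words"
rm -rf "$tmp"
```

Two files go in, and the second loop iterates three times, because "two words.txt" is split into two words.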
I didn’t say that nushell is bad, I said that it’s not relevant to the discussion. nushell provides typed data in pipelines, which is cool. But standard shells already have typed data for this particular use case, thus parsing untyped data is unnecessary. Of course it would be nice if that typed data could be used in a pipeline, but everything had to start somewhere.
Posts like these are like the main character threads on twitter where someone says, "men don't do x" or "women aren't like y." It just feels like people outside of you who have no understanding of your context seem intent on making up rules for how you should code things.
Perhaps it would help to translate this into something more like, "what pitfalls do you run into if you parse `ls`" but it's hard to get past the initial language.
When we say "don't do X" we mean "the obvious way is wrong". If you have enough knowledge to ignore the advice, you likely are already aware of the problems with the obvious solution.
I'm pretty sure you can come up with scenarios where parsing the output of "ls" is indeed the simplest solution, but that kind of article is supposed to discourage people who don't know better from going "oh, I know, I'll just parse the output of ls". As general advice, people should indeed be pointed towards "man find" or "man 3 opendir".
I think there's a middle point where you want to do something that's complex enough that a glob won't cut it but simple enough that switching languages is not worth it.
I think the example of "exclude these two types of files" is a good case. I often have to write stuff like `ls P* | grep -Ev "wav|draft"` which doesn't solve a problem I don't have (such as filenames with newlines in them) but does solve the one I do (keeping a subset of files that would be tricky to glob properly).
In my experience 95% of those scripts are going to be discarded in a week, and bringing Python into it means I need to deal with `os.path` and `subprocess.run`. My rule of thumb: if it's not going to be version controlled then Bash is fine.
You might enjoy a variety of `find` based commands, e.g.
`find -maxdepth 1 -regextype posix-extended -iregex ".*\.(wav|draft)" | xargs echo "found file:"`
This uses regex to match files ending in .wav or .draft (which is what I interpreted you to want). Xargs then processes the file. You could use flags to have xargs pass the file names in a specific place in the command, which can even be a one liner shell call or some script.
So the "find <regex> | xargs <command>" pattern is applicable to almost any problem where you want to execute a one-liner on a number of files with regular names. (GNU find defaults to Emacs-style regexes, where the alternation would have to be written `\(wav\|draft\)`; the extended syntax has to be requested with -regextype.)
Definitely do it this way if you want to stick to the pre-filtered version (I recommend the cousin comment, filter inside the loop). GP's version is buggy in the same way as the post misunderstands, particularly with files that somehow got newlines in the filename (xargs is newline-delimited by default).
If for some reason you do need the "find | xargs" combo (maybe for concurrency), you can get it to work with "find -print0" and "xargs -0". Nulls can't be in filenames so a null-delimited list should work.
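A sketch comparing the default and the null-delimited form, counting how many arguments the command on the receiving end actually gets. The file names are made up; `-print0` and `-0` are assumed to be available, as they are in GNU and BSD tools.

```sh
#!/bin/sh
tmp=$(mktemp -d)
touch "$tmp/two words.png" "$tmp/plain.png"

# Default xargs splits its input on blanks and newlines, so the name with
# a space arrives as two separate arguments.
naive=$(cd "$tmp" && find . -name '*.png' | xargs sh -c 'echo "$#"' _)

# With NUL delimiters every path arrives as exactly one argument.
safe=$(cd "$tmp" && find . -name '*.png' -print0 | xargs -0 sh -c 'echo "$#"' _)

echo "naive: $naive arguments, null-delimited: $safe arguments"
rm -rf "$tmp"
```

Adding `-P` to the second form is where the concurrency mentioned above would come in.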
The latest standard I know of is SuS 2018, which I have the docs for, and does not include either switch. I searched around a bit and it doesn't seem like there is a new one. Are you referring to some draft? I sure wish this was true.
That being said, I would interpret "-exec printf '%s\0' {} +" as being a POSIX-compliant way for find to output null-delimited file names. I say this since the docs for the octal escape for printf allow zero digits. However, most POSIX tools operate on "text" input "files", which are defined as not having null characters. Thus I don't think outputting nulls could be easily used in a POSIX-compliant way. In practice, I would expect many POSIX implementations to also not handle nulls well because C uses null to mean end of string, so lots of C library calls for dealing with strings will not correctly deal with null characters.
>GP's version is buggy in the same way as the post misunderstands, particularly with files that somehow got newlines in the filename
I understand this caveat, but I never had a file with newline that I cared about. Everyone keeps repeating this gotcha but I literally don't care. When I do "ls | grep [.]png\$ | xargs -i,, rm ,," (yes, stupid example) there is 0% chance that a png file with a newline in the name found itself in my Downloads folder. Or my project's source code. Or my photo library. It just won't happen, and the bash oneliner only needs to run once. In my 20 years of using xargs I didn't have to use -0 even once.
See the other response I got, I misremembered (and waaay too late to edit) - it's whitespace, not newlines. I'm sure you've had files with spaces in the name.
>[..] arguments in the standard input are separated by unquoted <blank> characters [..]
As for -i, it is documented to be the same as -I, which, among other things, makes it so that "unquoted blanks do not terminate input items; instead the separator is the newline character."
It's not necessary to bring Python into it, Bash can handle filenames with weird characters properly if you know how to use it.
E.g. instead of `ls | grep -Ev 'wav|draft'`, you'd have to do something like
for filename in *; do
if grep -E 'wav|draft' >/dev/null <<< "$filename"
then : # ...
fi
done
Of course, it's more convoluted, but when you're writing scripts that might be used for a long time and by many people, it helps to know that it is possible to write robust things. Tools like shellcheck certainly help.
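The same filter can also be written without starting grep for every file, using `case` pattern matching inside the glob loop. This is only a sketch that mirrors `ls P* | grep -Ev "wav|draft"`; the file names are invented.

```sh
#!/bin/sh
tmp=$(mktemp -d)
touch "$tmp/Piano.wav" "$tmp/Plan draft.txt" "$tmp/Paper final.txt" "$tmp/other.txt"

kept=""
for path in "$tmp"/P*; do
    name=${path##*/}
    # Skip any name containing "wav" or "draft", like grep -Ev would.
    case $name in
        *wav*|*draft*) continue ;;
    esac
    kept="$kept$name;"
done

echo "$kept"
rm -rf "$tmp"
```

`case` is a shell builtin construct, so nothing is forked per file and odd characters in names never pass through a text stream.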
At that point I think you need to ask yourself why you're using Bash to begin with. If it's just meant to be a quick script that's run occasionally then this is good but probably overkill. If it's going into prod to be run regularly as part of something business-critical, then it should be in a language that has a less convoluted way to _ls a directory_. There's an inflection point somewhere in there; where it is depends on you.
Edit: I missed the herestring in the original code, so the above is wrong as mentioned in the comments; if your find has regex, you can use it to save one grep:
Otherwise you can call sh to printf the filename into a grep.
However, the point of my post is that find can perform seek, filter and execute, and should be used for all three unless it is really impossible (which is unlikely).
Before you write anything, you need to think about the cost of it breaking and the chance of it breaking, and Bash scripts in VC tend to maximize both. I like that heuristic a lot.
The title omits the final '?' which is important, because the rant and its replies didn't settle the matter.
Shellcheck's page on parsing ls links to the article the author is nitpicking on, but it also links to the answer to "what to do instead": use find(1), unless you really can't. https://mywiki.wooledge.org/BashFAQ/020
I guess this is for shell scripts that need to work with "unsafe" filenames?
I've been using Linux since 1999 and i never came across a filename with newlines. On the other hand, pretty much all "ls parsing" i've done was on the command-line to pipe it to other stuff in files i was 100.1% sure would be fine.
When teaching beginners shell, it's natural to teach `ls` for listing directory contents. It's also natural to extend from `ls` to `ls | ...` for processing lists of files
The important point to get across is that pipes let us build bigger commands from the commands we already know. If needed, you can back up later to teach patterns like `find [...] -exec`, `find [...] -print0 | xargs -0 [...]`, `find [...] | while read -r file; do [...] done` and so on.
There are all kinds of prerequisites to creating files with unusual names. Those barriers tend to mean beginners won't run into file name processing edge cases for a while. The exception will be files they download from the Internet. But the complexity there will usually be quote and non-ASCII Unicode characters, not newlines or other control codes.
In teaching, the one filename complexity I would try to get ahead of, preventively, is spaces. There was a time, way back when, when newbies seemed to expect to stick with short, simple filenames. These days the people I've helped tend to be used to using spaces in file names in Finder and Explorer for office or school work.
Not piping strings avoids this issue completely. Marcel’s ls produces a stream of File objects, which can be processed without worrying about whitespace, EOL, etc.
In general, this approach avoids parsing the output of any command. You always get a stream of Python values.
As long as Python does the right thing with globs, there is really no room for marcel to get it wrong. Not sure what additional validation you are thinking of.
The File object encapsulates a path. If you use it, e.g. to read contents, and the file doesn't exist, then it will fail with an appropriate error message.
Files come and go. References to them go stale. Every user and tool deals with this. This isn't a "validation" issue.
Using magic, I've renamed any files you have to remove control characters in the name and made it impossible to make any new ones. (You can thank me later.)
>But for whatever reason, when it is suggested, you get many people chiming in that "filenames should be dumb bytes, anything allowed except / !"
I guess the issue is not what filenames should be, but what filenames are. In general, when interacting with files, you have to expect everything but `/` and the null byte. Even if you forbid it on your machine, someone may mount an NFS drive and open you up to the world of weird filenames. And you never know who uses your code.
And the unicode itself is weird anyway - for example you may have normalised and denormalised names which may be the same or different string depending on how you look at them[1]. And I hope you are not planning to restrict filenames to some anglocentric [a-z0-9_- ]*, because the world is much larger and you can't pretend unicode doesn't exist.
I've put a file in my home directory named the entirety of your comment, newline included. Unfortunately I had to trim "except / !" to bring it to 255 characters.
Now at least when some tool or pipeline blows up horribly, it'll be hilarious.
I'm probably not the best person to ask, since the last time I touched Powershell, it was Windows only, but I'd say nushell is likely a lot more platform-agnostic, has sane syntax and follows a functional paradigm. Plugins are written in Rust. It's probably not worth it if all you do is Windows sysadmin work, as you'd have to serialize and deserialize data when interacting with Powershell from nu.
Last I looked, powershell's startup time on linux was disappointing. Understandable to an extent given it was bootstrapping a bunch of dotNET stuff that would already be there on windows.
But slow enough that I couldn't use or recommend it to my team.
Alternative shells or higher languages don't solve _all_ the issues.
I won't install a new shell to generate a file list on my CI server. I won't install a new shell on remote machines. Ever.
These structured shells also require commands to be aware of them, either via some plugin that structures their raw I/O output or some convention. They solve _some_ command output structuring but not _all_ the general problem.
So, the answer is good. It promotes the idea that one should be careful when machine parsing output meant for humans.
> I won't install a new shell to generate a file list on my CI server. I won't install a new shell on remote machines. Ever.
Uh... that's on you? Why do you intentionally hinder yourself?
> These structured shells also require commands to be aware of them, either via some plugin that structures their raw I/O output or some convention. They solve _some_ command output structuring but not _all_ the general problem.
Okay. It doesn't solve literally every single problem, that is true. It's still miles ahead. And when interfacing with non-pwsh commands, you just fall back to text parsing/output.
> Uh... that's on you? Why do you intentionally hinder yourself?
Hinder myself? An ephemeral cloud machine would not keep my custom shell anyway. By having to install it _every single time I connect_ I just lose precious time.
I want to be familiar with tools that are _already_ installed everywhere.
The shell is supposed to be a bottom feeder, lowest common denominator, barely usable tool. That way, it can build soon and get stable real fast. That (unintentional) strategy placed it as a core infrastructural piece... everywhere.
Of course, there's scripting and using it on the terminal. But we're talking about scripting, right? Parsing ls and stuff. I want the fast, lean, simple `dash` to parse my fast, lean, simple scripts. pwsh is fine for the terminal leather seats.
Isn't it ironic that Powershell from Microsoft is so much vastly superior than bash, not because it's great or even better than Python, but because bash is such a terribly low bar to beat, that it totally undermines the "Unix Philosophy"?
Who would have thought that little old Microsoft, purveyors of MSDOS CMD.EXE, would have leapfrogged Unix and come out with something so important and fundamental as a shell that was superior to all of Unix's "standard" sh/csh/bash/whatever shells in so many ways, all of which historically used to be and ridiculously still are touted by Unix Supremacists as one of its greatest strengths?
You see, Microsoft is willing to look at the flaws in their own software, and the virtues of their competitors' software, then admit that they made mistakes, and their competitors did something right, and finally fix their own shit, unlike so many fanatical monolinguistic Unix evangelists.
They did the exact same thing to Java and JavaScript, leaving Visual Basic and CMD.EXE behind in the dustbin of history -- just like Unix should leave bash behind -- resulting in great cross platform languages like C# and TypeScript.
Edit: that reinforces my point that taking so long to get there is a hell of a lot better than taking MUCH LONGER to NOT get there.
Maybe bash's legacy inertia is a problem, not a virtue. It certainly isn't getting a JSON parser in the foreseeable future. The ironic point is that even Microsoft's PowerShell has much less legacy inertia, and has therefore become so much better in a much shorter amount of time.
> Isn't it ironic that Powershell from Microsoft is so much vastly superior than bash
I agree that powershell is now better than bash. But it took SO LONG to get there. Moreover, bash has had a 12 year head-start (ok, 30 if you count earlier unix shells). Bash has legacy inertia. Even though you can now supposedly run powershell in linux, I don't know anyone who does. Does anybody?
That said, I think powershell is great for utility-knife uses on windows machines.
> Even though you can now supposedly run powershell in linux, I don't know anyone who does. Does anybody?
I do. I replaced all of the automation scripts on my rpi with pwsh scripts, and I'm not regretting it. Not having to deal with decades of cruft in argument parsing and string handling, learning little DSLs for every command, etc. is so worth it.
At this point, all PowerShell has accomplished is creating a separate ecosystem. The designers set out to make a "better" shell and yet refused to ever learn the things they were allegedly "improving".
But hey, that's a fixable problem, right? No, because PowerShell is so suffused with arrogance about its superiority that anything, no matter how simple it was to do in a UNIX shell, has to be cross-examined, re-imagined, and bent over the wheel of PowerShell's superiority, before ultimately getting ignored or rejected anyway.
PowerShell is a language unto itself. It is not a replacement for bash/zsh/etc because nobody who knows the latter well can easily migrate to the former, and that's by design.
I want there to be something better than the UNIX shells, at least when it comes to error handling and data parsing. PowerShell was supposed to be that tool, but it seems to have lost sight of that goal somewhere along the way.
If you're going to skip using the standard shell that is installed everywhere by default, then you should go ahead and use a full language with easily distributed binaries.
Many people turn to globbing to save them, which is usually better, but has some problems in case of no matches. But, for Bash, you can do this to fix it:
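For comparison only (this is Python, not a Bash fix): the no-match problem in the shell is that an unmatched pattern is left in place as a literal string by default, and as far as I know the usual Bash remedy is the `nullglob` option. Python's `glob.glob` sidesteps the issue by returning an empty list. A rough sketch, assuming a POSIX system and a hypothetical `matching_files` helper:

```python
import glob
import os
import tempfile

def matching_files(directory: str, pattern: str) -> list[str]:
    """Return sorted paths in `directory` matching `pattern`; empty list if none match."""
    # glob.escape keeps special characters in the directory name from acting as wildcards.
    return sorted(glob.glob(os.path.join(glob.escape(directory), pattern)))

with tempfile.TemporaryDirectory() as tmp:
    # Nothing matches yet, so the loop body never runs.
    for path in matching_files(tmp, "*.log"):
        print("unexpected:", path)

    # A name with a space and a newline is still a single list element.
    odd = os.path.join(tmp, "a b\nc.log")
    open(odd, "w").close()
    print(matching_files(tmp, "*.log") == [odd])
```

There is no string to split at any point, so the odd characters in the second file name never get a chance to cause trouble.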
I don't know, this seems like a lot of words to avoid coming to the conclusion that there are many ways to skin a directory.
Most of the time it's fine to just suck in ls and split it on \n and iterate away, which I do a lot because it's just a nice and simple way forward when names are well-formed. Sometimes it's nicer to figure out a 'find at-place thing -exec do-the-stuff {} \;'. And sometimes one needs some other tool that scours the file system directly and doesn't choke on absolutely bizarre file names and gives a representation that doesn't explode in the subsequent context, whatever that may be, which is quite rare.
A more common issue than file names consisting of line breaks is unclean encodings, non-UTF-8 text that seeps in from lesser operating systems. Renaming makes the problem go away, so one should absolutely do that and then crude techniques are likely very viable again.
I wonder if anyone has implemented a kernel module or something to limit filenames to a sane set. Just ensuring that they are valid UTF-8 and do not contain any non-printables would be a huge improvement. Sure, some niche applications might break, so it's not something that can be made the default, but I still think it would help on systems I control.
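Not a kernel module, but the rule itself is easy to state in userspace. A minimal Python sketch of the check (valid UTF-8, nothing non-printable); `is_sane_name` is a hypothetical name, and treating empty names as invalid is my own assumption:

```python
def is_sane_name(raw: bytes) -> bool:
    """True if `raw` is non-empty, valid UTF-8, and contains only printable characters."""
    try:
        name = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # str.isprintable() rejects control characters (including newline and tab)
    # and other non-printable code points; the ASCII space counts as printable.
    return bool(name) and name.isprintable()

for candidate in (b"report 2024.txt", b"bad\nname", b"\xff\xfe"):
    print(candidate, is_sane_name(candidate))
```

Something like this could run in a pre-commit hook or an upload handler; it only polices names you create, not names that arrive from a mounted drive.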
These sorts of pedantic exchanges are so pointless to me. We are programmers. We can control what characters are used in filenames. Then you can use the simplest tool for the job and move on with your life to focus on the stuff that actually matters. Fix the root cause instead of creating workarounds for the symptom.
The same information is already available in a machine-readable format. Just call readdir. You don’t need to run ls, have ls call readdir and convert the output into JSON, and then finally parse the JSON back into a data structure. You can just call readdir!
You’re still doing unnecessary work. You’re turning a list of files into a string, then parsing the string back into words.
Your shell already provides a nice abstraction over calling readdir directly. A glob gives you a list, with no intermediate stage as a string that needs to be parsed. You can iterate directly over that list.
Every language provides either direct access to the C library, so that you can call readdir, or it provides some abstraction over it to make the process less annoying. In Common Lisp the function `directory` takes a pathname and returns a list of pathnames for the files in the named directory. In Rust there is the `std::fs::read_dir` that gives you an iterator that yields `io::Result<std::fs::DirEntry>`, allowing easy handling of io errors and also neatly avoiding an extra allocation. Raku has a function `dir` that returns a similar iterator, but with the added feature that it can match the names against a regex for you and only yield the matches. You can fill in more examples from your favorite languages if you want.
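To add one more example to the list: in Python the corresponding abstraction is `os.scandir`. A minimal sketch, where `list_names` is a hypothetical helper and the choice to return raw bytes is mine:

```python
import os
import tempfile

def list_names(directory: str) -> list[bytes]:
    """Return the entry names in `directory` as raw bytes, sorted."""
    # Passing a bytes path makes scandir yield bytes names, so nothing is
    # decoded, printed, or parsed back along the way.
    with os.scandir(os.fsencode(directory)) as entries:
        return sorted(entry.name for entry in entries)

with tempfile.TemporaryDirectory() as tmp:
    for name in ("plain.txt", "two words.txt"):
        open(os.path.join(tmp, name), "w").close()
    print(list_names(tmp))
```

Each entry also carries methods like `is_dir()` and `stat()`, so there is rarely a reason to shell out to `ls` from Python at all.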
There is a glob() function you can use in POSIX C also to get an array of strings.
The getdents system call being used in the above program is the basis for implementing readdir.
It doesn't return a string, but rather a buffer of multiple directory entries.
The program isn't parsing a giant string; it is parsing out the directory entry structures, which are variable length and have a length field so the next one can be found.
The program writes each name including the null terminator, so that the output is suitable for utilities which understand that.
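The NUL-terminated convention is simple enough to sketch in a few lines of Python. This is only an illustration of the format, not the program discussed above; `write_names` and `read_names` are hypothetical helpers:

```python
def write_names(names: list[bytes]) -> bytes:
    """Serialise names with a NUL byte after each one."""
    return b"".join(name + b"\0" for name in names)

def read_names(stream: bytes) -> list[bytes]:
    """Split a NUL-terminated stream back into names."""
    # Every name ends with NUL, so the piece after the final NUL is empty
    # (or an incomplete trailing record) and is dropped.
    return stream.split(b"\0")[:-1]

names = [b"plain", b"with space", b"new\nline"]
print(read_names(write_names(names)) == names)
```

Because NUL is one of the two bytes that can never appear in a file name, the round trip is lossless even for names containing spaces or newlines.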
The problem is the phrase “suitable for shell pipelines”. If you are in a shell, you should not be doing anything like this. You should use a glob directly in the shell. You should not be calling an external program, having that program print out something, and then parsing it. Just use a glob right there in your shell script. If you do anything else, you are doing it wrong.
Right, globs are syntactic sugar on top of readdir. Definitely use them when you are in a shell. But in general the solution is to call readdir, or some language facility built directly on top of it. Calling ls and asking it for JSON is the stupid way to do things.
It doesn't require trying to organize a small revolution across dozens of GNU tools, many authors, and numerous distros...?
I'd love to see standard JSON output across these tools. I just don't see a realistic way to get that to happen in my lifetime.
Maybe a unified parsing layer is more realistic, like an open source command output to JSON framework that would automatically identify the command variant you're running based on its version and your shell settings, parse the output for you, and format it in a standard JSON schema? Even that would be a huge undertaking though.
There are a lot, LOT of command variants out there. It's one thing to tweak the output to make it parseable for your one-off script on your specific machine. Not so easy to make it reusable across the entire *nix world.
With regards to parted, if you only want to query for information, there is "partx" whose output was purposefully designed to be parsed. I have good experiences with it.
That doesn't solve the problem that bash is completely useless for manipulating JSON.
It certainly would make writing Python scripts that need to interact with other programs easier. But Python doesn't desperately NEED to interact with so many other programs for such simple tasks like enumerating files or making http requests or parsing json, the way bash does.
Then you have to install the new version of bash on every system where you depend on JSON parsing, negating the argument that bash is installed everywhere.
If bash was ever actually going to get json parsing in reality, it should have done that two decades ago like all the other scripting languages, since JSON is 23 years old. So don't hold your breath.
The bash code which creates the c file which gets the list of null terminated files in a directory and compiles it, and runs it, is easier to write and understand. Bash is a lousy language to do anything in, python is almost always available, and if not, then CC is.
Files and directories, once a reference to them is obtained, should not be identified by their path. This causes all kinds of problems, like the reference breaking when the user moves or renames things, and issues like the ones described in the article, where some "edge case" (and I'm using that term very loosely, because it includes common situations like a space in a file name) causes problems down the line.
You might say that people don't move or rename things while files are open, but they absolutely do, and it absolutely breaks things. Even something as simple as starting to copy a directory in Explorer to a different drive, and then moving it while the copy is ongoing, doesn't work. That's pathetic! There is no technical reason this should not be possible.
And who can forget the case where an Apple installer deleted people's hard disk contents when they had two drives, one with a space character, and another one whose name was the string before the first drive's space character?
Files and directories need to have a unique ID, and references to files need to be that ID, not their path, in almost all cases. MFS got that right in 1984, it's insane that we have failed to properly replicate this simple concept ever since, and actually gone backwards in systems like Mac OS X, which used to work correctly, and now no longer consistently do.
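For what it's worth, here is a rough Python sketch of what referring to a file by identity rather than by path can look like. It assumes a POSIX system, where the (device, inode) pair and an open handle both survive a rename; `file_id` is a hypothetical helper, and this is not a claim about how MFS did it.

```python
import os
import tempfile

def file_id(path: str) -> tuple[int, int]:
    """Return (device, inode), which identifies a file independently of its name."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

with tempfile.TemporaryDirectory() as tmp:
    old = os.path.join(tmp, "draft.txt")
    new = os.path.join(tmp, "final draft.txt")
    with open(old, "w") as f:
        f.write("hello")

    before = file_id(old)
    with open(old) as handle:    # the handle refers to the file, not to the path
        os.rename(old, new)      # rename while the file is open
        content = handle.read()  # still readable through the handle
    print(content, file_id(new) == before, os.path.exists(old))
```

The catch is that the identity is only useful to a process that already holds it; anything stored in a script or config file still has to start from a path.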
IDs don't really solve many problems. The issues with scripts removing all your files were either caused by the absurd bash spaces and quotes rules, or by bash silently ignoring nonexistent variables. Those scripts would still need paths, since the ID of ~/.steam will be different for everyone. Scripts that need to work on more than one system, and human-authored config files, would still have paths. There are cases where you want to depend on the path, not the identity of the folder, and potentially swap the folder with something else without editing configuration.
Explorer needs to support local drives, with a lot of filesystems, including possibly third-party ones, but also network drives, FTP, WebDAV, and a bunch of other niche things. Not all of them have IDs and might not be possible to be extended. The cost is massive, solving it everywhere is impossible, and the benefit seems negligible to me (even though I fairly recently managed to eject a disk image (vhdx) in the middle of copying files onto it…)
Earlier versions of Mac OS had APIs to retrieve the IDs of directories and files relevant for things like installing applications (such as the System directory). It effectively never used paths to identify any files; if users opened a file, they'd use the system file picker, which would provide the application a file ID, not a path.
Similarly, things like config files would be identified by their name, not their path, because the directory containing configs was a directory the system knew about. As a result, no application needed to know the path to its own config files.
This meant there was no action that the system prevented you from doing to an open file, other than actually deleting that file. There was also no way for an installer to accidentally break your system because its code didn't take your drive, file, or directory names into account.
And, of course, there are file systems that don't use paths at all, like HashFS, a bunch of modern document management systems, or the Newton's Soup.
I get your point about interoperability with existing file systems, but I think it's perfectly acceptable to offer better solutions where possible, and fall back to paths for situations where that is not possible.
This is a problem I faced recently on Linux. You can use ip addr to see the list of your IPv6 addresses and their types (temporary or not, etc). But doing it programmatically from a non-C codebase is way more involved.
This one's a hard one. Since "--version-sort" isn't standard anyway, let's assume we can use flags which are common to BSD and GNU. Furthermore, let's assume bash or zsh so we can use "read -d ''".
Yay! Glad to see zero termination flags in more places.
EDIT: The linux manpages I read were from die.net, which it looks like were from 2010, guess I'll have to avoid them in the future. I checked FreeBSD, OpenBSD, and Mac man page to make sure, and unfortunately none of them support the -z flag yet.
This is great. I `del mydocu~1` to kingdom come! Thank you. I meant in my home dir. I would never dare to presume anything about the rest. But it looks like Google demands that I be good at my job too.
i searched through the page and have not found `find ... -printf "%M %n %u %g %s ...\0"` mentioned. this way you get ls(1)-like output, yet machine-parseable.
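The same ls-like fields can also be had without any text parsing. A minimal Python sketch using `os.lstat` and `stat.filemode`; the `describe` helper and its field names are my own, and it reports numeric uid/gid rather than the user and group names that `%u` and `%g` print:

```python
import os
import stat
import tempfile

def describe(path: str) -> dict:
    """Return ls -l style fields for `path` as structured data."""
    st = os.lstat(path)  # lstat describes a symlink itself rather than its target
    return {
        "mode": stat.filemode(st.st_mode),  # symbolic permission string
        "nlink": st.st_nlink,
        "uid": st.st_uid,
        "gid": st.st_gid,
        "size": st.st_size,
        "name": os.path.basename(path),
    }

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "two words.txt")
    with open(path, "w") as f:
        f.write("hello")
    print(describe(path))
```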
Now of course having scripts and pre-commit hooks enforcing simple rules so that files must only use a subset of Unicode are a thing and do help.
Do you really think that, say, all music streaming services are storing their songs under names that allow Unicode HANGUL fillers and control characters that change the text direction?
Or... Maybe just maybe that Unicode characters belong to metadata and that a strict rule of "only visible ASCII chars are allowed and nothing else or you're fired" does make sense.
I'm not saying you always have control on every single filename you'll ever encounter. But when you've got power over that and can enforce saner rules, sometimes it's a good idea to use it.
>Some front end clown is about to suggest all tools should output json by default aren’t they
This unironically sounds good (and, in case this matters, I'm not a front end "clown", but a reverse engineer who mostly uses C and Python). Unified formatted output from command line tools is a thing that is severely missing from the Unix ecosystem.
JSON is maybe a bit heavy, but using a machine-readable format such as TSV or CSV (including configuring your terminal emulator to properly display it) would be a big step up from the status quo.
If ls returned JSON, it would have to decide what to do with non-UTF-8 filenames (or even users and groups; I do not know what the rules are there). It could return either "filename.txt" or {"encoding":"base64", "data":"<base64 blob>"} to obviate the problem, but that is not a very elegant solution.
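A minimal Python sketch of that two-shape scheme, to show what a consumer would have to deal with. The helper names are hypothetical and no real `ls` emits this:

```python
import base64
import json

def name_to_json_value(raw: bytes):
    """Return a plain string for valid UTF-8 names, else a tagged base64 object."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return {"encoding": "base64", "data": base64.b64encode(raw).decode("ascii")}

def json_value_to_name(value) -> bytes:
    """Invert name_to_json_value, recovering the original bytes."""
    if isinstance(value, str):
        return value.encode("utf-8")
    return base64.b64decode(value["data"])

names = [b"filename.txt", b"caf\xe9.txt"]  # the second is Latin-1, not valid UTF-8
encoded = json.dumps([name_to_json_value(n) for n in names])
decoded = [json_value_to_name(v) for v in json.loads(encoded)]
print(encoded)
print(decoded == names)
```

The round trip is lossless, but every consumer now has to handle two shapes for the same field, which is the inelegance in question.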