1. Shell-escaping all values used to construct commands
Avoid shell-escaping and the shell altogether! Use list-form program execution wherever possible.
In Perl:
open(my $fh, "-|", "find", $dir, "-type", "f", "-print0") or die;
In Python:
_, f = os.popen2(["find", dir, "-type", "f", "-print0"])  # popen2 returns (child_stdin, child_stdout)
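On modern Pythons, where os.popen2 is deprecated, subprocess gives the same list-form execution; a minimal sketch:

import subprocess
# The argv list goes straight to exec: no shell, so dir needs no escaping.
out = subprocess.check_output(["find", dir, "-type", "f", "-print0"])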
If you're executing a pipeline, you do need to get the shell involved. But I'd probably write the author's examples like so (once again avoiding the need to shell-escape arguments):
f = os.popen2(["sh", "-c", \
'find "$0" -type f -print0 | xargs -0 grep foo | wc -l', dir])
2. Prefixing each multi-command pipeline with “set -o pipefail;”
Alas, pipefail is not in POSIX /bin/sh, which many of us prefer for shell scripting. It's in bash and ksh though.
3. Explicitly checking for failure after each shelled out command
Valid. You should always do this. In shell scripts, "set -e" is a good thing, although pipelines without "set -o pipefail" are still a problem.
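When shelling out from Python rather than from a shell script, the moral equivalent of "set -e" is checking every exit status, or letting check=True do it for you; a minimal sketch (some_dir is just a placeholder):

import subprocess
# check=True turns any nonzero exit status into CalledProcessError,
# so a failed command cannot pass silently.
subprocess.run(["mkdir", "some_dir"], check=True)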
The article does not address another significant point: filenames should also be classified as untrusted input. Using -- to separate flags from arguments can help:
/tmp$ mkdir x
/tmp$ cd x
/tmp/x$ echo hello > a
/tmp/x$ grep hello *
hello
/tmp/x$ touch -- -q
/tmp/x$ grep hello *
/tmp/x$ grep hello -- *
a:hello
/tmp/x$
It's rare to see protection against this problem either when shelling out or in native shell scripts.
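With list-form execution the fix is just one more argv element, but you still have to remember it; a sketch (untrusted is a hypothetical attacker-chosen name):

import subprocess
untrusted = "-q"  # a filename that looks like an option
# "--" ends option parsing, so grep treats untrusted as a file, not a flag.
subprocess.run(["grep", "hello", "--", untrusted])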
* hsh (https://github.com/jgoerzen/hsh/wiki) makes pipelines in Haskell, using operators so they really look like pipelines, but without involving the shell. Example: "ls -l" -|- "wc -l"
My mistake: not knowing Python's variable interpolation well, I thought that the $0 was expanded by it, not by the shell. Which, if it were the case, would indeed be vulnerable.
Isn't it a bit harsh to downvote to -1 a comment which links to two completely on-topic libraries?
No. Shell scripts have the same problem: you're not separating options from filenames, and you're often not preventing a single user-supplied argument from expanding into multiple final arguments.
The list-form shell commands referenced earlier are closer, but still don't separate filenames (input) from arguments unless you remember to add "--".
But they do prevent things like pipes, which is a big deal.
I remember ages ago getting in an argument with the kernel hackers on unix-wizards. My position was that filenames shouldn't be allowed to have meta-characters in them because it made shell programming difficult and dangerous. Their argument was that the spirit of Unix was that the kernel should be as simple as it possibly could be, and that you should "Live free or die". Therefore it was not the kernel's business to dictate what characters should be able to be in a filename.
They were wrong of course. The true spirit of Unix was to enable programmer productivity by being able to easily piece together new programs out of reusable components. E.g., via shell scripting, and the kernel hackers' notions about filenames and living free was directly at odds with this true spirit.
The OP claims that they've solved this dilemma in Julia. I await with bated breath.
Back in the day, there were only two shells: The Bourne Shell and the C Shell. And before that there was only whatever the Bourne Shell was based on.
In any case, the answer to this is that the kernel developers and the shell developers get together and decide on something, and then future shells have to live within those constraints. And at the point they decided, they should have made sure the scheme was flexible enough for future needs.
Personally, I would have been fine if filenames were restricted more or less to the same degree that variable names are restricted in, let's say, C. Why can't I name my variables whatever I want to? I thought it was "Live free or die!"?!?!?!?!
YES! And while we are at it, let's get rid of SQL injections by forbidding spaces and quotes in [VAR]CHAR fields!
Limiting valid file names won't protect against injections, and won't make such protection easier (asserting the absence of special characters isn't significantly easier than escaping those same characters), but it will cripple system functionality. Personally_comma_I_would_hate_seeing_such_file_names.
This discussion has ABSOLUTELY NOTHING to do with SQL injection. That is a completely different issue, as you may indeed want and need to put all of War and Peace into a char field, but there is no valid use case for putting all of War and Peace into a filename.
Furthermore, it was quite traditional for Unix filenames to be written exactly as you have done above. (Although dashes would have been more commonly used than underscores.) And, in fact, it was considered fairly anti-social to put spaces, carriage returns, etc., in your filenames, as Unix utilities could not cope with them.
The fact that for decades Unix utilities could only cope with filenames written as you have above, proves my point, not yours. Even now, when Unix utilities are finally getting to the point where they might be able to cope with any combination of characters in a filename (other than null or "/"), this has come at great extra complexity and requiring much extra care on the part of the programmer. Which again proves my point, not yours.
Now, of course, whenever anyone asserts that something requires too much care, someone else will quite wrongly assert that programmers should not be encouraged to be lazy. That real programmers should work for a living. This is, of course, absurd. The best programmers will always be lazy... in the right way. They will want to solve a problem that accomplishes the most, in the least amount of time, with the least amount of code. I.e., they will want to be productive. People who make quips about lazy programmers are doing a huge disservice to the world by promulgating a world with less productive programmers who consequently accomplish less. Which means that, for all you know, cancer won't be cured when otherwise it might have been.
As to your proposed solution of having the shell escape special characters, rather than having the kernel disallow them, that could probably work too, at the cost of consistency. I.e., when you look at a filename in the shell, it's going to look different from how it will look in your GUI browser. E.g., the shell is going to display your filename in the manner that you have done, only it probably won't do as good a job as you would at encoding your intent into a limited character set.
And, if you are going to take this approach, it doesn't have to be by allowing special characters in filenames. It can be done either by encoding special characters in non-special characters, or by having a display name that is different from the filename.
Additionally, I find it amusing when people assert that having limitations in filenames is crippling system functionality, and yet they don't make the same assertion about identifier names in programming languages. Oh no! Having a limited set of characters in variable names has crippled my ability to program... NOT!
I am not in any way against file naming conventions; I am against enforcement of a convention specific to one use case (programming) in a general-purpose system.
When a shell script is written for some specific task, you can rely on convention and receive all the productivity benefits even without kernel enforcement. If somebody creates files with carriage returns in a source tree because the kernel doesn't stop him, the problem is social, not technological.
> I find it amusing when people assert that having limitations in filenames is crippling system functionality, and yet they don't make the same assertion about identifier names in programming languages.
A programming language has a narrower usage field than an operating system. Naming a variable "Мой любимый щеночек (01.02 12:34).jpg" ("my favorite puppy") is absurd. Having a file with such a name is perfectly reasonable.
And this is actually my primary point: file names in general are not program internals. They are part of user data and should be treated as such.
> As to your proposed solution of having the shell escape special characters, rather than having the kernel disallow them
Not really what I meant. I was saying that a program (be it a shell script, or an application calling system()) that intends to work on arbitrary, user-provided file names won't benefit from kernel-enforced limitations. "process $FOO" won't become protected from misuse and exploits just because special characters are forbidden; the application will still have to check for "bar; rm -rf .", and checking and rejecting that is not harder than replacing it with "./bar\;\ rm\ -rf\ .". It's just calling escape_file_name instead of validate_file_name (see the sketch after the list below).
So:
1) Any productivity benefits provided by kernel file name limitations can be had by convention (which is what the UNIX world does).
2) Such limits won't make anything safer. Building a shell command by blind concatenation of user-provided data will still be unsafe. If the user is trusted, see case 1.
3) Files are used not only by programmers. Imposing such limits will either degrade the user experience or lead to display name !== actual file name, leading to indirections and kludges much worse than touch -- "$FOO".
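For concreteness, here is roughly what those two calls could look like in Python; both function names are illustrative, and the whitelist is just one possible policy:

import re
import shlex

def validate_file_name(name):
    # Whitelist policy: reject anything outside a conservative character
    # set, plus names that could be mistaken for options.
    if not re.match(r'^[A-Za-z0-9._-]+$', name) or name.startswith('-'):
        raise ValueError('suspicious file name: %r' % name)
    return name

def escape_file_name(name):
    # Escaping policy: make any name safe to splice into a shell string.
    return shlex.quote(name)  # Python 3.3+; pipes.quote on Python 2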
> If somebody creates files with carriage returns in a source tree because the kernel doesn't stop him, the problem is social, not technological.
Your position boggles and dismays me. I have seen so many heinous bugs that appear only intermittently, and are nigh impossible to track down, due to this kind of issue. The problem absolutely positively is not social. It is technical. The only social thing about it is that people persist in taking the wrong side on this issue.
As to having different rules for filenames in different places, that is just nuts. Programs should not be fragile, and shouldn't have subtle edge cases. Having software work that way has all sorts of downsides and hidden costs. E.g., people need to remember a lot more. More documentation is needed. Things go wrong when they didn't have to. All of this costs time and money and helps to sap enthusiasm as people track down chimeras they shouldn't have had to.
Furthermore, one of the prime use cases for scripting is by system administrators, and such scripts need to handle all files. The stories of sysadmin scripts that have run afoul of files with strange filenames are legendary.
Regarding your example with "process $FOO": that's a complete red herring. You might as well assert, "We can't solve everything, so we should solve nothing." In this particular case, we were talking about the problems caused by filenames that are hard to deal with in a scripting environment, not about programs directed by user input. The first problem is easily solvable once and for all, while the second problem is less so and will always require care. Just because some things require great care does not mean that we should make all things require great care.
I just can't fathom that there are still people who actually argue for a world that fosters subtle bugs and lack of robustness. It is downright wrong, and it may someday be our undoing. Quite literally.
Spaces (parentheses, semicolons, bangs, etc.) in file names are not subtle edge cases if you consider the system as a tool for reaching users' goals. Programs have to process file names with spaces not because of kernel aesthetics, but because users want and need files with normal, readable names.
Actually, I would happily agree to ban \n as an edge case: it's useless for the end user, and a readable filename separator is needed for scripts (like \0, which is forbidden because it is useless for users and extremely inconvenient to work with in C).
What if users want rich text in their filenames? Why shouldn't they have the ability to do that? And certainly they want slashes in their filenames! But Unix doesn't give them that either. Horrors!
What people want most of all is reliable, robust software. Features that don't work right are worse than no feature at all. What you fail to consider is that every feature has a cost. In this case, the cost was WAY too high. If this cost is to be paid, then it should have been paid in a lower-cost manner.
Contrary to what you say, I'm perfectly sure that users would have dealt fine with more limited filenames. In fact, they did quite fine with 8.3 filenames for many years. I must concur, however, that those were more limiting than humans should be forced to adapt to.
This being said, I have nothing against giving people the ability to have all of these things in the display name for a file, if it is deemed that the extra flexibility is worth the trouble. This extra flexibility just shouldn't be in the unique identifier for a file. There are perfectly good ways to provide this capability in a manner that has far fewer costs.
Alternatively, I'm not opposed to adopting the attitude of the kernel hackers and shifting the burden onto the shells to generate such meta-character-free identifiers from richer display names. But if that was the way it was to be, it would have been essential that a standard library for generating such unique identifiers from display names be created, and that the shells uniformly use this library.
> There are perfectly good ways to provide this capability in a manner that has far fewer costs.
Not really. It's either some specialized tools (throwing away all environment uniformity benefits) or another layer of indirection (display name -> real name -> inode), with its share of bugs (and having two close, often equal, but different identifiers won't make programming any less error prone).
> shells to generate such meta-character-free identifiers from richer display names
You still need to pass rich display names to the shell, so the old problems are still there, and, on top of that, a consistent mapping of display names to real ones is required.
> In this case, the cost was WAY too high
-- "$FOO" instead of $FOO, and ls -1 instead of ls?
(Not accounting for \n here, because in that case I agree on its abysmal benefit/cost ratio and on banning it.)
> if it wants to interoperate with filesystems created by other operating systems?
That, of course, was not an issue back in the day, as Unix did not in fact support any such thing.
As for today, please do tell me how Unix can interoperate with filesystems that allow nulls and/or forward slashes in filenames. I'm sure that you might come up with an existing or potential solution, and well, there's your answer!
> The OP claims that they've solved this dilemma in Julia. I await with bated breath.
The obvious solution would seem to be to provide an easy way to directly create a pipeline from within Julia. I.e. you use syntax similar to shell syntax, but rather than spawning a shell Julia spawns the desired processes directly. That way it can directly pass arguments to execv (avoiding escaping issues altogether), wait on all of the child processes individually (allowing clean and detailed error detection), and avoid the overhead of actually starting a shell.
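Whatever Julia actually does, the shape of that design can be sketched with Python's subprocess: spawn each stage directly (argv lists go straight to execv), connect the pipes by hand, and wait on every child:

import subprocess

# find | xargs grep | wc, with no shell anywhere.
p1 = subprocess.Popen(["find", ".", "-type", "f", "-print0"],
                      stdout=subprocess.PIPE)
p2 = subprocess.Popen(["xargs", "-0", "grep", "foo"],
                      stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close()  # let p1 receive SIGPIPE if p2 exits early
out = subprocess.check_output(["wc", "-l"], stdin=p2.stdout)
p2.stdout.close()
# Every exit status is visible, unlike $? after a shell pipeline.
# (Note grep exits 1 when nothing matches; real code may special-case that.)
for p in (p1, p2):
    if p.wait() != 0:
        raise RuntimeError("pipeline stage failed: %r" % p.args)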
I think the silent failures are the worst of the three by far.
Also, related: very few Unix commands have a consistent, easy-to-parse output format.
I have a dream -- a Unix-like OS with a strong convention (or even enforcement!) of file typing and format. Optional, perhaps.
The efficiency aspect is often exaggerated. As long as you do a fair bit of work in the spawned command, the fork/exec cost disappears to nothing. Also, shelling out should be rare in any decent language.
The efficiency aspect can creep up on people easily though, regardless of language. It's the standard concern: you don't know how your code will be used in future.
Have you used Microsoft PowerShell? It (indirectly) does a useful portion of what you're dreaming about. It's a UNIX-like shell where commands pipe .NET objects instead of plain text. Just as UNIX commands take and produce arbitrary text but most treat it as a sequence of lines, PowerShell commands can take and produce arbitrary .NET objects, but most use an object which is a collection of records.
PowerShell comes with builtin commands for most of the simple UNIX utilities, and you build new commands in any .NET language. Additionally, you can interop with UNIX-style programs that produce plain text by using a parsing filter (there are a few builtin; you can make more).
It's actually pretty good, and a lot more regular and modern-feeling than any UNIX shell I've tried. I still prefer to use GNU/Linux, but wish there were a good port of PowerShell...
This article is nonsense. In libguestfs[1] we invoke shell commands for many filesystem utilities ... from C, which (if you believe the article) would be the worst place to do this.
All of the problems are avoided by having a smart way to run external commands and capture their errors. See the functions [2] and examples of use [3] [4].
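I haven't reproduced those functions here, but the shape of the idea is easy to sketch in Python (run_command is a made-up name, not the libguestfs API):

import subprocess

def run_command(argv):
    # Run argv directly (no shell), capture both streams, and turn a
    # nonzero exit status into a loud exception with stderr attached.
    p = subprocess.Popen(argv, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    out, err = p.communicate()
    if p.returncode != 0:
        raise RuntimeError('%s failed (%d): %s'
                           % (argv[0], p.returncode, err))
    return out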
I can't really agree with the article. I use Ruby a lot for small one-off utility programs that might be used for a day to get something done. Using the back tick quotes to run external processes and capture stdout, I can solve real problems in a few lines of code.
I don't think that I have ever been bitten by the "silent failure" problem mentioned in the article.
A little off topic, but I don't throw away small one-off utility programs when I am done with them. Instead I rename them to long and meaningful file names, toss them in a special directory, and save even more time in the future by hacking on them for similar purposes.
I may be typical: although I usually design and implement robust systems with a long life, I also work a lot with data and this requires a lot of sloppy little bits of code to change data formats, get statistics, etc.
I do agree that shelling out in production code would suck.
The trouble is that so much production code starts as a one-off. Worse still, in a startup environment there's often not much distinction between production code and one-off code.
The article is quite correct, of course. But this kind of bug gets tiresome. We saw the same kinds of troubles with SQL for years. And we'll see it again the next time we get a popular technique that involves code that generates code using outside input, and passes it as plain text.
So we need a new rule: If you're going to provide an interface that allows programs to generate code and pass it to something, then you need to deal with these problems proactively. You can provide sanitizing/escaping functionality. Or you can avoid passing code as plain text. Or you can do something else just as good. And proper error handling needs to be the default.
But if you don't address these issues, then your interface is lousy, and it deserves to be treated as such.
Yeah, I kind of figured Haskell would have solved this problem in a way that makes the typical syntax do the right thing (shell-quote the interpolated variables) and then provides an override for those few cases when you're doing something weird.
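In Python terms, that would be a helper whose default path shell-quotes every interpolated value; a sketch (the sh() helper below is hypothetical, not a real library function):

import shlex
import subprocess

def sh(template, *args):
    # Every interpolated value is quoted by default, so the "typical
    # syntax" is the safe one; raw interpolation would be the override.
    cmd = template.format(*(shlex.quote(str(a)) for a in args))
    return subprocess.check_output(cmd, shell=True)

# Usage: the quoting happens for you.
# sh("find {} -type f | wc -l", some_dir)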