If an API has a problem, you fix the API. If necessary, you release a free library to back-port the improved API to older OS versions as well. You come up with a fixed version, make it as convenient as possible, call it the “new standard”, officially deprecate the alternatives, and then write a blog post. Except the blog post would only need two lines of sample code showing how easy it is to work around the problem now.
Not this. Frankly, few developers will even know about the need for careful coding such as this, and even fewer will actually do it because it will muck up each and every program with dozens of lines of extra stuff to work around a deficient part of the PLATFORM.
I'm absolutely amazed that Windows doesn't offer a process spawning API that takes an array of strings as arguments[0], if only because that's exactly how a C program expects them anyway.
The problem is that parsing the command line isn't done by the operating system at all, but delegated to each individual console application. If you made a new CreateProcess API that took an array of command line arguments, the operating system would need to serialize the array into a single string that could be passed to legacy commands. Unfortunately, there's no way to tell what weird parsing is buried inside the commands, so there will always be gaps in what you can express in the serialized command line.
For example, suppose the console app thinks that apostrophes should be treated as quotes. I pass "'x", "x'" into the new API, and that gets serialized to "'x x'" (two strings to most applications), but this particular app interprets that as one string that says "x x". The OS can't even escape the apostrophes to avoid this, because it doesn't know what language the console application speaks.
PowerShell had to deal with this problem because native command lines need to be rehydrated from its AST before they can be passed to CreateProcess, and (IIRC) they ultimately had to add an operator that means "everything after this point in the command line should be passed to the command verbatim" to cover all of the corner cases from this.
These APIs do exist, they just don't work that way:
> These functions appear to be precisely what we need: they take an arbitrary number of distinct command line arguments and promise to launch a subprocess. Unfortunately and counter-intuitively, these functions do not quote or process these arguments: instead, they’re all concatenated into a single string, with arguments separated by spaces
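To make the quoted pitfall concrete, here is a minimal sketch (child.exe is a hypothetical program that just prints its argv; nothing here is an endorsement of these functions):

#include <process.h>

int main(void)
{
    // Intended: the child sees argv[1] == "hello world".
    // Actual:   the CRT builds the command line "child.exe hello world" with
    //           no quoting, so a conventional child sees argv[1] == "hello"
    //           and argv[2] == "world".
    return (int)_spawnl(_P_WAIT, "child.exe", "child.exe", "hello world", (char *)NULL);
}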
When you get to the Windows kernel, the command line is a single PWSTR. Full stop.
Any API or C program main() running on Windows that suggests anything else is a fiction maintained by the C runtime, which parses the single string the kernel gave it into a string array on one side, or concatenates an array back into a single string on the other.
Well, it's actually a UNICODE_STRING. ;-) The limit on the length of the command line comes from the range of the Length field of the UNICODE_STRING structure. (NT uses Pascal-style strings internally.)
NT's native process creation functionality is powerful, but baroque: see [1]. There's a ton of stuff that processes can be passed in addition to the command-line. One trick that's not well-known is that CreateProcess allows parent processes to pass an opaque binary blob to subprocesses via the lpReserved2 member of the STARTUPINFO structure. Cygwin uses this blob to pass information about file descriptors, ttys, and other POSIX context; this information block bootstraps Cygwin's fork implementation. The Microsoft C runtime uses it for a vaguely similar purpose: it's how file descriptor inheritance works when neither NT nor Win32 know anything about file descriptors (which are private to libc).
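For the curious, a minimal sketch of the mechanism described above (error handling trimmed; how the blob is laid out is purely a private contract between parent and child, as with Cygwin and the CRT):

#include <windows.h>

BOOL spawn_with_blob(wchar_t *cmdline, const void *blob, WORD blob_size)
{
    STARTUPINFOW si;
    PROCESS_INFORMATION pi;

    ZeroMemory(&si, sizeof si);
    si.cb = sizeof si;
    si.cbReserved2 = blob_size;        // size of the opaque blob
    si.lpReserved2 = (LPBYTE)blob;     // pointer to the opaque blob

    if (!CreateProcessW(NULL, cmdline, NULL, NULL,
                        TRUE,          // inherit handles, as Cygwin and the CRT do
                        0, NULL, NULL, &si, &pi))
        return FALSE;

    // The child can fetch the same STARTUPINFO with GetStartupInfoW()
    // and read the blob back out of lpReserved2/cbReserved2.
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return TRUE;
}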
I was a dev in Windows from 2008 to 2011, so I'm not sure you're aware that you're replying to someone who is already a big fan of the NT native API (and not so much a fan of the crude hack that is Windows CRT file descriptors, like most things in the MS CRT...). I did mean to type PWSTR, in full awareness that I'm using it as a figure of speech for UNICODE_STRING.
The terrible thing is that there already is an API here that everyone else on UNIX is using successfully - spawn a process with an argv of null-terminated strings and they turn up in the argv of the spawned process.
Microsoft have just chosen not to make it work like that because that would involve admitting they were wrong.
Exactly. While they can't fix the old functions without potentially breaking software that relied on the old behavior, they can introduce a new set of process-launching functions that do the right thing: handle arguments exactly as written, no splitting, no quoting, no metacharacter interpretation.
Then, for new OSes, write a compatibility layer to implement the old functions on top of the new ones, and for old OSes, write a compatibility library to implement the new functions by quoting and passing to the old ones.
Mark the old functions as "deprecated, do not use in new code", and point to the new ones.
This is something that is important. The steps to go about it:
1. Fix the old solution
2. Implement an alternative that is feature-complete, and document it
3. Write compatibility layers
If people followed this, life as a software developer would be easier on the blood pressure. I remember fondly working with many APIs that were deprecated but had no alternative, and the devs admitted it.
They even copied that broken system into PowerShell "functions" (proper functions are expected to return only explicit results; these return every piece of output they gather, a step backwards from structured programming).
I've never understood the Win32 platform team's resistance to adding an ArgvToCommandLineW function to mirror the longstanding CommandLineToArgvW[1] function.
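For what it's worth, here is roughly what the per-argument half of such a function could look like. To be clear, this is a sketch of a hypothetical helper, not a real Win32 export; it follows the parsing rules of CommandLineToArgvW and the Microsoft C runtime, so a target program that parses its command line some other way, or cmd.exe metacharacters, would still need separate handling. Buffer sizing is left to the caller; *out is advanced past the text written so the caller can append a space and the next argument.

#include <wchar.h>

static void quote_one_arg(const wchar_t *arg, wchar_t **out)
{
    wchar_t *p = *out;

    // Arguments with no whitespace or quotes can be passed through as-is.
    if (*arg != L'\0' && wcspbrk(arg, L" \t\n\v\"") == NULL) {
        wcscpy(p, arg);
        *out = p + wcslen(arg);
        return;
    }

    *p++ = L'"';
    for (const wchar_t *s = arg; ; ++s) {
        size_t backslashes = 0;
        while (*s == L'\\') { ++s; ++backslashes; }

        if (*s == L'\0') {
            // Double trailing backslashes so the closing quote stays a quote.
            for (size_t i = 0; i < backslashes * 2; ++i) *p++ = L'\\';
            break;
        } else if (*s == L'"') {
            // Double the backslashes, then backslash-escape the quote itself.
            for (size_t i = 0; i < backslashes * 2 + 1; ++i) *p++ = L'\\';
            *p++ = L'"';
        } else {
            // Backslashes not followed by a quote are literal: copy unchanged.
            for (size_t i = 0; i < backslashes; ++i) *p++ = L'\\';
            *p++ = *s;
        }
    }
    *p++ = L'"';
    *p = L'\0';
    *out = p;
}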
That is a bit more difficult than usual in this case, because it's not just one API, it's two working in unison: the one used by the calling process to pass arguments, and the one used by the called process to receive them.
At present, they're both "everything in one big string"; if an alternative "array of strings" API were added at both ends, you'd need to come up with shims for when the caller passes a string & the callee expects an array, and vice-versa. It's not immediately obvious how you'd do that in a way that works reliably in all cases, especially considering the callee currently has total freedom to parse its command-line however it likes.
Looking now at how Windows handles console application arguments, it sure looks broken. But you have to put your mindset back in circa 1990 and think about what Windows applications looked like back then, and what model Microsoft was betting on. Arguments were passed via DDE[0], and then later all the bets were on OLE[1] and finally COM[2]. System components were accessed all the time via in-process DLLs communicating with services over LRPC[3]. In this world, the command line, the pipe philosophy and the 'less is more' mindset were not only not welcome, they were the adversary.
Even when it was finally acknowledged that the command shell needs some love too, the answer was PowerShell, which yet again defined an object interface between cmdlets[4].
You know, in all these years I've never bothered to find out how DDE worked, having used COM instead, and now I find: https://msdn.microsoft.com/en-us/library/ms648774.aspx and it's a terrifying abomination built on wparam/lparam.
> In this world, the command line, the pipe philosophy and the 'less is more' mindset were not only not welcome, they were the adversary.
I agree that this is Microsoft's greatest heresy and also an effective tool against interop.
> promoting overcomplicated bad ideas over simple correct ideas.
Encode the program name together with all arguments as one big string, discarding all type safety; spawn a process running your favorite shell; let that shell process the string in order to spawn one or more processes; in each process have the standard library split the string back into an array called argv; and have your average process call out to yet another library to parse those argument strings into flags and parameters, printing a non-standardized, potentially localized string if an error occurs, and, if the error is fatal, exiting with a semi-standardized return code. The calling program then tries to make sense of the output of the called program, often by fuzzy matching against known output.
That's the standard way it has been done since the beginning of Unix. Some APIs skip the shell, but that's a minor detail in all this. If this sounds simple and like the obviously best solution, then congratulations. The Windows developers disagree and tried (and continue to try) to find a better way. They have mostly failed so far, but I think we should thank them for at least trying to innovate.
If you use some API other than system(3) - even if you literally use a shell-script - it's:
1) serialize data structure to semi-typed array of strings
2) pass array of strings directly to target program
3) have target program parse the array of strings to flags & parameters
With no shell or C library touching the command line arguments at all.
The only way this could be more direct would be if you used JSON instead of arrays of strings, web-API style.
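Roughly, in code - a minimal sketch of the three steps above using posix_spawnp (the "ls -l" invocation is just an arbitrary example):

#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>

extern char **environ;

int main(void)
{
    // Step 1: the "serialization" is just building an array of strings.
    char *argv[] = { "ls", "-l", "my file with spaces.txt", NULL };

    // Step 2: pass the array directly; nothing re-parses or re-quotes it.
    pid_t pid;
    int err = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
    if (err != 0) {
        fprintf(stderr, "posix_spawnp failed: %d\n", err);
        return 1;
    }

    // Step 3 happens in the child: it receives exactly these argv strings.
    int status;
    waitpid(pid, &status, 0);
    return 0;
}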
Trust me: it's not deliberate. There is no conspiracy. Everyone I met on Windows is at least as well-intentioned as anyone working in the POSIX world. The reason Windows has some bad APIs is the same reason Unix has bad APIs: someone bootstraps a system quickly and doesn't see the problems that can arise from their choice of APIs; the system becomes wildly successful; and now everyone has to support these ill-conceived APIs.
Sure, Windows command-line argument passing is bad, but have you ever tried using wait/waitpid/wait4/waitid/etc.? That's a nightmare in the POSIX world; Windows has nice, clean process handles, not the garbage /proc stuff that makes it fundamentally impossible to write a safe pkill(1).
If you're writing a brand-new system, for the love of God, do a good job of designing the APIs. You will not have a chance to go back and fix the APIs later.
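To illustrate the "clean process handles" point, a minimal sketch (not a full error-handling example): the parent gets a handle it can wait on like any other waitable object, and reads back a full 32-bit exit code.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    STARTUPINFOW si = { sizeof si };
    PROCESS_INFORMATION pi;
    wchar_t cmdline[] = L"cmd.exe /c exit 42";

    if (!CreateProcessW(NULL, cmdline, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi)) {
        fprintf(stderr, "CreateProcessW failed: %lu\n", GetLastError());
        return 1;
    }

    WaitForSingleObject(pi.hProcess, INFINITE);  // no SIGCHLD, no reaping races

    DWORD code = 0;
    GetExitCodeProcess(pi.hProcess, &code);      // full 32-bit exit status
    printf("child exited with %lu\n", code);

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}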
Uh, so much truth here. If I added music, this would be like the ballad of software. Once the cement has hardened on an application's design, reinforced with industrial strength needy customers who only want the next feature...that is game, set, match.
To list quickly from memory (and my game machine experiences) - all from Windows 10 Pro.
* New install with MS Office: I run autoruns.exe and see that I have 70+ things being run on startup. Yikes. Default install.
* Registry: multiple things, the biggest for me being that you can't easily move part (or all) of it between machines, because it's tied to a specific machine. Imagine if, on Unix, you couldn't move the whole /etc to a newly installed separate machine.
* Still no decent package management. Check Chocolatey (they do try to do some work here): among the dozens of problems they have, they still can't (and probably never will be able to) answer a simple 'list all files of a given package'. I don't even want to start on appx, Microsoft's newest invention: it doesn't return even HALF of the apps installed by default on a new laptop.
* Logging. During the 1607 update the process got stuck (6 hours in, and still 'please wait'): there is no simple place (a log) to analyze what's happening (and no, the event log is not such a place). Neither the daemons nor Windows itself do sane logging, with enough information constantly written to log files to analyze a problem (this is big, and deliberate).
* File system mess: e.g. system drivers running with kernel permissions installed in 'Program Files' (new Dell XPS from 2016), and ordinary everyday programs being installed in C:\Windows, while Windows happily allows it - again, a clean new install :)
* Naming of services/technologies and Microsoft's general approach to architecture design (boundaries and namespaces): this is a mess. One example among hundreds: check what the short name is of the background transfer service (the BITS one) that you need to restart if Windows Update (sic!) stops working - no, it's not bits or bts :)
These are only high-level things written from my phone. There are comprehensive lists on the net of 500+ things that could have been fixed.
Everyone does this "wrong" because every app does this differently. The core reason Windows command line options aren't (or at least shouldn't be) used to pass complicated data is that, way back when, DOS simply provided a single command line string to the executed program and let it parse the string itself. So no two command line parsers are the same. The glitch with escaping here is merely one symptom of a broader problem.
Unix got this right by forcing the shell to provide the kernel a pre-parsed list of strings, so the only insanity the tool integrator needs to understand is the shell's quoting syntax. Which is still insane. But it's only insane in one particular way.
On Unix, most tools don't really need to use the shell at all; it's enough to treat argument lists as lists internally and pass them to exec or posix_spawn (which of course, unlike the Windows _exec and _spawn, aren't broken for arguments with spaces!).
However, at minimum it is still useful to shell-escape arguments when displaying them to the user (for ease of copy+paste), so it's unfortunate that many languages don't have any standard library function to do this - including C on POSIX, and Python before version 3.
Alas, no. Bourne shell syntax allows for double quotes with variable interpolation and some other fancy syntax (including backslash escaping of literal double-quote characters), and a single-quote syntax for "raw" strings with no fancy syntax INCLUDING backslash escaping.
So your rule won't work. You can't single-quote a string that itself contains single quotes, which makes for some fun when you have arbitrary strings (file names are the big frustration) that need to be substituted into a parseable command line.
But like I said above: Bourne syntax[1] is only one kind of insanity, which is still much better than the DOS/Windows world of a separate parser for every app.
Their rule does work. The single quotes are transformed into '\''.
The first ' closes the current single quotes string. The \' adds a single quote character outside a quoted string (and outside single quotes you can use backslash escapes). Finally, the last ' reopens the single quoted string that we closed before.
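Since, as noted a few comments up, C on POSIX ships no standard function for this, here is a minimal hand-rolled sketch of exactly that '\'' transformation (sh_single_quote and its caller-supplied buffer are made up for illustration, not a library API; the output buffer must hold at least 4 * strlen(in) + 3 bytes):

#include <string.h>

static void sh_single_quote(const char *in, char *out)
{
    char *p = out;
    *p++ = '\'';
    for (; *in != '\0'; ++in) {
        if (*in == '\'') {
            // close the quoted string, emit an escaped quote, reopen it
            memcpy(p, "'\\''", 4);
            p += 4;
        } else {
            *p++ = *in;
        }
    }
    *p++ = '\'';
    *p = '\0';
}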
_exec and _spawn are not Windows; they are functions in the MS Visual C redistributable run-time.
(Well, they are also in a system DLL called MSVCRT.DLL. That is an internal library which is undocumented and considered by Microsoft to be off-limits to applications.)
Unix got something right in that you can unambiguously pass a list of separate strings to launched processes. However, it does nothing to ensure unambiguous meaning of those strings.
This is for example why you should avoid giving your files such cute names as '-rf'.
Firstly, its IEEE standard (1003.1 or "POSIX") specifies the -- convention for separating option arguments from non-option arguments. The tiny handful of utilities like "echo" which do not implement it are also documented that way.
Secondly, Unix provides the POSIX standard getopt C library function, and getopts command. Programs and scripts which use these standard functions for processing options will implicitly support the -- convention.
Developers of new command line programs can ignore the documentation and standard functions, of course, developing their own non-conforming parsing from scratch. But at least users have something to point to if they report that as a problem: look, your program isn't supporting --, meaning that you ignored both the POSIX standard convention and the library function which enforces it.
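For anyone who hasn't used it, a minimal sketch of the standard getopt() loop; because getopt() stops at "--", running it as ./a.out -v -- -rf reports one option and one operand named "-rf":

#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int opt;
    while ((opt = getopt(argc, argv, "vf:")) != -1) {
        switch (opt) {
        case 'v': printf("option -v\n"); break;
        case 'f': printf("option -f with value %s\n", optarg); break;
        default:  return 1;   // getopt already printed a diagnostic
        }
    }
    // Everything from optind onward is an operand, even if it starts with '-'.
    for (int i = optind; i < argc; i++)
        printf("operand: %s\n", argv[i]);
    return 0;
}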
Yes there is. You want the filesystem to be flexible. If the shell doesn't like those characters, use a different shell that doesn't care. It's brain-dead to create a filesystem that prevents flexibility in user interfaces.
Flexibility is only a good thing if the benefits outweigh the costs. I insist that there are no legitimate (i.e., no better option) use cases for control characters in file names. The filesystem being "flexible" is not a good thing if flexibility causes real problems.
That's not a problem because the UTF-8 encoding of U+3707 will absolutely not contain any USASCII control characters, or any special shell or filesystem characters. It will all be bytes in the range 0x80-0xFF.
There are other encodings than UTF-8 though. Which is kind of my point. If you have your file system set to UTF-16 (doesn't NTFS do this?) then 0x07 will be present.
I also believe that filesystems should require that all filenames be fully normalized UTF-8. I don't think the benefits (slight, IMHO) of allowing filenames to be arbitrary byte strings outweigh the costs of code complexity and security problems.
The operating system could address it by having a separate argument list and option list at the kernel level, creating an unambiguous interface for calling a program, giving it a list of options and non-option arguments.
Ambiguity would remain in how a given shell parses input to determine what are options and what are arguments: but this would at least be out of the control of individual programs. Notably, the shell would be the tool which parses the -- convention. Programs wouldn't see the -- delimiter which separates options from non-options, so it would be impossible for a program to neglect to implement support for --.
Yes, programs are free to interpret arguments any way they want. (See dd(1).) But in practice, almost all programs interpret a leading dash in an argument word to mean "here be options". By banning filenames with leading dashes, we close a large number of security holes at minimal cost. Of course it's not a total solution, but from a pragmatic perspective, it's the right thing to do, because it goes a long way toward solving a real problem.
It's harm reduction. Yes, everyone should be escaping input. Yes, everyone should be using "./.foo" instead of just ".foo". But people don't, and they're not going to start. If we ban leading dashes, we stop these bugs from turning into security vulnerabilities.
Your stance is like being against ASLR because developers just shouldn't have buffer overflow vulnerabilities in their code.
What does it mean to say that an argument's meaning could be unambiguous regardless of the program it is passed into? That's a logical impossibility.
Still, it would be right and proper if Unix programs had a little type-safety in their arguments, for example by requiring that ALL arguments be flags, as in this hypothetical smart_rm: "/bin/smart_rm -rf --pattern foo/bar"
This depends on the application recognizing the -- convention, and also on all the little scripts in your system remembering to use the --.
Even if the OS kernel provided a process launching API with separated options and arguments, that would not remove the need for the -- syntax to remove the ambiguity at the shell level, and hence your need to use that in scripts.
It would remove the problem of programs all being required to implement --.
This is quite a good example of the kind of problem you run into when you follow the philosophy of representing all data in informally-specified ad-hoc text formats. Everyone thinks they can just roll their own parser/serialiser, which they then neglect to test thoroughly enough, creating subtle bugs when the serialisation side forgets to escape data somewhere, or the parsing side doesn't even provide any way to escape grammar-significant characters.
No, the problem is the lack of a standard encoder function to go with the standard decoder function. It's not "everyone thinks they can", it's "everyone has to".
The problem is also YAGNI and validation-thru-testing, instead of up-front design. The "ad-hoc" and "text" parts aren't what's important, it's the whole approach of not doing any more than the bare minimum of up-front work. Which seems to historically give overall better results, even if it does come with interesting bugs that need fixing later.
Well, the real problem here is that there's not really a 'standard' you can rely on in the first place.
And while it's true that text isn't a necessary part of the general problem, in my experience text-based formats seem especially prone to it. How many times have you seen people attempt to use regular expressions to parse HTML/validate e-mail addresses/whatever?
I think this indicates a huge difference between Microsoft/UNIX mindsets. Microsoft allowed
rename *.txt *.bak
To do this, the "rename" command had to understand how to parse the asterisk character while being familiar with the contents of the directory.
However, this makes creating a new replacement "rename" command difficult, as is creating any new command that must parse wildcards itself.
In the Unix environment, the shell expands the asterisk to all files that matches that pattern, and then passes these files to the "rename" command, who never sees the asterisk.
Therefore it's trivial to create a new "rename" utility, because it doesn't need to parse wildcards. However, renaming all .txt to .bak is awkward on a UNIX system.
It may be awkward, but there's no point in ruining a perfect system for just one use case. For what it's worth, zsh provides a capable renamer tool called zmv. An example:
zmv -W '*.txt' '*.markdown'
And of course there are tools like rename(1) that work regardless of the shell used.
That's not at all user-friendly, but these are supposed to be programmer tools... If you want user-friendly file operations on UNIX command line, use midnight commander. It can do mass rename, etc.
I don't see how the problems you're listing have anything to do with text formats; you have these kind of issues with any "informally-specified" format.
That's about Windows, but many uses of system() that involve non-static strings also probably get quoting wrong.
Of course, avoiding the shell is the best way to avoid the problem. Sometimes, you can't avoid the shell.
My preferred shell quoting method, for unix, is to wrap each parameter in single quotes. Then only single quotes inside a parameter are a problem. They can be replaced with '"'"'
Probably a lot of things use double quotes and perhaps try to escape $ and ' and " but miss details to do with \ and perhaps other characters that some shells treat specially.
Another way is to pass the filename in the environment: system("rm -rf \"$DIR\"")
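A minimal sketch of that environment trick, for the record (remove_path is a hypothetical helper; I've also added "--" so a value starting with a dash isn't taken as an option, which the quoting alone doesn't prevent):

#include <stdlib.h>

int remove_path(const char *untrusted_path)
{
    // The untrusted string never appears in the command text, so the shell's
    // parser never sees its quotes or metacharacters; "$DIR" is expanded
    // after parsing and is not word-split or glob-expanded inside "...".
    if (setenv("DIR", untrusted_path, 1) != 0)
        return -1;
    return system("rm -rf -- \"$DIR\"");
}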
My favorite awkward environment to cite is running a command remotely over ssh. As far as I've been able to tell from casual testing, without having read the source code yet, ssh does something very similar to what Windows does here and just glues everything together with spaces and passes it to the remote shell for interpretation, so you have to deal with the shell and provide your own quoting.
That's correct - AIUI, the ssh utility has to smush the command & arguments into a single string, because that's all the protocol's "exec" request can handle:
And I guess it can't really do any quoting/escaping, because there's no guarantees in the protocol as to how such things will be interpreted server-side - the command line could be interpreted by cmd.exe on the server for all `ssh` knows ;)
With that in mind, I've made it a habit to always quote the command-and-parameters part of `ssh` lines into a single argument - I figure, that's what it's doing anyway, so it's best to be explicit about it. And when I know I'm working with bash on both ends, my pattern of choice is
> My favorite awkward environment to cite is running a command remotely over ssh.
Dude, if you run remote commands by calling them through SSH, you didn't just
get things backward: you fucked things up heavily. SSH was never ever designed
as a batch, unsupervised tool, despite many people using it as such.
Remote code that is parametrized should be run exactly as that: as a remote
procedure call, a technique known for over thirty years now. One of the
reasons is quoting (because for a non-interactive SSH call the command needs to
be quoted exactly twice if run via a shell, and exactly once when run from exec()),
but there are problems with distributing keys, maintaining usable home
directories, and disabling unnecessary things that are enabled by default
(port forwarding, among the others), and that doesn't exhaust the list of
issues.
A proper RPC protocol, like XML-RPC (which was released twenty years ago and
is still usable while being quite simple), covers quoting -- or actually,
serializing data -- without programmers worrying if they got their list of
metacharacters right and did enough passes for things to work correctly.
On the other hand, I'm not surprised that people do this through SSH (and
a variant of this stupidity: adding the apache user to sudoers, so a web panel
can add firewall rules). After all, I've never seen an easy-to-use RPC server
that has all the procedures passed in its configuration. I needed to write
such a thing myself (once in Perl, as xmlrpcd, and recently in Python, with
a custom protocol that can do a little more, as harpd of the HarpCaller project).
And yet, sometimes you're working in an environment that already has a good ssh configuration for other reasons, and you're very low on engineering time that you can invest into something, and ssh is good enough for a first pass implementation. Alternately, you may be working on some kind of ad-hoc data collection or maintenance task that's not going to become part of any long-term infrastructure (or will be replaced by something better), and you don't yet have any better systems in place to run ad-hoc programs across the cluster.
I completely agree with you that good RPC is a much better foundation to build reliable systems on.
[edited to add]: HarpCaller looks like a pretty interesting project, and similar to several things I've considered building in the past. Nice work.
> sometimes you're working in an environment that already has a good ssh configuration for other reasons, and you're very low on engineering time that you can invest into something, and ssh is good enough for a first pass implementation.
I would have agreed, personally, before I wrote xmlrpcd. After writing it, I, its
author, have no excuses for using SSH as an RPC protocol. Though I'm not good
on the marketing side, so I understand that people just don't know about such
tools.
> Alternately, you may be working on some kind of ad-hoc data collection or maintenance task that's not going to become part of any long-term infrastructure (or will be replaced by something better), and you don't yet have any better systems in place to run ad-hoc programs across the cluster.
Honestly, this is yet another matter.
To properly manage a set of servers, one needs three different services[&],
each for a different thing. One service is for running predefined procedures
(that can possibly be parametrized) -- this is what HarpCaller and earlier
xmlrpcd are for. Another service is for managing configuration and scheduled
jobs -- this is a place for CFEngine and Puppet. Then there is what you just
said: a tool for running commands defined in an ad-hoc manner and collecting
their output synchronously. Of the three, the first and second don't match
how SSH works and is used, but for the last one it actually makes sense.
[&] It doesn't have to be three services, but we don't have one that would
cover all three in a uniform way.
> [edited to add]: HarpCaller looks like a pretty interesting project, and similar to several things I've considered building in the past. Nice work.
Thank you. I'm quite proud of how it turned out, and the middle part of it was
an excellent pretext for me to write something for production use in Erlang.
In the case of XML-RPC, the authentication mechanism is quite obvious: HTTP
authentication. Permissions are hardly a problem, even less so than in the
case of SSH, because you don't give the caller full shell, only a small set of
well-defined operations. And the definition of the call interface does not magically
go away when you move to SSH, so I don't know where you came up with
this argument.
That being said, SSH has a plethora of problems as an RPC mechanism. Host key
distribution sucks heavily if you have more than a handful of servers. User
key distribution is even worse, unless you incorporate external mechanisms.
You need to maintain a usable home directory for a service, which otherwise
wouldn't need such a thing. SSH has plenty of obscure functions, like port
forwarding in three flavours, X11 forwarding, VPN baked in, and others. And to
use SSH-as-RPC for a service you need to disable them. Are you sure you
have covered every single one of them? And then there is also mixing
a debugging channel with regular operations. Break just one and you cannot
recover (and it's easy to break a debugging channel, as you want it
reconfigured and limited to only allowed accounts and whatnot). Those two
should be separated.
The only thing that SSH does better than XML-RPC is streaming a response. But
first, it's a rarely used function for a setting where you need to execute
a remote operation, and second, because it sometimes is actually useful,
my HarpCaller (RPC daemon) needed a custom protocol.
SSH is very, very far from being an excellent protocol for running
predefined remote operations, even if it were only for issues with quoting,
which are difficult on their own.
Oh, quite the contrary. They do need it; otherwise you have a system that is
just waiting to break apart.
However, I agree that quoting for ad-hoc synchronous commands to be run
through SSH is very troublesome and tiring, especially when one doesn't
understand how the commands are executed through non-interactive SSH (and most
people don't).
You know what the difference is between XML-RPC and SSH with regard to payload
encryption and client authentication? Only the fact that SSH has the two
covered by mandatory parts of the protocol, and XML-RPC has this part
optional (HTTPS and either HTTP authentication or client certificates).
Nothing prevents you from exposing procedures only through HTTPS and only to
authenticated clients. In fact, my xmlrpcd works exactly this way.
Security-wise, your advice to use something that makes building a correct
system virtually impossible (because of quoting issues, unnecessary features
enabled by default, and others) is simply stupid and dangerous.
As well as running things over ssh (which IIRC requires double-shell-escaping), su -c and similar take a single parameter containing the command to run.
I'm going to use this opportunity to ask a question I've been thinking about for a long time. Why do we have both environment variables and command line arguments? They are the same thing, except one is key-to-value and one is positional and often needs to be parsed by hand in an ad-hoc fashion. I don't think that people should use command line arguments when environment variables are an option, and I'm not aware of any use cases where they are not an option.
Quickly, I can think of one reason - command line arguments override environment variables whenever conflicting options are present (programs should be written so that this is true). This gives you the flexibility to run a one-off command with specific options without having to set and then reset environment variables (of course, you can also deal with this by launching new shells that have the one-off environment variable settings without tampering with parent shells).
There's a longer debate about this topic on Stackoverflow titled "Argument passing strategy - environment variables vs. command line". [1]
— It is possible to “see” a command line (in "ps" output, etc.) in ways that you won’t see environment settings, so environment variables can be useful when passing information to a sub-process that you need to keep private.
— It may be that you are configuring your sub-process in a place that is “far away” from the point that actually executes the command. Rather than have to thread an extra command-line argument through your code to make sure it is part of the final command invocation, it can be quite convenient to just set a variable. I have often used this to enable debugging features or test experimental features, or even to disable entire features when unexpected problems arise.
— In a similar way, your program may use multiple languages or otherwise be difficult to manage in any common way without environment variables.
— In a cross-platform scenario, environment variable names might be far easier to keep constant across UNIX, Windows, etc. than command-line syntax.
The main argument, to me, for command-line arguments is that they're not automatically inherited by child processes like environment variables are, so you don't have to rely on every process tidying up its environment before executing anything else. To me, that just seems like a recipe for heisenbugs and spooky-action-at-a-distance.
The POSIX shell has nothing to do with the syntax of the command. You can write a shell script that parses
cp from=foo to=bar
if you wanted to. The shell expands metacharacters and variables, and sets up STDIN/STDOUT. But = is not a metacharacter, so it is passed to the command unchanged. Several UNIX commands use that sort of syntax - like dd(1).
With the "env" command it’s possible to set environment using key=value for anything (it just means you have to say "env a=x b=y cmd -arg1 -arg2" instead of expecting "cmd a=x b=y -arg1 -arg2" to be valid).
You may already be aware of this, but it's trivial to read a process's environment variables, for any process running as the same user, or when root. They're exposed as a null-delimited text file in /proc/$pid/environ. You can even get ps to print the environment variables for you, if you use the 'e' flag (no leading dash). Depending on your actual security constraints, this may be important to be aware of. Of course, there are a variety of options for reading a process's arbitrary memory locations, so for actual security you need to control access to the host; but if you're worried about 'ps' leaking command line arguments, you should be similarly aware of 'ps' showing environment variables.
Yep, that's why I mentioned "as the same user". It's slightly less of a risk of data exposure, but it's worth being aware of when evaluating your threat model.
A more interesting question is, why do we pass in command arguments and environment variables to a subprocess, but only get an integer status code back? The original reason comes from the way fork/exec was implemented in PDP-11 UNIX. But that was a while ago. At some point, the subprocess concept should have been extended to handle return values, like all other forms of function call. "exit" should have an argc/argv, which get passed back to the caller.
It's always bugged me that the POSIX exit status can only really communicate seven bits of information reliably. (The other half of the byte is overloaded with signal-exit information from the shell.) Windows does it better: there, you at least get 32 bits, which is enough for an HRESULT.
Also, I've never understood where this exit(-1) idiom comes from. It's nonsense.
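To put the last two points in code - a minimal POSIX sketch: waitpid() hands back one packed int, the W* macros split normal exit from death-by-signal, and only the low 8 bits of the exit code survive, which is exactly why exit(-1) shows up as 255 on the other side:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(-1);                    // child: -1 is truncated to 255

    int status;
    if (waitpid(pid, &status, 0) == pid) {
        if (WIFEXITED(status))
            printf("exited with %d\n", WEXITSTATUS(status));   // prints 255
        else if (WIFSIGNALED(status))
            printf("killed by signal %d\n", WTERMSIG(status));
    }
    return 0;
}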
The reason is of course that returning a variably-sized value to an already-existing address space is annoying. IOW that's what standard output (and pipes on /proc/self/fd/N) is for.
Interesting idea, and would make it more flexible, like Python and other languages which can return multiple values (from a called function to a calling function or the module scope, not to the OS, AFAIK).
Think of it like nested context in a "normal" programming language. The environment variables are state or data from the outside or encompassing context. Whereas command-line parameters are just that, parameters passed down to a function based on some logic held within the parent context.
Within that explanation, we all know that global/shared variables are a code-smell for the most part. Say you want to call the same command with different logic, or multiple times even.
greeting = "hello"; greetee = "world";
result1 = func(); // How do we even know that func uses the greeting and greetee variables?!
greetee = "another world";
result2 = func(); // Did func change my greeting variable? I don't know.
So let's assume that greeting and greetee are actual important variables. You are essentially then sharing your "state" with the func in order to alter its behavior. I think in some shells, the functions themselves can alter global environment variables, so it would be a giant mess making sure that functions are idempotent and don't have artifacts.
Environment variables affect all instances; command-line arguments affect only one program instance. Set your defaults with environment variables (or a config file), then override them as needed with command-line flags.
In the Unix world, environment variables are passed to ALL processes spawned by the parent, including sub-processes. For example, if you log onto a computer and your HOME variable is set, then every single process you launch will know your home directory, including processes that launch other processes. It's automatic UNLESS a process explicitly changes this value. This does not use any sort of global registry. I used to be an admin of a VAX computer that had 50 simultaneous users logged onto the server, and each user had a different HOME directory.
Environment variables also made shell scripts reusable by other users. The file $HOME/special would refer to the "special" file in the user's home directory.
Command line arguments are only passed to the one single child process. And if that process wants to launch a new process, it must create its own command line arguments.
Environment variables are inherited to child processes by default, so you can think of them as arguments to a whole process group, not just to the program you invoke at the top level.
Parameters are lexically scoped; environment variables are dynamically scoped. Today dynamic scoping is frowned upon as an instance of spooky action at a distance, but in the 70s I guess it wasn't that obvious (and environment variables probably predate Unix).
Also dynamic scoping can be very powerful when stitching together pieces separately designed. To this day emacs lisp is still dynamically scoped by default and arguably it derives some of its power from it.
> *.txt gets expanded by bash into a list of arguments. I don't know if this is the case in Windows.
I don't think it is the case in Windows, and this seems not to have changed since DOS days, when some programs would be able to handle wildcards (internally) while others could not, because it was done by the individual programs, not the shell (COMMAND.COM, or nowadays CMD.EXE).
A quick test:
$ python -V
Python 3.5.2
$ type test_arg_list.py
import sys
print(sys.argv)
$ python test_arg_list.py a t* b
['test_arg_list.py', 'a', 't*', 'b']
So wildcards are not expanded. I'm sure there are Windows calls to expand them (there have been since the DOS days, like FindFirst and FindNext - an awkward approach, IMO), but your program has to actually use them for the expansion to work.
In fact, that is what I did, via the Python glob module, in this recent post:
Simple directory lister with multiple wildcard arguments:
Whereas, in Unix, the shell (at least sh / bash) does it automatically for all arguments for all command-line programs, before the program even sees the arguments. This is one of the (many) key benefits of the shell. In fact, all metacharacters are interpreted by the shell and/or the kernel, acting together. This includes redirections, piping, the many special symbols that start with $, backquotes, and many others.
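For completeness, this is roughly what a Windows console program has to do itself if it wants *.txt expanded (a minimal sketch using FindFirstFileW and FindNextFileW; on Unix the shell has already done this before main() runs):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileW(L"*.txt", &fd);

    if (h == INVALID_HANDLE_VALUE) {
        wprintf(L"no matches\n");
        return 1;
    }
    do {
        wprintf(L"%ls\n", fd.cFileName);   // each match, one per line
    } while (FindNextFileW(h, &fd));
    FindClose(h);
    return 0;
}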
I think I may have the answer (at least for Unix). IIRC I read about this somewhere a while ago, and had also thought about the issue myself earlier, so it's a combination of reading the reason somewhere and (maybe) figuring it out. Anyway, here it is:
It is because it allows 3 different ways of setting options for commands: rc files, environment variables (env. vars from now) and command line arguments, with each subsequent one able to override the previous one. The logic being that they go, in order, from more permanent to less permanent (as settings). rc files (rc stands for run command, a term I think I read Unix inherited from some previous OS) are config files for commands, like .exrc and .vimrc for vi/vim, .bashrc for bash, .netrc and many more. Any command can create or require users to create its own rc file, and can use it if present to read settings. The setting in a file is less easy to change on the fly than an env. var (not really difficult, of course, just that you have to go edit that file in an editor - or use sed etc.), and an env. var in turn is (a bit) less easy to change than a command line option, when we are talking about multiple different invocations of a command, in which you want the values for that option to be different in some of the invocations.
Let's take the example of a setting for a port (for a network server or client):
First, put the most common and permanent setting for the option, say, PORT=8080, in the rc file, say .foorc (for command foo - whether foo is built-in or written by you).
Second, for times when you want to change it for say today's work, set (i.e. change) it via an env. var, like:
export PORT=8181
foo args ...
# this setting will remain in effect until you change the var or you logout/reboot, and as long as it is present, will override any PORT value in .foorc each time you run foo.
It can also be shortened to:
PORT=8181 foo args ...
# but this is now a one-time setting of the env. var, so will override any PORT value in .foorc for this run only.
# In both the above variants, the args will not include PORT, since the foo command will be written to check for an env. var called PORT internally (and similarly checks for a PORT setting in .foorc before checking for an env. var called PORT, with the latter overriding the former if both are present).
And third, for the time(s) when you want to change the PORT setting on the fly, maybe just once for today, do:
foo --port 8282 args
which will override the settings for port (if any) in both the rc file and the env. var.
So the order is: command line option overrides env. var. and env. var overrides rc file setting.
This is what I read/figured out. It gives a lot of flexibility. Many Unix commands work that way. If you want your own to work that way, you have to write the code for it, like checking for presence of the rc file and for the setting in it, checking for the env. var with getenv() and finally checking for the command line option.
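A minimal sketch of that precedence for a hypothetical foo command (the rc-file reading is stubbed out; none of this is a standard library facility, it's just the pattern described above):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int port_from_rcfile(void)
{
    // Pretend we parsed PORT=8080 out of ~/.foorc; return -1 if absent.
    return 8080;
}

int main(int argc, char *argv[])
{
    int port = 80;                     // built-in default

    int rc_port = port_from_rcfile();  // 1. rc file
    if (rc_port > 0)
        port = rc_port;

    const char *env = getenv("PORT");  // 2. environment variable overrides rc file
    if (env && *env)
        port = atoi(env);

    for (int i = 1; i < argc - 1; i++) // 3. command line option overrides both
        if (strcmp(argv[i], "--port") == 0)
            port = atoi(argv[i + 1]);

    printf("using port %d\n", port);
    return 0;
}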
Actually, Bash does have some issues with command line arguments when variable expansion is involved. For example, for Java apps it is fairly difficult to pass -DsomeParameter="something with a space in it" via variable expansion.
I'm trying to find the exact situation but basically if you have something like:
PROPS="-Dprop1=foo bar -Dprop2=bar foo"
java $PROPS SomeClass.class
You can try to escape with backslashes and whatnot, but it becomes fairly hard, if not impossible, to expand multiple parameters correctly from a single variable. I believe one solution is to use arrays, and another is to just not use arguments and instead rely on other configuration mechanisms (env variables, files, etc).
You see this often rear its ugly head with daemon scripts.
(The quotes around `${PROPS[@]}` are important.) Do note, there is an unfortunate edge case here: if PROPS is empty, you'll get a spurious `""` arg passed to java. There's a less pleasant syntax that avoids that issue but I don't recall it off-hand.
(edit: Please read the replies to my post, I didn't think about the fact that this syntax is bash specific. Thanks to those who pointed it out)
Oh, I am sure there are ways around it, but the big issue is that almost none of the daemon scripts out there do it. That is, you can't just set some configuration in /etc/default/some_daemon, because the script will try to concatenate the command.
I tried to find a failsafe solution once while rewriting a daemon script and just gave up.
Sadly not: only quotes that appeared literally in the command-as-written affect word-splitting or are subject to quote removal, not those resulting from an expansion step (variable substitution, globbing, etc); so you'd end up with four elements in argv (or one if you double-quoted the variable reference), with literal single-quote characters in them. You'd have to do something unpleasant with "eval" to make that work.
It's a sensible rule, when you think about it - otherwise, for example, an expansion that introduced mismatched quotes would cause total chaos.
Yeah, that's the very behaviour I'm talking about - it just happens to be a problem in this particular case: the single-quotes in hk__2's PROPS variable would be passed to the executed command, not interpreted by the shell, but we wanted the latter here.
I think the parent misattributed where the fsckup was. Most probably it wasn't
bash or any other shell as such; it was probably a broken /usr/bin/java script
(yes, it used to be a shell script) that looked for the Java bytecode interpreter
in various places. Like most scripts that come from a big company, its style was
terrible.
A friend of mine had to fix this recently for PowerShell, which had a regression that caused arguments you passed to programs to be incorrectly escaped to executed commands.
It is extremely common to get this wrong. Apache Portable Runtime even gets this wrong :/. (I haven't submitted a patch for this yet, but I intend to: I ran into it a couple months ago and then got distracted after working around it in my program by predicting what incorrect escaping might be performed by APR and compensating by adding quotes and escape characters to my input to their open process function... my build is statically linked so I don't feel bad about this temporary hack ;P.)
This should be "quoting command line arguments the right way for passage into applications developed using Microsoft Visual C, and linked to its C run-time library that parses the command line string and calls main or wmain".
There is no general correct way to quote command line arguments in Windows, because every application receives just a character string which it parses however it wants.
There is no single specification for the syntax by which arguments are delimited within the command string.
I don't really know what to say to this without risking being downvoted. But this coming from a Microsoft blog is a little... awkward. DOS heritage plus really bad shell implementations, well... I avoid the hell out of using the command line on any Windows. Luckily there is mintty.
High-level APIs could easily provide this. E.g., the C#
Process.Start(executable, args);
takes a single string as args, and has no overload taking an array of args that would format them safely and correctly so that the receiving process sees the same strings in its args vector.
So while it seems to be pretty easily fixable, it hasn't been.
The right way to do it would be to change the OS API to accept an array of strings, and turn this function into an overlay that has its parameters parsed and broken apart in user space.
But yes, if MS ever fixes this, it's more likely that they'll go the route you described. I can't wait to see how many bug reports with crazy descriptions it will create, and how many more overlays will be written to fix the bugs while preserving backward compatibility.
IIRC, Microsoft C (quite some years ago - I had worked on a product using it) had different variants of functions to spawn or execute a process from another one - like the exec family: execlp, execvp, execvpe, execlpe, etc. - which varied in things like fixed vs. variable number of args, checking the environment vs. not, etc. I also remember reading about it in a Waite Publications book, The Microsoft C Bible, by Nabajyoti Barkakati. Not sure if the issues mentioned in the OP could be solved if those functions were present - need to check.
Edit: and the DOS / Windows exec family of functions was likely derived from Unix's exec().
What I find frustrating is how many MS tools freak out at the sight of a quoted path, when quoted paths are what "copy filename with path" (or whatever the context-menu command is called) gives you.
I recall seeing different behavior between the C runtime and CommandLineToArgvW. I don't remember what the difference was, but I remember it driving me nuts.
Oh. I was confused until I realized the publishing site was Microsoft. Apparently "Everyone" only refers to Windows programmers, and Unix/Mac/whatever programmers do not exist in this universe.
Looks like the title has been fixed but I was also confused before this thread filled up with comments confirming my suspicions that it was a Windows-specific issue.
It does seem rather misleading to say "everyone" does it wrong when it's specifically a problem with Windows APIs (though not terribly surprising from the Windows team).