How Command Line Parameters Are Parsed (2016) (daviddeley.com)
147 points by nikbackm on June 6, 2017 | 41 comments



Windows command line parsing is crazy. I recently stumbled onto this post, which explains various pitfalls of parsing command line arguments on Windows:

http://www.windowsinspired.com/how-a-windows-programs-splits...


The first example

  "She said "you cant do this", didnt she?"
Parses the same way in Bash:

  $ for A in "She said "you can\'t do this\!", didn't she?"; do echo -$A-; done
  -She said you-
  -can't-
  -do-
  -this!, didn't she?-
So I'd have thought this would be a given. It doesn't look so unintuitive – though I'm not sure if the results are the same for the same reasons.

So I went to take a look at the Bash source code, which is (expectedly) pretty hairy. Top-of-stack quoting characters are referenced throughout, to enable a `pass_next_character` which seems to me conceptually similar to `ignore_special_chars`.


That of course depends on whether you count msvcrt as part of Windows or not. I mean, the fundamental interface, GetCommandLine, arguably works quite sanely in Windows; the crazy bits come from userland.
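
For example, a minimal sketch of that split between the OS and userland (CommandLineToArgvW comes from shell32, so link against it):

    #include <windows.h>
    #include <shellapi.h>   /* CommandLineToArgvW */
    #include <stdio.h>

    int main(void)
    {
        /* The OS hands the process exactly one raw string... */
        LPWSTR raw = GetCommandLineW();
        wprintf(L"raw command line: %ls\n", raw);

        /* ...and turning it into argv[] is a userland convention. */
        int argc;
        LPWSTR *argv = CommandLineToArgvW(raw, &argc);
        if (argv != NULL) {
            for (int i = 0; i < argc; i++)
                wprintf(L"argv[%d] = %ls\n", i, argv[i]);
            LocalFree(argv);
        }
        return 0;
    }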


The Unix section is missing a case: when invoked from login, argv[0] has a '-' prepended.

I understand not many people write a program suitable as a top-level shell (for use in /etc/passwd), but for completeness...


No, it is not. This is not a distinct case when it comes to _command-line parameter parsing_, which is the topic of the article.

The cases when it comes to parameter parsing on Linux and Unix operating systems are things like:

* A POSIX-conformant shell parsing a command line, in a shell script or entered interactively, into an argument vector.

* systemd parsing a string in ExecStart= in one of systemd's service definitions into an argument vector.

* Things that parse .desktop files, as given in the article.

* execlineb or nosh parsing a file into an argument vector.

- http://skarnet.org/software/execline/grammar.html

- http://jdebp.eu./Softwares/nosh/guide/nosh.html

* init parsing single fields of records in /etc/ttys or /etc/inittab into argument vectors.

* cron programs parsing fields in crontab records into argument vectors.

and so forth. These are all subtly or blatantly different, and the headlined article has glossed over this to a degree.


Is this an actual separate case? How are top level shells invoked? Isn't this just a specific call to execve?


It is a good example of how argv[0] is just another argument that is passed by the caller of execve() and does not necessarily have to be related to the actual filename of the executed program in any way.
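
A minimal sketch of what a login-style caller does (assuming /bin/sh is the shell being launched):

    #include <unistd.h>

    int main(void)
    {
        /* The file executed is /bin/sh, but argv[0] is "-sh": the leading
           '-' is just a convention telling the shell it is a login shell. */
        char *const argv[] = { "-sh", NULL };
        char *const envp[] = { NULL };
        execve("/bin/sh", argv, envp);
        return 1;   /* only reached if execve() failed */
    }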


Not sure if it still works that way, but years ago I used to overwrite the argv[0] parameter with my own data to change how the process appeared in lists such as the one shown by the 'w' utility.
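
Roughly like this (a minimal sketch; the new text has to fit within the space the kernel originally gave argv[0]):

    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        (void)argc;
        /* Overwrite the buffer argv[0] points at, in place.  Tools that
           read /proc/<pid>/cmdline (ps, w, top) then show the new text. */
        size_t room = strlen(argv[0]);
        memset(argv[0], 0, room);
        strncpy(argv[0], "look-at-me", room);

        pause();   /* park the process so it can be inspected in ps/w */
        return 0;
    }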


Yep. There's a tool in djb's daemontools that uses this behavior - https://cr.yp.to/daemontools/readproctitle.html . It's kind of a pathological case, but it's a cool way to ensure that you can get some data from your processes even on a totally horked system, as well as being fairly convenient from an old-school administration perspective (giving the ability to see what errors a program has encountered with just ps or top, which are already likely to be the first commands run when something's wrong).


I am not sure I understand your question. Of course /bin/login calls some version of exec. However, /bin/login formats argv[0] differently. You can see this in ps.

I suppose most people don't talk to /bin/login any more unless they ssh someplace.


After Systemd/Linux we don't have /bin/login anymore.


Yeah, not a separate case but good to mention.


It'd be interesting to know why Microsoft implemented parameter parsing that way. Maybe they decided to let the executable parse its own parameters as an optimisation: it means that if the executable doesn't care about its arguments (or doesn't have any, like most GUI programs), an extra step is saved.

Except now it's clear it was a premature optimisation, given that it takes 10 pages to document the parsing behaviour and has probably also cost millions of wasted developer hours fixing weird corner cases.

Maybe at that time they thought, "well programmers can just call GetCommandLine() to get the args in a consistent way".


I suspect it is a remnant of the DOS days.


The "command line as a single string" dates to CP/M if not earlier:

https://en.wikipedia.org/wiki/Zero_page_(CP/M)

I don't think it's an optimisation, but a design choice in favour of simplicity; in fact, CP/M (and DOS) would parse the first two arguments if present and put the results into the filename fields of the two default FCBs, allowing programs which take two filename arguments (traditionally, an input and an output) to be implemented easily without having any command-line parsing logic of their own. More info on that here:

http://www.gaby.de/cpm/manuals/archive/cpm22htm/ch5.htm

It would be interesting to trace the origins of command-line parameter passing design even further, to the mainframe OSes that came before UNIX, but I'm not so familiar with them.

Edit: OpenVMS appears to also pass a plain string to the process, which is then parsed within: http://hoffmanlabs.org/vmsfaq/vmsfaq_015.html (section 10.3)


Maybe they could have designed it to work either way: based on a flag of some sort, Windows either parses the args (does the shell-level interpretation of quotes and other metacharacters) or does not, and in either case it then passes the (parsed or unparsed) args to the command. That flag would have to be understood and acted on by Windows, though, not by the individual commands, since the action has to happen (or not) before the command gets the args. Might be clunky; I haven't thought the idea through in detail.


> (Note: Avoid using printf when printing parameters, as the parameter may contain a % which would be interpreted by printf as a conversion specifier. The program may crash with a bizarre error, such as "runtime error R6002 - Floating point not loaded".)

It would be safe to use

    printf("arg[%d] = %s", i, argv[i]);


It should say: Avoid using user input as your format string. Never pass command line parameters as the format argument to the printf family of functions.
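
For example, the difference in a nutshell:

    /* Dangerous: argv[i] is used as the format string, so a stray %s or %n
       in a parameter gets interpreted by printf. */
    printf(argv[i]);

    /* Safe: the parameter is only ever treated as data. */
    printf("%s\n", argv[i]);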


An error which I still think is ridiculous: argument list too long.


In times past, the command line was pushed onto the user stack of the new process before the process was started. This meant you were limited to the default available stack, plus or minus whatever else needed to be pushed onto the stack for a starting process. In older kernels, sbrk() (to grow a process's memory) could only be called on a process that already existed, not one that was being created. This meant your argument list could indeed be too long.


I was vaguely aware of this, but whatever the reason, it's ridiculous that in this day and age of gigabytes of memory I still can't use shell wildcards in directories with a decently large number of files.


The limit on Linux is usually a bit under 2MB (ARG_MAX); other Unices will no doubt vary, and on Windows it's 32KB. I'm sure you are aware that xargs is your friend.
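
If you want to check the limit on a given box, a quick C sketch:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* sysconf(_SC_ARG_MAX) reports the space execve() allows for the
           combined argument list and environment on this system. */
        long arg_max = sysconf(_SC_ARG_MAX);
        printf("ARG_MAX = %ld bytes\n", arg_max);
        return 0;
    }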


Yes, xargs works on Unixen. I wonder if there is any equivalent on Windows (other than using Unix xargs via Cygwin etc.), or writing your own? This point didn't occur to me before, but it could be useful.


I wish Unix and co. did it the Windows way: leave it to the invoked process to parse. I have an app that takes an SQL query on the command line and I have to remember I can't just type SELECT * because the shell thinks it knows what I mean by *.


I wouldn't change the way Bash works, because after a while one gets used to it. It's like working in a hazardous environment, full of spinning blades and spikes coming out of the wall at random intervals :)

But seriously, that's an accident waiting to happen, like a less knowledgeable user trying to delete a file named '*'.

And there are some really awful things you can do by abusing shell expansion. Like this example (WARNING: DON'T BLINDLY RUN THESE!).

Go to a directory with non-important files, and do:

  touch ./-f

  touch ./-r

  rm *
Since there's a good chance that all of your files have sensible names (that is, they don't start with weird symbols), that will expand to:

  rm -f -r [the rest of your files]
Oops. Kiss your files AND directories goodbye. That includes write-protected files that would give rise to a warning if you tried to remove them. And in a cruel, ironic twist, the files named '-r' and '-f' are preserved.

But the problem is actually made worse by the fact that most command line parsing libraries allow flags to be condensed into a single option (like -r -f can be condensed into -rf).

GNU utilities do this by default. This, coupled with expanding *, turns ANY file whose name starts with a - into a potential source of doom.

So, imagine that you have a file named '-truffle' in there somewhere. rm * expands to 'rm -truffle [your files]'. That is interpreted by rm as "rm -t -r -u -f -f -l -e [your files]" (note the presence of '-r' and '-f').

Gasp and horror!

However, your salvation is that rm halts if it encounters an unknown option (like -t). You can wipe your brow and sigh in relief, because you just dodged a bullet.

IIRC, the Unix-Haters Handbook dedicates a whole chapter to these types of landmines, and the "funny" thing is that most of it still applies, about two decades after it was written.


Note that even on Windows, you still have to deal with the shell interpreting what you type.

The biggest downside of the Windows way, though, is that there stops being one answer to "how do I pass arbitrary strings on the command line to another process?"

Sure, most programs use CommandLineToArgvW, but it's hard to find information on how a particular program parses its command line.

https://blogs.msdn.microsoft.com/twistylittlepassagesallalik... is an overview of the fun that is command line handling on Windows.


> Note that even on Windows, you still have to deal with the shell interpreting what you type.

Hypothetically that is not strictly true. You could create a shell that would be a very thin wrapper around CreateProcess and not do any interpretation or parsing of the child's arguments. A unixy shell must do at minimum some degree of parsing to split the arguments.
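
A sketch of such a wrapper, with a hypothetical run_verbatim helper and minimal error handling:

    #include <windows.h>

    /* A deliberately dumb "shell": hand the line to CreateProcess untouched
       and let the child make sense of its own command line. */
    BOOL run_verbatim(wchar_t *line)
    {
        STARTUPINFOW si = { sizeof(si) };
        PROCESS_INFORMATION pi;

        /* With lpApplicationName NULL, the first token of the line names the
           program; the rest of the line is passed through unparsed. */
        if (!CreateProcessW(NULL, line, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi))
            return FALSE;
        WaitForSingleObject(pi.hProcess, INFINITE);
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
        return TRUE;
    }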


That's your shell rather than the OS.

I've been writing my own Linux $SHELL and one of the features is that asterisks don't get expanded by the shell to avoid accidental bugs from the situations you described.


How would you invoke something like `rm *.bak` in such a shell? Do you have a separate command for globbing, such as:

  rm $(glob *.bak)


Yeah pretty much just that:

    rm @{g *.bak}
`g` returns a JSON array of file names and the @{} tells the shell to expand the array into parameters.

This way you don't need to worry about spaces in file names or other problems with escaping.


What if I wanted to delete the two files "@{g" and "*.bak}"?


You'd just put them in quotes as you already had done. In that regard the shell is designed to behave very similarly to Bash.


A more interesting comparison would be glibc vs. other libcs.

What are the significant differences?

Is "flexibility", for lack of a better word, really a desirable property for entering command line arguments?

Should parsing (as opposed to entering) arguments be simple or complex?

Are there better ways to pass arguments/parameters than on the command line?

Besides config files, which themselves may introduce parsing complexities.

Consider passing "arguments" as environment variables, e.g. as in daemontools' envdir. Variables can be read from "files" in a chroot directory.


I don't remember seeing this explicitly stated anywhere, but based on stuff I've read, I think there is a loose / informal Unix convention that some command-line programs follow about how they can be configured at runtime:

- command-line options override environment variables, which in turn override config files (for the same config setting, applied across all three).

I thought this was a good idea; it can lead to flexibility, though also some complexity.

I think this might be a good scenario for the use of those three methods of configuration:

- config file settings for common or rarely changing values of options: you set them and change them rarely, when you need to make the new values permanent for a while.

- environment variable settings for, say, a session of work: in that session you want to override the config file value and use the new value for (most of) the duration of the session.

- command-line options to override either of the previous two, say for just a command invocation or two (the setting then automatically reverts to one of the two previous methods' values).

Interested to know if others have noticed this pattern (or think it is one) and any comments, since as I said, I don't remember seeing it written down anywhere.
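
For concreteness, a minimal sketch of that precedence in C, with a made-up MYTOOL_COLOR variable and --color= option:

    #include <stdlib.h>
    #include <string.h>

    /* Resolve a hypothetical "color" setting in the conventional order:
       command line beats environment, environment beats config file. */
    const char *resolve_color(int argc, char **argv, const char *config_value)
    {
        for (int i = 1; i < argc; i++)
            if (strncmp(argv[i], "--color=", 8) == 0)
                return argv[i] + 8;          /* 1. command-line option */

        const char *env = getenv("MYTOOL_COLOR");
        if (env != NULL && *env != '\0')
            return env;                      /* 2. environment variable */

        return config_value;                 /* 3. config file / default */
    }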


> Variables can be read from "files" in a chroot directory.

Reading environment variables from process memory is less expensive (requiring only memory access) than reading files, which requires at least three syscalls (open, read, close) and thus multiple context switches, and, worst of all, disk access (unless you're putting the files in a tmpfs).

> Are there better ways to pass arguments/parameters than on the commandline?

If such a system were designed today, data would probably be strongly typed (passed as structured objects) rather than stringly typed (passed as text). In fact, to my understanding, this is what PowerShell does. I think that the conceptual simplicity of using free-form text is a virtue for Unix more than a burden, but that's also a matter of taste.


"... which requires at least three syscalls..."

As does every program that reads from a config file.

One does not have to use files to set the variables of course. This just works well for long-running-programs, e.g. daemons.

"... unless you're putting files in a tmpfs)"

The files are always in mfs.

As for strongly typed data and passing data instead of text, I prefer k to PowerShell. There is also a uniformity to APL built-in functions. The number of arguments is limited. PS is quite slow on startup and too verbose.

As for "free-form" sometimes rearranging arguments (as glibc is capable of) can cause more complexity than is warranted.

Consider a program like signify. Then consider gpg.

The best command line program "interfaces" IMO are the ones with the fewest options and the least possible variability in arguments. The best interface is no interface, etc.

One of the obvious benefits for UNIX programs of this nature is portability.


Another perspective (as a rhetorical question): Why does "command line argument(s)" have to be an array of strings, rather than a single string? Like, consider an alternative universe where C had `int main(const char* arg)`.


You mean, like this alternative universe? https://msdn.microsoft.com/en-us/library/windows/desktop/ms6...

Flipping this around, why does the 'command line' have to be a string (or a list of strings) that my program has to parse, rather than a ready-to-be-used data structure? Consider instead an alternative universe where the command-line arguments to your program are represented with something like a JSON object. Or better, a struct that the command-line interpreter can type-check.


Just the TOC on this made me laugh, haven't even read the rest yet


Mh. Is this a fair comparison? The Unix part is massively simplified, and the Windows part goes into greater and greater detail about the implementation of that entry point.

Bash command line splitting alone is nasty, let alone the six other shells. Every language under the sun has a dozen libraries to parse argv[] into something useful.

If you document every nook, cranny, and special case of those libraries, the list for Unix will dwarf the special cases of Windows. In fact, I think Bash command line parsing alone can dwarf the Windows section, going by the man page.


The reason the Windows section is longer is that in Windows it's the application that splits out argv[]. An application receives the parameters as one long string and splits it into argv[] itself, which means there isn't a whole lot of consistency in how argv[] gets populated.

Whereas on Linux / UNIX, parameters arrive at the application already split. Your point about the shells is a good one, because at least on Linux you only need to learn the nuances of your preferred shell (though in my experience all the main ones seem to follow the same rules reliably), knowing that you won't have any weird issues with, for example, quotes being included or not included in argv[] in some applications but not others despite formatting the parameters in exactly the same way.

From a personal point of view, I do find working with parameters in Windows a highly frustrating affair when compared with Linux. However, I appreciate the issue is down to inheritance rather than design, and Microsoft cannot easily change something like this without breaking a thousand things in subtle and often unexpected ways.



