
Everyone does this "wrong" because every app does this differently. The core reason Windows command-line options aren't (or at least shouldn't be) used to pass complicated data is that, way back when, DOS simply provided a single command-line string to the executed program and let the program parse it itself. So no two command-line parsers are the same. The glitch with escaping here is merely one symptom of a broader problem.
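
To make the single-string model concrete, here is a rough sketch in Python. It leans on subprocess.list2cmdline, an undocumented helper in CPython's subprocess module that joins an argument list into one string using the quoting rules of the Microsoft C runtime. The argument values are made up for illustration, and a program with its own parser is free to read the resulting string differently.

```python
import subprocess

# list2cmdline is an undocumented CPython helper; it targets the quoting
# rules of the Microsoft C runtime, not those of every Windows program.
args = ["copy", "my file.txt", 'say "hi"']

# On Windows the child process receives one flat string like this and
# has to split it back into arguments by itself.
command_line = subprocess.list2cmdline(args)
print(command_line)
```

The list structure is gone once the string is built; getting it back depends entirely on the receiving program's parser.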

Unix got this right by forcing the shell to provide the kernel a pre-parsed list of strings, so the only insanity the tool integrator needs to understand is the shell's quoting syntax. Which is still insane. But it's only insane in one particular way.




On Unix, most tools don't really need to use the shell at all; it's enough to treat argument lists as lists internally and pass them to exec or posix_spawn (which of course, unlike the Windows _exec and _spawn, aren't broken for arguments with spaces!).
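
As a small sketch of the "lists all the way down" approach, here it is in Python rather than C: subprocess.run with a list and the default shell=False hands the arguments to the child without any shell parsing. The child is just the current interpreter echoing what it received, and the argument values are invented for the example.

```python
import subprocess
import sys

# The argument list is handed to the child as-is; no shell parses it,
# so spaces and quotes inside an argument need no escaping.
args = ["two words", "it's", "-rf"]

# The child is the current Python interpreter, used here only as a
# convenient program that prints the arguments it received.
child = [sys.executable, "-c", "import sys; print(repr(sys.argv[1:]))"]
result = subprocess.run(child + args, capture_output=True, text=True, check=True)
received = result.stdout.strip()
print(received)
```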

However, at minimum it is still useful to shell-escape arguments when displaying them to the user (for ease of copy+paste), so it's unfortunate that many languages don't have any standard library function to do this - including C on POSIX, and Python before version 3.


But in comparison with MS systems, escaping strings for Unix shells is very simple: just prepend and append a ' and change every ' into '\''


Alas, no. Bourne shell syntax allows for double quotes with variable interpolation and some other fancy syntax (including backslash escaping of literal double-quote characters), and a single-quote syntax for "raw" strings with no fancy syntax INCLUDING backslash escaping.

So your rule won't work. You can't single-quote a string that itself contains single quotes, which makes for some fun when you have arbitrary strings (file names are the big frustration) that need to be substituted into a parseable command line.

But like I said above: Bourne syntax[1] is only one kind of insanity, which is still much better than the DOS/Windows world of a separate parser for every app.

[1] We shall not speak of the C shell here.


Their rule does work. The single quotes are transformed into '\''.

The first ' closes the current single quotes string. The \' adds a single quote character outside a quoted string (and outside single quotes you can use backslash escapes). Finally, the last ' reopens the single quoted string that we closed before.
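
A minimal Python sketch of that rule, for anyone who wants to check it. The function name sh_single_quote is made up here, and shlex.split (which follows POSIX shell word-splitting rules) stands in for a real shell when verifying the round trip.

```python
import shlex

def sh_single_quote(s: str) -> str:
    # Hypothetical helper: wrap s in single quotes and rewrite each
    # embedded ' as '\'' (close quote, escaped quote, reopen quote).
    return "'" + s.replace("'", "'\\''") + "'"

original = "it's a $HOME \"test\""
quoted = sh_single_quote(original)
print(quoted)

# Parsing the quoted text back should yield exactly one word, unchanged.
assert shlex.split(quoted) == [original]
```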


A method I've used: replace all occurrences of ' with '"'"' and then surround resulting string with '

It's a bit ugly but it works.
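
Sketched in Python, with the helper name sh_quote_dq invented for the example. Again shlex.split is used as a stand-in for a POSIX shell to check that the quoted form parses back to the original string.

```python
import shlex

def sh_quote_dq(s: str) -> str:
    # Hypothetical helper: each embedded single quote is written as
    # close-quote, a double-quoted single quote, reopen-quote.
    return "'" + s.replace("'", "'\"'\"'") + "'"

original = "don't panic"
quoted = sh_quote_dq(original)
print(quoted)

# The quoted text should parse back to one word equal to the original.
assert shlex.split(quoted) == [original]
```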


pipes.quote() has been in Python's stdlib since forever, though it was not documented until it became shlex.quote() in Python 3.3.
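
For the display use case, a short sketch with shlex.quote. The argv contents are invented for the example; the point is only that the printed line can be pasted into a POSIX shell and will split back into the same arguments.

```python
import shlex

# An example argument list, as a tool might hold it internally.
argv = ["rm", "--", "-rf", "my file.txt", "it's"]

# Quote each argument separately, then join with spaces for display.
display = " ".join(shlex.quote(a) for a in argv)
print(display)

# shlex.split stands in for the shell: the round trip must be lossless.
assert shlex.split(display) == argv
```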


_exec and _spawn are not Windows; they are functions in the MS Visual C redistributable run-time.

(Well, they are also in a system DLL called MSVCRT.DLL. That is an internal library which is undocumented and considered by Microsoft to be off-limits to applications.)


Unix got something right in that you can unambiguously pass a list of separate strings to launched processes. However, it does nothing to ensure unambiguous meaning of those strings.

This is for example why you should avoid giving your files such cute names as '-rf'.


Unix, in fact, does something for this.

Firstly, its IEEE standard (1003.1 or "POSIX") specifies the -- convention for separating option arguments from non-option arguments. The tiny handful of utilities like "echo" which do not implement it are also documented that way.

Secondly, Unix provides the POSIX standard getopt C library function, and getopts command. Programs and scripts which use these standard functions for processing options will implicitly support the -- convention.

Developers of new command line programs can ignore the documentation and standard functions, of course, developing their own non-conforming parsing from scratch. But at least users have something to point to if they report that as a problem: look, your program isn't supporting --, meaning that you ignored both the POSIX standard convention and the library function which enforces it.
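
To illustrate with something runnable: Python's getopt module is modeled on the C function (it is not the POSIX function itself), and it handles -- the same way. The parse helper and the file names below are made up for the example.

```python
import getopt

def parse(argv):
    # Hypothetical tool that accepts the flags -r and -f.
    # Returns (list of option names, list of operands).
    opts, operands = getopt.getopt(argv, "rf")
    return [name for name, _ in opts], operands

# Without "--", a file named "-rf" is indistinguishable from the flags.
flags_case = parse(["-rf", "notes.txt"])
print(flags_case)

# With "--", everything after it is an operand, even if it starts with "-".
file_case = parse(["--", "-rf", "notes.txt"])
print(file_case)
```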


> This is for example why you should avoid giving your files such cute names as '-rf'.

The kernel should ban these names. I'm a big fan of dwheeler's proposal for fixing filenames: see http://www.dwheeler.com/essays/fixing-unix-linux-filenames.h...

There is no god damn reason why a filename should be able to contain, say, LF, DEL, or BEL. None whatsoever.


Yes there is. You want the filesystem to be flexible. If the shell doesn't like those characters, use a different shell that doesn't care. It's brain-dead to create a filesystem that prevents flexibility in user interfaces.


Flexibility is only a good thing if the benefits outweigh the costs. I insist that there are no legitimate (i.e., no better option) use cases for control characters in file names. The filesystem being "flexible" is not a good thing if flexibility causes real problems.


> There is no god damn reason why a filename should be able to contain, say, LF, DEL, or BEL. None whatsoever.

OK you want ASCII 0x07 to be disallowed. Should a filename be allowed to contain "㜇"? (U+3707)


That's not a problem because the UTF-8 encoding of U+3707 will absolutely not contain any USASCII control characters, or any special shell or filesystem characters. It will all be bytes in the range 0x80-0xFF.


There are other encodings than UTF-8 though. Which is kind of my point. If you have your file system set to UTF-16 (doesn't NTFS do this?) then 0x07 will be present.
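
The byte-level difference is easy to check. A quick sketch in Python, using big-endian UTF-16 without a byte-order mark so the bytes line up with the code point:

```python
ch = "\u3707"

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")  # big-endian, no byte-order mark

print(utf8.hex(), utf16.hex())

# Every byte of the UTF-8 form has the high bit set, so none of them can
# be mistaken for an ASCII control character.
assert all(b >= 0x80 for b in utf8)

# The UTF-16 form contains the byte 0x07, but only as half of the single
# 16-bit code unit 0x3707, never as the code unit 0x0007.
assert 0x07 in utf16
assert int.from_bytes(utf16, "big") == 0x3707
```

Whether that stray 0x07 byte matters depends on whether the code handling the name works on bytes or on 16-bit code units.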


I also believe that filesystems should require that all filenames be fully normalized UTF-8. I don't think the benefits (slight, IMHO) of allowing filenames to be arbitrary byte strings outweigh the costs of code complexity and security problems.


That's not how UTF-8 works.


It is how UTF-16 (NTFS) works, though.


That doesn't count. Windows doesn't allow the 16-bit word 0x0007 to appear in filenames.


What? '-rf' is a specific set of flags for a specific program. You can't ban all possible flags for all programs in file names.


The operating system could address it by having a separate argument list and option list at the kernel level, creating an unambiguous interface for calling a program, giving it a list of options and non-option arguments.

Ambiguity would remain in how a given shell parses input to determine what are options and what are arguments: but this would at least be out of the control of individual programs. Notably, the shell would be the tool which parses the -- convention. Programs wouldn't see the -- delimiter which separates options from non-options, so it would be impossible for a program to neglect to implement support for --.


Yes, programs are free to interpret arguments any way they want. (See dd(1).) But in practice, almost all programs interpret a leading dash in an argument word to mean "here be options". By banning filenames with leading dashes, we close a large number of security holes at minimal cost. Of course it's not a total solution, but from a pragmatic perspective, it's the right thing to do, because it goes a long way toward solving a real problem.


Close what security holes? If someone isn't escaping input they are still screwed if you ban dashes.

It's like suggesting we don't allow sql to store quotes so we can use quotes to enclose data.


It's harm reduction. Yes, everyone should be escaping input. Yes, everyone should be using "./.foo" instead of just ".foo". But people don't, and they're not going to start. If we ban leading dashes, we stop these bugs from turning into security vulnerabilities.

Your stance is like being against ASLR because developers just shouldn't have buffer overflow vulnerabilities in their code.


What does it mean to say that an argument's meaning could be unambiguous regardless of the program it is passed into? That's a logical impossibility.

Still, it would be right and proper if Unix programs had a little type safety in their arguments, for example by requiring that ALL arguments be flags, as in this hypothetical smart_rm: "/bin/smart_rm -rf --pattern foo/bar"


You can escape file names that look like options with '--'.


This depends on the application recognizing the -- convention and also depends on having all the little scripts in your system remembering to use the --.


Even if the OS kernel provided a process launching API with separated options and arguments, that would not remove the need for the -- syntax to remove the ambiguity at the shell level, and hence your need to use that in scripts.

It would remove the problem of programs all being required to implement --.


> the only insanity the tool integrator needs to understand is the shell's quoting syntax

Only if the designer is using command-based functions like the ISO C system or POSIX popen; not when forking and exec'ing programs.


MS does provide an alternate startup routine that parses the arguments before entering main().



