This discussion has ABSOLUTELY NOTHING to do with SQL injection. That is a completely different issue, as you may indeed want and need to put all of War and Peace into a char field, but there is no valid use case for putting all of War and Peace into a filename.
Furthermore, it was quite traditional for Unix filenames to be written exactly as you have done above. (Although dashes would have been more commonly used than underscores.) And, in fact, it was considered fairly anti-social to put spaces, carriage returns, etc., in your filenames, as Unix utilities could not cope with them.
The fact that for decades Unix utilities could only cope with filenames written as you have above proves my point, not yours. Even now, when Unix utilities are finally getting to the point where they might be able to cope with any combination of characters in a filename (other than null or "/"), this has come at the cost of great extra complexity and much extra care on the part of the programmer. Which again proves my point, not yours.
Now, of course, whenever anyone asserts that something requires too much care, someone else will quite wrongly assert that programmers should not be encouraged to be lazy. That real programmers should work for a living. This is, of course, absurd. The best programmers will always be lazy... in the right way. They will want to solve a problem that accomplishes the most, in the least amount of time, with the least amount of code. I.e., they will want to be productive. People who make quips about lazy programmers are doing a huge disservice to the world by promulgating a world with less productive programmers who consequently accomplish less. Which means that, for all you know, cancer won't be cured when otherwise it might have been.
As to your proposed solution of having the shell escape special characters, rather than having the kernel disallow them, that could probably work too, at the cost of consistency. I.e., when you look at a filename in the shell, it's going to look different from how it will look in your GUI browser. E.g., the shell is going to display your filename in the manner that you have done, only it probably won't do as good a job as you would at encoding your intent into a limited character set.
And, if you are going to take this approach, it doesn't have to be by allowing special characters in filenames. It can be done either by encoding special characters in non-special characters, or by having a display name that is different from the filename.
Additionally, I find it amusing when people assert that having limitations in filenames is crippling system functionality, and yet they don't make the same assertion about identifier names in programming languages. Oh no! Having a limited set of characters in variable names has crippled my ability to program... NOT!
I am not in any way against file naming conventions. I am against enforcing a convention that is specific to one use case (programming) in a general-purpose system.
When a shell script is written for some specific task, you can rely on convention and receive all the productivity benefits even without kernel enforcement. If somebody creates files with carriage returns in a source tree because the kernel doesn't stop him, the problem is social, not technological.
> I find it amusing when people assert that having limitations in filenames is crippling system functionality, and yet they don't make the same assertion about identifier names in programming languages.
A programming language has a narrower field of use than an operating system. Naming a variable "Мой любимый щеночек (01.02 12:34).jpg" (my favorite puppy) is absurd. Having a file with such a name is perfectly reasonable.
My primary point is actually this: file names in general are not program internals. They are part of user data and should be treated as such.
> As to your proposed solution of having the shell escape special characters, rather than having the kernel disallow them
Not really what I meant. I was saying that a program (be it a shell script or an application calling 'system()') that intends to work on arbitrary, user-provided file names won't benefit from kernel-enforced limitations. "process $FOO" won't become protected from misuse and exploits if special characters are forbidden: the application will still have to check for "bar; rm -rf .", and checking and rejecting that is no easier than replacing it with "./bar\;\ rm\ -rf\ .". It's just calling escape_file_name instead of validate_file_name.
So:
1) Any productivity benefits provided by kernel file name limitations can be acquired by convention. (Which is what the UNIX world is doing.)
2) Such limits won't make anything safer. Building a shell command by blind concatenation of user-provided data will still be unsafe. If the user is trusted, see case 1.
3) Files are used not only by programmers. Imposing such limits will either degrade the user experience or lead to display name !== actual file name, with indirections and kludges much worse than touch -- "$FOO".
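To make the escape-versus-validate comparison above concrete, here is a minimal Python sketch. The function names validate_file_name and escape_file_name come from the comment; their bodies, the whitelist pattern, and the "./" prefix are assumptions for illustration, and escape_file_name as written only suits bare names in the current directory.

```python
import re
import shlex

# Hypothetical whitelist (not any real kernel rule): letters, digits,
# '.', '_' and '-', with no leading dash.
SAFE_NAME = re.compile(r"[A-Za-z0-9._][A-Za-z0-9._-]*")


def validate_file_name(name: str) -> bool:
    """Accept only names that match the assumed whitelist."""
    return SAFE_NAME.fullmatch(name) is not None


def escape_file_name(name: str) -> str:
    """Quote a bare file name so a POSIX shell sees it as one word."""
    # "./" keeps a leading dash from being read as an option;
    # shlex.quote adds single quotes when metacharacters are present.
    return shlex.quote("./" + name)


hostile = "bar; rm -rf ."

# Blind concatenation: a shell would treat ';' as a command separator.
unsafe_command = "process " + hostile

# Escaped: the whole name stays a single argument.
safe_command = "process " + escape_file_name(hostile)

print(validate_file_name(hostile))  # the whitelist rejects the name
print(shlex.split(safe_command))    # word splitting keeps the name intact
```

Either way the program makes one call per file name before building the command; the difference is whether the odd name is rejected or passed through safely.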
> If somebody creates files with carriage return in source tree because kernel doesn't stop him, problem is social, not technological.
Your position boggles and dismays me. I have seen so many heinous bugs that appear only intermittently, and are nigh impossible to track down, due to this kind of issue. The problem absolutely positively is not social. It is technical. The only social thing about it is that people persist in taking the wrong side on this issue.
As to having different rules for filenames in different places, that is just nuts. Programs should not be fragile, and shouldn't have subtle edge cases. Having software work that way has all sorts of downsides and hidden costs. E.g., people need to remember a lot more. More documentation is needed. Things go wrong when they didn't have to. All of this costs time and money and helps to sap enthusiasm as people track down chimeras they shouldn't have had to.
Furthermore, one of the prime use cases for scripting is by system administrators, and such scripts need to handle all files. The stories of sysadmin scripts that have run afoul of files with strange filenames are legendary.
Regarding your example with "process $FOO": that's completely a red herring. You might as well assert, "We can't solve everything, so we should solve nothing." In this particular case, we were talking about the problems caused by filenames that are hard to deal with in a scripting environment, not about programs directed by user input. The first problem is easily solvable once and for all, while the second problem is less so and will always require care. Just because some things require great care does not mean that we should make all things require great care.
I just can't fathom that there are still people who actually argue for a world that fosters subtle bugs and lack of robustness. It is downright wrong, and it may someday be our undoing. Quite literally.
Spaces (parentheses, semicolons, bangs, etc.) in file names are not subtle edge cases if you consider the system as a tool for reaching the user's goals. Programs have to process file names with spaces not because of kernel aesthetics, but because users want and need files with normal, readable names.
Actually, I would happily agree to ban \n as an edge case: it's useless for the end user, and scripts need a readable file name separator (just like \0, which is forbidden because it is useless for users and extremely inconvenient to work with in C).
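The separator problem can be shown in a few lines of Python. This is only a sketch with made-up file names: as long as \n is legal inside a name, a newline-separated listing is ambiguous, while a \0-separated one (the idea behind find -print0 and xargs -0) is not.

```python
# Made-up names; the second one contains an embedded newline.
names = ["notes.txt", "two\nlines.txt", "my file (1).jpg"]

# What a tool emits when it separates entries with newlines or with NULs.
newline_stream = "\n".join(names)
nul_stream = "\0".join(names)

# What a consumer recovers by splitting on the same separator.
parsed_by_newline = newline_stream.split("\n")
parsed_by_nul = nul_stream.split("\0")

print(len(names), len(parsed_by_newline), len(parsed_by_nul))
```

Splitting on newlines yields one entry more than was written, because the embedded newline is indistinguishable from a separator; splitting on \0 gives back the original list.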
What if users want rich text in their filenames? Why shouldn't they have the ability to do that? And certainly they want slashes in their filenames! But Unix doesn't give them that either. Horrors!
What people want most of all is reliable, robust software. Features that don't work right are worse than no feature at all. What you fail to consider is that every feature has a cost. In this case, the cost was WAY too high. If this cost is to be paid, then it should have been paid in a lower-cost manner.
Contrary to what you say, I'm perfectly sure that users would have dealt fine with having more limited filenames. In fact, they did quite fine with 8.3 filenames for many years. I must concede, however, that those were more limiting than humans should be forced to adapt to.
This being said, I have nothing against giving people the ability to have all of these things in the display name for a file, if it is deemed that the extra flexibility is worth the trouble. This extra flexibility just shouldn't be in the unique identifier for a file. There are perfectly good ways to provide this capability in a manner that has far fewer costs.
Alternatively, I'm not opposed to adopting the attitude of the kernel hackers and shifting the burden onto the shells to generate such meta-character-free identifiers from richer display names, but if that was the way it was to be, it would then have been essential that a standard library for generating such unique identifiers from display names be created, and that the shells uniformly use this library.
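As a rough idea of what such a library could look like, here is a Python sketch that maps a display name to an identifier by percent-encoding it and back again. The names display_to_id, id_to_display, and ALLOWED are hypothetical, the choice of percent-encoding is an assumption rather than anything proposed in this thread, and the sketch ignores length limits and the special names "." and "..".

```python
import string
from urllib.parse import quote, unquote

# Characters the generated identifier may contain (assumed set).
ALLOWED = set(string.ascii_letters + string.digits + "._%-")


def display_to_id(display_name: str) -> str:
    """Map a rich display name to a meta-character-free identifier."""
    # Percent-encode every byte except ASCII letters, digits, '_', '.', '-', '~'.
    encoded = quote(display_name, safe="")
    # '~' can trigger tilde expansion in shells, so encode it as well.
    encoded = encoded.replace("~", "%7E")
    # A leading '-' could be mistaken for an option, so encode it too.
    if encoded.startswith("-"):
        encoded = "%2D" + encoded[1:]
    return encoded


def id_to_display(identifier: str) -> str:
    """Recover the display name from an identifier."""
    return unquote(identifier)


name = "Мой любимый щеночек (01.02 12:34).jpg"
identifier = display_to_id(name)
print(identifier)
print(id_to_display(identifier) == name)
```

The mapping is reversible, so a shell or file browser could show the display name while scripts only ever handle the encoded form.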
> There are perfectly good ways to provide this capability in a manner that has far fewer costs.
Not really. It's either some specialized tools (throwing away all the benefits of environment uniformity) or another layer of indirection (display name -> real name -> inode), with its share of bugs (and having two close, often equal, but different identifiers won't make programming any less error-prone).
> shells to generate such meta-character-free identifiers from richer display names
You still need to pass rich display names to the shell, so the old problems are still there, and, on top of that, a consistent mapping of display names to real ones is required.
> In this case, the cost was WAY too high
-- "$FOO" instead of $FOO, and ls -1 instead of ls?
(Not counting \n here, because in that case I agree about its abysmal benefit/cost ratio and about banning it.)
In summary, history has proved me right.