Perl has a special "taint" mode (Ruby does too, I think, but I don't know about the differences or similarities), where any data coming from outside the program - environment variables, network connections, files, etc. - is considered "tainted", i.e. treated as untrusted, until you validate it using regular expressions (or something along those lines). I have to admit I never tried it, but assuming it works as advertised, it sounds like a good idea when dealing with potentially malicious input, or if you're simply paranoid.
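To make the "tainted until validated" idea concrete, here is a rough sketch in Ruby. It is not Perl's mechanism and not a built-in Ruby feature (Ruby's own taint tracking was removed in 3.2); the `Tainted` class and its `untaint` method are hypothetical names made up for illustration. The only way to get a usable string out is to match the whole value against a pattern.

```ruby
# Hypothetical wrapper for illustration; not Perl's or Ruby's built-in taint support.
class Tainted
  def initialize(raw)
    @raw = raw
  end

  # Return the raw value only if the *entire* string matches the pattern;
  # otherwise raise, so unvalidated data never escapes by accident.
  def untaint(pattern)
    match = /\A(?:#{pattern.source})\z/.match(@raw)
    raise ArgumentError, "input failed validation" unless match
    match[0]
  end

  # Keep the raw value out of accidental string interpolation and logs.
  def to_s
    "<tainted>"
  end
  alias inspect to_s
end

user_id = Tainted.new("12345")
puts user_id.untaint(/\d+/) # passes validation

hostile = Tainted.new("12345; rm -rf /")
begin
  hostile.untaint(/\d+/)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

The design choice here is that validation is the only exit: you cannot use the value without saying what shape you expect it to have.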
> but ruby doesn't track external strings as dirty like that.
Yes, it does. E.g:
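# Needs Ruby 2.6 or earlier: taint tracking became a no-op in 2.7 and was removed in 3.2.
x = STDIN.gets      # data read from outside the program is tainted
puts x.tainted?
y = "hello "+x      # strings built from tainted data are tainted too
puts y.tainted?
z = "foo"           # a literal in the program itself is not tainted
puts z.tainted?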
The first two will print "true", and the last will print "false". If you set the safe level appropriately, some methods (e.g. eval()) will be disallowed for x and y above. I would not trust that to be comprehensive, but it's a lot better than nothing.
There are simple ways to "escape" user input (as in, ensure the whole input string is interpreted as a single argument to the program) so that a user can't use a simple && or ; to execute a totally different command. But the point of the article is that even if the input is properly escaped, users can still do malicious things when it is passed to lots of standard UNIX utilities.
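A small sketch of both halves of that point, using Ruby's standard `shellwords` library. The input strings are made up, and nothing is executed; the code only builds command strings. Escaping neutralizes the `;`, but a value like `-rf` contains nothing to escape and would still be read as options by the utility. Many (not all) utilities accept `--` to end option parsing, which is assumed here for `rm`.

```ruby
require "shellwords"

# Shell metacharacters are neutralized: the whole string stays one argument.
hostile = "foo; rm -rf /"
escaped = Shellwords.escape(hostile)
puts escaped

# Nothing here is special to the shell, so escaping changes nothing,
# yet the receiving utility would still treat it as options.
option_like = "-rf"
puts Shellwords.escape(option_like)

# Assumption: the target utility honors "--" as end-of-options.
command = ["rm", "--", option_like].shelljoin
puts command
```

So escaping protects you from the shell, not from the program the shell ends up running.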
What I wish for is a flag in Unicode to declare characters as 'unsafe user input', so that system utilities and databases could recognize unsafe user input and barf on it.
I don't think you even understand the concept of in-band and out-of-band; that's not a function of the encoding. And I've written protocols aplenty in the days when not everything ran on top of HTTP: high-speed serial links, with and without virtual circuits (so mux-demux), and a whole slew of others.
Just to make sure you are on the same page as the rest of us here: in-band versus out-of-band is a way to distinguish whether meta information about the data stream is sent through the same channel as the original data or through a separate one. With in-band signalling it shares the channel, so you need an escape mechanism for that: control characters and such.
Out-of-band signalling indicates that all meta information about the data stream travels through a different (virtual) circuit, in which case there can never be confusion about whether a given chunk is data or meta info.
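A toy illustration of the difference, as a sketch only. The framing scheme (newline as end-of-message marker, backslash as escape character) and the function names are invented for this example and don't correspond to any real protocol. In the in-band case the payload has to be escaped so it can never contain the marker; in the out-of-band case the control information lives in a separate channel and the data is passed through untouched.

```ruby
ESC = "\\" # escape character (arbitrary choice for this sketch)
EOM = "\n" # end-of-message marker, sent in the same channel as the data

# In-band: the marker shares the channel with the data, so any marker or
# escape character inside the payload must be escaped before sending.
def encode_in_band(payload)
  payload.gsub(/[\\\n]/) { |c| ESC + (c == "\n" ? "n" : ESC) } + EOM
end

# Strip the marker, then undo the escaping.
def decode_in_band(frame)
  frame.delete_suffix(EOM).gsub(/\\(.)/) { $1 == "n" ? "\n" : $1 }
end

message = "two\nlines and a \\ backslash"
frame = encode_in_band(message)
puts frame.inspect
puts decode_in_band(frame) == message

# Out-of-band: control information travels in its own channel, so the
# data channel carries the payload verbatim and needs no escaping at all.
data_channel = []
control_channel = []
data_channel << message
control_channel << :end_of_message
puts data_channel.first == message
```

Forget the escaping step in the first half and a payload can forge an end-of-message marker; in the second half there is nothing to forget.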
This is a point that should have been made in the author's article. As it is, it presents a poorly defined problem without offering solutions (other than "be afraid").
In other words, even if it's not obvious, doing this creates a security vulnerability: