TL;DR: "Because you aren't using our snazzy new product; try it today!"
Also:
This is primarily because of the most obvious problem:
We are still using relational SQL databases.
Were we to use NoSQL database systems such as MongoDB
or CouchDB, none of these attacks would ever happen
Yeah, and motorbikes are safer than cars because the third and fourth wheels never fall off...
When combining software components, you have to gateway between them -- this is done by escaping inputs and validating outputs. It's often non-trivial to grok where the junctures between components are, and how to properly connect them. This requires thought, experience, and diligence.
Why it's so common is that you can easily connect components incorrectly, still have it "work" enough to do a demo, and, there is lot of pressure to put things that appear to work into production without auditing them.
Addendum: Shish2k observes this article claims NoSQL databases would somehow solve the problem, rather than seeing NoSQL as just a different kind of component with it's own escaping/validation needs. Classic -- someone writing on this very topic is blinded to the phenomenon he discusses; the article itself demonstrates why we have SQL injections.
Something about that sounds wrong. It is as though you are suggesting that we use in-band signaling with a bunch of notch filters to ensure that Cap'n Crunch whistles cannot be used to get free calls. The right answer is out-of-band signaling -- in other words, not constructing queries / command strings / etc. from user inputs.
Major SQL databases all support prepared statements; this seems like a far more robust approach than trying to sanitize inputs.
Any sane programming community would string me up by my thumbs for doing this, and rightly so.
Yet, do the exact same thing in SQL and it's just business as usual. Occasionally people talk about prepared statements and parameterized queries and such, but usually people just talk about escaping. As if the problem is just that you need more backslashes, not that you're glomming strings together at runtime and then calling eval() on them.
It's so bizarre. People would think you're crazy for doing it in one environment, but doing it in a different environment where the exact same problems are present is just fine! I can't fathom how we got to this place.
I should have used properly "construct input" (as suggested by dragon-writer) rather than "escape". If possible, you'd use query parameters or some other reusable and verified mechanism to encode user content. If you're embedding a data stream of one type within another, you still have to manage boundaries to encode content properly.
I'd note for database languages, query construction is often two phase, where you build your query template based on selection/filtering needs, and then, use parameter substitution with user-provided content. I've seen experienced programmers do both at the same time and resort to manual escaping of user content rather than seeing them as distinct phases.
I think it's about managing the movement of information between contexts... one process' data is another process' code. The most important context switch happening between untrusted inputs and the rest of your system.
I apologize for deleting my other comment; I was trying to shorten it.
You make good points, and I'm trying to figure out exactly how your points and mine hook together.
I think what it ultimately comes down to is distinguishing between data and code. More specifically, data can contain code, but code should never contain (foreign) data. So escaping in general is fine, but escaping in this context, where you're building a string of code, is no good.
That, I think, is why I'm comfortable with e.g. the escaping done to a string when emitting JSON, but not with escaping a string to put it into an SQL query.
With parameterized queries, you're still ultimately passing the code and data over to the database over some sort of stream (assuming an out-of-process database server), but hopefully that stream is designed as data containing both the parameterized SQL code and the parameters, rather than just SQL code with escaped parameters.
Doesn't the database server just escape the data in the parametrized query? So if you used the database's escape functions directly wouldn't it be the same thing? I guess if you are arguing that developers are more likely to mess it up than the database server then I see your point.
Then when you run the prepared query, it can simply call the closure, putting the right value in the datastructure without escaping anything, and continuing on from there.
Of course, the realities of it are more complicated than that for optimization reasons, but that's the basis of it.
> Occasionally people talk about prepared statements and parameterized queries and such, but usually people just talk about escaping.
Occasionally? For many years "preferred prepared statements and use user input, sanitized (via escaping and/or more involved means) or not, only with a very special need that where prepared statements don't do what you need, and then be as restrictive as possible in what you accept" has been what I've heard everywhere.
Lucky you! I've not seen it be so common. Obviously the smart guys who properly understand what's going on are all about parameterized queries, but I still see a ton of other people talking about escaping.
To double-check and make sure I'm not just being biased, I did a Google search for "php mysql tutorial" and read through the top five results. Not a single one mentioned parameterized queries. One of the tutorials didn't mention using foreign data in queries at all, just hard-coded query strings. Of the other four, two escaped parameters and two just put everything in single quotes and called it a day.
The PHP community has an unusually reckless disregard for proper programming practices. Poisonously bad tutorial sites like w3schools are not helping either.
Java, Perl, Ruby, Python, C#, NodeJS and virtually other language or framework strongly encourages the use of placeholders. PHP's community stands alone in stubborn opposition to this despite having facilities to do this.
PHP is also unusually hostile to frameworks and ORM-like database layers even when these would solve a myriad of problems without imposing too greatly.
Honestly, on the whole PHP programmers in general are like hunter gatherers who see no use for anything other than the most basic of tools. There are exceptions, but I figure these people usually graduate to other communities where the best practices they follow have already been fully embraced.
I completely agree. However, the PHP community is still a huge part of the server-side web community, so they're largely representative even if nobody else follows their lead.
Representative of themselves, really. The point being that even if PHP is different from everything else, PHP is big enough that their wacky antics are still significant, and PHP alone is enough to say that escaping parameters in SQL is still widespread.
I have no problem acknowledging that the dominant advice in the domain of PHP MySQL tutorials has been to use escaping, and that that's a serious problem.
>Any sane programming community would string me up by my thumbs for doing this, and rightly so.
How do you explain shell scripting then? Bash and co. have been 'industry best practice' for decades. They feature all these problems of in-band communication and (attempts at) escaping and nobody seems to consider it a problem.
Shell scripting is another insane domain where stuff that would be crazy everywhere else is suddenly accepted. Fortunately, they at least recognize that it's usually a bad idea to feed potentially hostile data to local scripts.
> When combining software components, you have to gateway between them -- this is done by escaping inputs and validating outputs.
Well, no, its done by constructing outputs from inputs in one module, and validating inputs in the next module (you can validate outputs, as well, but that's less critical.) "Escaping" is one method of constructing outputs from inputs, but its often an unsafe and undesirable method.
Agreed! In general this is a problem of defining and abiding by valid inputs and valid outputs of a system.
SQL statements (unparameterized) have a poor distinction between what is an element of the SQL language and what is permitted data input: where that boundary can be crossed, injection attacks become possible.
Unrealistic deadlines and sales reps promising features to clients is a large part of the problem. Get the changes in, test the usability, and ship it. Get it out to the client pronto. We have an earnings report due by X date and this will help the bottom line.
It's not that devs and DBAs don't care, it's that in many cases (not all) they aren't given the time to carefully test the system to see how it might be abused.
And also, many companies view security testing as an afterthought or an audit requirement. If they do not have to do it (time is money) they won't. And when they are forced to do it, they rush through it.
In some ways this is a philosophical fight between a group that wants to ship software and make a profit and a group that wants to design and test systems to make sure they are correct. Suits vs technologists.
How the hell do deadlines have anything to do with SQL injection? Writing
$stmt = $dbh->prepare("SELECT * FROM users WHERE USERNAME = $username AND PASSWORD = $password");
$stmt->execute();
vs
$stmt = $dbh->prepare("SELECT * FROM users WHERE USERNAME = ? AND PASSWORD = ?");
$stmt->execute(array($username, $password));
adds no time or effort. SQL Injection happens because devs haven't been taught to do the right thing or are too dumb to remember to do the right thing. Both conditions are easy to fix: teach the first group, fire the second.
I understand this. When security is not part of the culture of the company, and management just wants to ship software, devs won't have time to test. MS used to be this way (back before XP SP2). They stopped and made security a focus and allowed devs to work on securing the software. Today, they are much better off because of that. Not all companies think about security. Many only think about the bottom line.
Because there are developers and project managers out there who still don't understand it very well, or (dare I say) don't care enough to understand it. There will always be people who will just get the work done to a level that will reasonably satisfy their bosses, so that they can take their paycheck, and go home.
This is not software engineering alone, by the way.
SQL injection is so prevalent because developers overwhelmingly think about their data in terms of its representation, rather than its meaning.
User input and SQL statements are two completely separate things, but it just so happens that they can both be represented in memory using the same layout. Hence many developers stop thinking in meaningful terms like "user input" and "SQL query", and instead start thinking in terms of implementation details, like "strings".
"Systems Hungarian" has the same cause; Hungarian notation means prefixing variable names with useful information about the contents, for example "dAge" could be the difference between two ages and "cUsers" could be a count of how many users there are. When this practice became enforced at, for example Microsoft, it quickly degraded into the meaningless "Systems Hungarian" which just gives unhelpful implementation details, eg. "iAge" and "iUsers" to indicate that they're both integers.
Widespread use of Systems Hungarian, enforced by clueless management, lead to many developers hating Hungarian notation completely, despite never knowing that it has nothing to do with machine representations.
SQL is hard to get rid of injections and here's why I think it is so.
How would an ideal injection-free application look? I'd imagine a set of unmutable precompiled SQL statements (the code) each controlled by a set of parameters (the data.) No gluing of statements from strings at runtime and the parameters are obviously passed out-of band. You can't forget to escape user-provided data because in this setup the SQL code and the user data never mix, so escaping is not necessary.
Unfortunately is not feasible to use the above design with SQL. A simple filter with a parameter that can be unspecified (as in "don't care") would require 2 different precompiled statements ("select ... from foo" and "select ... from foo where param=<placeholder>".) The number grows exponentially with the complexity of the filter, so the runtime construction of SQL statements is inevitable.
You can go about it in a smarter or a dumber way but the danger remains.
This paper: http://swerl.tudelft.nl/twiki/pub/Main/TechnicalReports/TUD-... describes a general way to deal with injection attacks (not just SQL injections), namely adding the syntax of the "guest" language (such as SQL) to the "host" language (such as Java, e.g.
SQL q = <| SELECT id FROM users WHERE name = ${userName} AND password = ${password} |>;
The <|...|> construct switches from Java to SQL, while ${...} switches from SQL to Java. The desugaring process that translates the Java+SQL to plain Java ensures that escaping is done at the right places. This approach is based on SDF2, a language for defining grammars that allows context-free grammars to be combined in almost arbitrary ways.
It's not hard. It's a solved problem. People just choose to remain ignorant of these best practices.
Binding data to a query after the fact is a reliable way of escaping. Composing a prepared statement is not hard, and creating these at runtime is not a big deal.
Honestly, for most of your daily work you should be using an ORM of some kind where this is already done for you. The only reason you should be writing direct SQL is when you're doing things the ORM doesn't natively support.
Ruby on Rails is currently 100% injection free as far as anyone can tell, and if it isn't there'd be an emergency patch issued right away. There have been situations in the past where this was not the case, but these bugs were quickly addressed after being discovered. Most other frameworks (Django, etc.) are in the same category.
How about we stop calling everything "strings" (and "ints" too, while we're at it)?
UserInput readFromPost(HttpParam param) {
// Read param from POST here
}
Collection<SqlRow> query(SqlQuery q) {
// Send query to DB here
}
UserInput name = readFromPost('name');
Collection<SqlRow> result = query(
sqlConcat('INSERT INTO users (name) VALUES (',
stringToSql(userInputToString(name)),
')')
SqlInt<10> userId = sqlRowLookup(result, 'id')
sendToBrowser(htmlConcat('You are user number ',
sqlIntToString(userId)))
Clearly we can get better APIs than this, but it's not as difficult as you make out to program in a safe and sane way.
Good point, but this would require composable SQL in the typesystem, right? Maybe libraries like SQLAlchemy come close to that (or maybe are already there?) but look how long it took.
Type systems tell you when your code is wrong. You don't have to do any SQL manipulation in your type system, you just make sure that SQL and non-SQL data have different types, and carry on as normal. If your code is joining functions together in an unsafe way, the type system will reject it and you have to go back and change it. If it's safe, they type system accepts it and then the code can be compiled into some untyped* language (assembly, Javascript, JVM bytecode, whatever)
* By 'untyped' I mean that the type-system that you were using is no longer enforcing anything. There may be another type system in the compiler target.
It would actually be possible to use a type system to compose SQL; we could use this to guarantee there are no syntax errors. For example we could have a types "SqlTable", "SqlQueryType", "SqlWhereCondition", "SqlComparisonOperator", etc.
However, we don't need to do anything nearly so elaborate to stop SQL injection. We just need a type "SqlQuery", since that will be instantly unusable by all string concatenation functions. We then make a concatenation function for SqlQuery values and a "stringToSqlQuery" function (or ".toSqlQuery" method, if you prefer) which converts strings to SqlQuery values by escaping them.
This way, we've turned SQL escaping into a type coercion, so we can only pass the type checker by escaping every string we put in our queries. Note also that it solves the double-escaping problem: since escaped strings have a different type to unescaped strings, we can't send them back through the escape function; ie. "stringToSqlQuery(stringToSqlQuery(foo))" is a type error.
Sure, probably true. But I would assume most people who write insecure code do so because they don't know how to do it the right way, not because "must be secure" wasn't in the requirements doc.
If security (operationalized properly) was part of the requirements against which it was evaluated prior to release, insecure code wouldn't be released.
("Must be secure" is a much higher level requirement than anything that is testable, but a high level requirement is meaningless except to the extent its operationalized into lower-level requirements that are testable -- or analytically provable, but that's even harder.)
Isn't this more about using strings to build executing code from untrusted input?
This kind of stuff always seems to surface when one language is embedded in another and this tends to be database query languages and one of the most popular is SQL, but it's not limited to it, just look at javascript injection from parsing JSON using eval.
I am not seeing injection going away until we stop sending bits of code between runtimes as strings, which will be hard to avoid so long as databases and applications use different languages. It's a lot of work to abstract one language into the other (See LINQ to SQL), where as string concatenation is understandably the easy way out.
Ideally we could all agree on a good binary data format(something like Go GOBs) and a AST structure to serialize to it then we would have the common building blocks to send dynamically built code between runtimes/languages without using strings.
> Isn't this more about using strings to build executing code from untrusted input?
No.
Its about lack of care in building executing code from untrusted input. If you took data from the outside world, converted it into x86 machine code, and sent to it another system, you could still have the same problem if you didn't handle the input properly.
SQL-by-string-interpolation/concatenation is an easy target because its a common enough mistake that the same style knowledge of how to exploit the holes in it can be used against lots of targets, not because the transfer format from the poorly-built system generating the query to the critical backend system handling the query uses strings.
My point is it's much easier to lack care when using strings. If for example SQL where only exposed as an AST object model to the client language injection would be much harder to accidentally allow. Also you could have another layer of your app actually examine the AST for security issues (no DML statements allowed) or certain tables restricted etc.
We actually use a component in our app the builds an AST from SQL which can then be verified, this is not trivial. Also LINQ builds an AST which is then transformed to SQL statements in LINQ to SQL, this AST can also be examined before executed and prevents injection.
Also:
Yeah, and motorbikes are safer than cars because the third and fourth wheels never fall off...