Can we please stop trying to enumerate badness[1]? When parsing input it is possible to define the set of valid input, not all possible invalid inputs.
Also, anybody accepting input from an untrusted source (such as anything from a network or the user) who isn't verifying the data with a formal recognizer is doing it wrong[2]. Instead of writing another weird machine, guarantee that the input is valid with a parser generator (or whatever): recognize the input and drop anything even slightly invalid.
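As a rough sketch of what "recognize first, then act" can look like for a very small input language: the wire format below is invented purely for illustration, and a regular expression stands in for a generated parser, which is only reasonable because this toy grammar is regular.

```python
import re

# Hypothetical wire format, invented for this example:
#   command := "SET " name " " integer
#   name    := 1-16 lowercase ASCII letters
#   integer := 1-6 ASCII digits
COMMAND = re.compile(r"SET ([a-z]{1,16}) ([0-9]{1,6})")


def recognize(raw: bytes):
    """Return (name, value) if raw is in the grammar, otherwise None."""
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError:
        return None  # not even in the alphabet of the grammar
    match = COMMAND.fullmatch(text)  # the whole input must match, no leftovers
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

Everything downstream only ever sees the (name, value) pair, so there is exactly one place to audit. Anything that is not in the grammar, including a trailing newline, is dropped.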
> Can we please stop trying to enumerate badness[1]?
No. Because we don't know what goodness looks like.
The world can be separated into good, bad and unknown. If you classify anything unknown as bad then anything new is DOA. People aren't going to add new things to the whitelist before they're popular which means they can never become popular. It's stasis.
But people do that anyway, which makes the good guys have to adopt the MO of the bad guys and make the new thing look like the existing thing. So everything uses HTTP and everything looks the same.
Which means everything is more complicated than it needs to be, because it has to pretend to be something else, which creates more attack surface.
And which means the whitelist is no longer meaningful because allow-http becomes equivalent to allow-everything.
It's like buying a car that can only drive to home and work on the theory that it will be safer. It will be at first, except that you can no longer go anywhere but home and work. But when enough people do that then everything (including the bad stuff) has to move to where people are allowed to go. Which puts you right back where you started except that now you have two problems.
You're writing the parser, so you define the set of acceptable input.
> The world can be separated into good, bad and unknown
The data your software receives as input can be separated into valid input that your software will correctly interpret, or invalid input that is either an error or an attack.
There shouldn't ever be any "unknown" input, as that would imply you don't know how your software parses its input. As the CCC talk linked in my previous comment [2] explains, this may be true if recognition of input is scattered across your software and thus hard to understand as a complete grammar. Thus the recommendation to put it all in one place using a parser generator (or whatever).
> If you classify anything unknown as bad then anything new is DOA.
Anything unknown is by definition not properly supported by the software you're writing.
> Anything unknown is by definition not properly supported by the software you're writing.
This seems to be where you're going wrong. There is no god-mode where you can see the whole universe and perfectly predict everything that will happen in the future.
Your code has to do something when it gets a URI for a scheme that didn't exist when you wrote your code. The handler for that URI is third party code. Your code can either pass the URI to the registered handler or not.
And if the answer is "not" then it will be prohibitively difficult for a new URI scheme (or what have you) to gain traction. Which means every new thing has to be shoehorned into HTTP and HTTP becomes an ever larger and more complicated attack surface.
You seem to be assuming a lot about a development environment that was never specified. This is about writing software that handles input from an external, potentially hostile source. Parsing URLs that were supplied by the user is one example of that.
> Your code has to do something when it gets a URI
Yes, that's exactly my point. You need to define what your code will do with any URL - actually, any input, including input that is malformed or malicious - which includes both known and all possible future schemes.
For this specific example, the correct thing to do is recognize that e.g. your software only handles http{,s} URLs, so every other scheme should not be included in the recognized grammar. Any input outside that is invalid and dropped while dispatching any necessary error handling.
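A minimal sketch of that, assuming Python's standard urllib.parse is an acceptable parser for your purposes (the function name accept_url and the ALLOWED_SCHEMES set are made up for the example):

```python
from urllib.parse import urlsplit

# Assumption for this example: the application only handles these schemes.
ALLOWED_SCHEMES = {"http", "https"}


def accept_url(url: str):
    """Return the parsed URL if it is http(s) with a host, otherwise None."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return None  # urlsplit rejects some malformed input, e.g. bad IPv6 brackets
    # urlsplit lowercases the scheme, so "HTTP://..." is treated like "http://...".
    if parts.scheme not in ALLOWED_SCHEMES or not parts.netloc:
        return None
    return parts
```

Returning the parsed result rather than the original string means later code works with the same interpretation that was validated. The caller does its error handling on None.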
> third party code
...is off topic. This is about handling input to any code you write. Any 3rd parties also need to define what they accept as input.
> it will be prohibitively difficult for a new URI scheme (or what have you) to gain traction.
That is a separate problem that will always exist. You're trying to prematurely optimize in an insecure way. Worrying about potential future problems doesn't justify writing bad code today that passes hostile data without verification.
If you know that a URL scheme - or collection of schemes - will be handled properly, then define it as valid and pass it along. If it isn't handled or you don't know if it will be handled properly, define it as invalid and drop it. Doing otherwise is choosing to add a security hole. The same goes for every other byte of data received from a hostile source.
> You seem to be assuming a lot about a development environment that was never specified.
The position you've staked out is "stop trying to enumerate badness." All I need is one good counterexample.
For example, Google Safe Browsing maintains a blacklist of malicious domains that clients can check. Are you suggesting that they should whitelist domains instead? What about subdomains? IP addresses?
How about email addresses for spam filtering?
You often don't have good (or any) information about whether a given instance of a thing is malicious or not. Blocking all such things also blocks the innocent things. In some contexts that's a cost you have to pay, but as a general rule it's not something you want.
> Yes, that's exactly my point. You need to define what your code will do with any URL - actually, any input, including input that is malformed or malicious - which includes both known and all possible future schemes.
You have to define what your code will do, but what it should do is the original question.
> For this specific example, the correct thing to do is recognize that e.g. your software only handles http{,s} URLs, so every other scheme should not be included in the recognized grammar.
That's just assuming the conclusion. You could also use a grammar that accepts any RFC3986-compliant URI that has a handler available for its scheme, and have the handler be responsible for malicious input.
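Roughly what I mean, as a sketch: the registry and function names here are hypothetical, and only the scheme syntax from RFC 3986 is checked up front, not the full URI grammar. Everything after the colon is handed to the handler as an opaque string, and the handler has to validate it.

```python
import re

# Scheme syntax per RFC 3986: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
URI = re.compile(r"([A-Za-z][A-Za-z0-9+.-]*):(.*)", re.DOTALL)

HANDLERS = {}  # hypothetical registry: lowercase scheme -> callable


def register(scheme: str, handler):
    """Register third-party code as the handler for a scheme."""
    HANDLERS[scheme.lower()] = handler


def dispatch(uri: str):
    """Pass the scheme-specific part to the registered handler, if any."""
    match = URI.fullmatch(uri)
    if match is None:
        raise ValueError("input does not start with a valid URI scheme")
    scheme, rest = match.group(1).lower(), match.group(2)
    handler = HANDLERS.get(scheme)
    if handler is None:
        raise LookupError(f"no handler registered for scheme {scheme!r}")
    return handler(rest)  # the handler is responsible for validating `rest`
```

A scheme registered after this code was written works without changing dispatch(). A scheme nobody registered is still rejected.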
> ...is off topic. This is about handling input to any code you write.
It's about where to handle and validate input. Most data is going to be passed through multiple independent applications on separate machines, through networks with multiple middleboxes, etc.
A general premise that you should block anything you don't recognize is flawed. It requires that everything would have to understand everything about everything, or discard it. An FTP client with a whitelist of files you can transfer is doing it wrong.
> A general premise that you should block anything you don't recognize is flawed.
Yes, it's imperfect. Sorry, but life is hard.
The alternative is not blocking some of the things you don't recognize. That's not merely attack surface, it's willfully giving attackers a window of opportunity.
"Hmm, this looks unusual. It doesn't look like anything I've seen before. We should let it pass."
> All I need is one good counterexample.
The caution against trying to enumerate badness is obviously not some sort of mathematical or logical law. This is a heuristic based on several decades of experience. I don't give a damn if you can find a few places where the heuristic doesn't apply; history shows what has worked and what hasn't.
> spam
Not a security concern. This is about properly handling input data, not the admin or user policy of what happens to properly formatted data after it is successfully recognized (i.e. as an email, possibly with an attachment).
The same goes for "safe browsing". Which site to visit is the domain of admin policy or user request. The parsing of the data should be whitelisted by a defined grammar (which may not be a w3c/whatwg grammar).
> You often don't have good (or any) information about whether a given instance of a thing is malicious or not.
Correct. Which is why trying to maintain a blacklist of bad things ("enumerating badness") is destined to fail. Thank you for making my point for me.
Again, what we do know is what the software you're writing can handle. You seem to be advocating that we should accept data when it is known that it isn't handled properly. That's choosing to have at best a bug, at worst a security hole.
> have the handler be responsible for malicious input.
I'm really not concerned with your implementation details, though I do strongly recommend formally recognizing your input up front, because scattering the parsing around in different modules is extremely difficult to verify. It may be annoying to use a parser generator like yacc/bison, but they do allow you to prove that your input conforms to a defined grammar.
If you want to pass the handling off to another module that may support other URL schemes - that also properly rejects anything it cannot handle - then write that into your grammar. As I've said all along, this is about strongly defining what you accept. If your code accepts many different URL schemes, then define it that way and validate the input against that definition.
If you haven't, you should really watch the talk I linked to initially.
> Which is why trying to maintain a blacklist of bad things ("enumerating badness") is destined to fail.
Unfortunately it's also why "enumerating goodness" is destined to fail. It's like the instructions for securing a computer: Dig a deep hole, put the computer in the hole, throw a live grenade in the hole, and now the computer is secure.
It's not enough to be secure, it also has to do the thing the user wants it to do. If it doesn't then the users (and developers who are after the users' patronage) will figure out how to make it happen anyway, which means bypassing your validation one way or another.
The flaw is in the assumption that immediately denying something the user wants is more secure than not immediately denying something the user doesn't want. That assumption ignores the second-order effects.
If the thing you put in front of Alice to prevent Mallory from sending bad things also prevents Bob from sending good things, Alice and Bob are going to regard your validation as adversarial and get together to devise an alternate encoding for arbitrary data that will pass your validation. Which information theory says they can always do at the cost of some space inefficiency. But as soon as Alice starts accepting unvalidated data using that encoding method, it allows Mallory to send malicious data to Alice that will pass your validation.
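To make that concrete, here is a toy version. The middlebox rule, the host relay.example and the function names are all invented for the example; the only real API used is the standard base64 module.

```python
import base64
import re

# Hypothetical middlebox rule: only plain http URLs to one host with a
# "harmless-looking" path are allowed through.
VALID = re.compile(r"http://relay\.example/[A-Za-z0-9_-]{1,2048}")


def middlebox_allows(url: str) -> bool:
    return VALID.fullmatch(url) is not None


def bob_encode(payload: bytes) -> str:
    """Wrap arbitrary bytes so that they look like a whitelisted URL."""
    token = base64.urlsafe_b64encode(payload).rstrip(b"=").decode("ascii")
    return "http://relay.example/" + token


def alice_decode(url: str) -> bytes:
    """Recover the original bytes from the path component."""
    token = url.rsplit("/", 1)[1]
    padding = "=" * (-len(token) % 4)  # restore the stripped base64 padding
    return base64.urlsafe_b64decode(token + padding)
```

The middlebox sees a perfectly valid URL, Alice gets back whatever bytes Bob (or Mallory) put in, and the cost is the roughly 4/3 size overhead of base64.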
The solution is to do validation as soon as possible but no sooner. If you don't know what something is but it could be valid, you have to let it go so that the thing downstream which actually does know can make that determination itself.
I mean I get how we got here. Some things that should be doing validation don't do it well or at all, and then people try to put validation in front of them to make up for it. But if you do that and reject something the endpoint wants (or you're e.g. stubbornly enforcing an older protocol version) then new endpoint code is going to pay the cost of encoding around you, which is expensive in efficiency and complexity and deprives you of the ability to do the validation you are equipped to do.
If the downstream code isn't doing validation correctly then it has to be fixed where it is.
> If you haven't, you should really watch the talk I linked to initially.
I don't think anything I'm saying is strongly in conflict with it. You can validate against a grammar and still treat part of the data as a black box. Obvious example is validating an IP packet without assuming anything about the payload structure.
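A sketch of the IP case, assuming IPv4 and checking only the fields needed to find the payload plus the header checksum (a real stack checks more than this; the function names are my own):

```python
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum of an even-length byte string."""
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return total


def split_ipv4(packet: bytes):
    """Return (header, payload) if the IPv4 header is well-formed, else None."""
    if len(packet) < 20:
        return None
    version_ihl, _tos, total_length = struct.unpack("!BBH", packet[:4])
    version, ihl = version_ihl >> 4, version_ihl & 0x0F
    header_len = ihl * 4
    if version != 4 or ihl < 5:
        return None
    if total_length < header_len or total_length > len(packet):
        return None
    # A header with a correct checksum sums to 0xFFFF.
    if ones_complement_sum(packet[:header_len]) != 0xFFFF:
        return None
    # The payload is returned as opaque bytes; nothing here interprets it.
    return packet[:header_len], packet[header_len:total_length]
```

The header is validated strictly against its grammar, and the payload is passed on untouched for whatever sits above to validate.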
If a new URL scheme shows up that actually makes sense to be used with sites like these, then these sites will have to be updated anyway to support that scheme, at which point you can easily whitelist it.
> If a new URL scheme shows up that actually makes sense to be used with sites like these, then these sites will have to be updated anyway to support that scheme, at which point you can easily whitelist it.
If you aren't using a whitelist, and the URL handling is relying on the underlying platform and not application code, then a new URL scheme takes no changes to the application code.
That's precisely the danger that the whitelist is supposed to guard against. Just because the underlying platform can handle a URL type doesn't mean that it's safe for your software to accept that URL type. Using a blacklist instead of a whitelist means that what should be a safe update of the OS your software runs on can suddenly cause a security vulnerability in your app, even if you properly blacklisted every potentially-vulnerable URL scheme at the time your software was written.
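A toy illustration of the difference, with made-up policy sets and a made-up scheme name standing in for whatever the platform adds later:

```python
# Hypothetical policy sets, chosen only for illustration.
BLOCKED = {"file", "ftp"}    # blacklist: everything known to be risky at release
ALLOWED = {"http", "https"}  # whitelist: everything the app was tested with


def blacklist_allows(scheme: str) -> bool:
    return scheme.lower() not in BLOCKED


def whitelist_allows(scheme: str) -> bool:
    return scheme.lower() in ALLOWED


def reachable(platform_schemes, policy):
    """Schemes the app will actually hand to the platform under a policy."""
    return {s for s in platform_schemes if policy(s)}


at_release = {"http", "https", "file", "ftp"}
after_os_update = at_release | {"newscheme"}  # invented scheme added by the OS
```

At release both policies let through the same two schemes. After the platform update, reachable(after_os_update, blacklist_allows) also contains "newscheme" even though nobody touched the app, while the whitelist result is unchanged.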
> That's precisely the danger that the whitelist is supposed to guard against.
Be that as it may, the suggestion that there would be a need to update the code independent of the whitelist, and that the whitelist could be updated at the same time, is incorrect. The need to update is a cost of the choice to use a whitelist (maybe a justifiable cost, but certainly a cost.)
No, it's the cost of choosing to support a new URL scheme. You have to validate your app to make sure it makes sense to allow the use of the new URL scheme anyway; updating a whitelist should be pretty trivial. And you only pay the cost if a new URL scheme shows up that you actually want to support. Meanwhile the blacklist approach not only exposes you to security vulnerabilities, but imposes a cost every time the underlying platform adds support for a new URL type because now you have to update your blacklist to block it.
> You have to validate your app to make sure it makes sense to allow the use of the new URL scheme anyway
No, you don't, necessarily. A URL is a means of locating a resource; if your app makes sense for the kinds of resources and representations it handles independently of their origin, you don't need to validate anything about a URL scheme.
(The security problem with some file:// URLs is actually a completely different problem. It is not a question of whether the application makes sense with that scheme, which it does.)
> Meanwhile the blacklist approach not only exposes you to security vulnerabilities, but imposes a cost every time the underlying platform adds support for a new URL type because now you have to update your blacklist to block it.
No, you only have to update the blacklist if the new scheme should be blocked, and in many applications it shouldn't be. Whether this is a cost that is paid more often than whitelist driven updates depends on whether in the particular application it is more likely that a new URL scheme will be allowed or prohibited.
> No, you don't, necessarily. A URL is a means of locating a resource; if your app makes sense for the kinds of resources and representations it handles independently of their origin, you don't need to validate anything about a URL scheme.
Sure you do. You have to make sure the URL scheme doesn't allow access to data that should otherwise be prohibited. For example, I probably shouldn't be able to pass "ftp://localhost/etc/passwd" to your app. It's not just file:// that has the potential to be problematic.
> Whether this is a cost that is paid more often than whitelist driven updates depends on whether in the particular application it is more likely that a new URL scheme will be allowed or prohibited.
New URL schemes that become widely used on the internet are pretty rare. Usually new URL schemes are restricted to specific narrow use-cases, e.g. magnet: URIs being used for BitTorrent. But there are plenty of niche URL schemes that may or may not be supported by the underlying OS that don't really make sense for you to support (for example, does your markdown converter really want to handle dict: URIs?). The blacklist approach means you need to make sure you know of every single possible URL scheme that may possibly be supported, and evaluate every single one of them to determine if they should be blacklisted. The whitelist approach lets you only allow the schemes that you've determined are safe.
> The blacklist approach means you need to make sure you know of every single possible URL scheme that may possibly be supported, and evaluate every single one of them to determine if they should be blacklisted.
The whitelist approach requires the same thing, it's just that the consequences of getting it wrong are different.
If you don't blacklist something that you should then you could let through a security vulnerability.
If you don't whitelist something that you should then the developers of that software have to devise a way to disguise their software as something that is already whitelisted or be destroyed, which is even worse.
Because doing that is inefficient and complicated, which is the recipe for security vulnerabilities, and then you can't even blacklist it if you know you don't need it because it's specifically designed to parse as something on the whitelist.
You're really stretching here. If your markdown converter only accepts http and https, so what? That's all it was ever tested with, there's no reason to expect it to support some other niche URL scheme. In fact, in this entire discussion, I have yet to even think of another URL scheme that you would expect to be widely-supported by tools like this. With the whitelist approach, you don't need to consider all of the various URL schemes, you just need to say "is there anything besides http and https that I should support?", to which the easy answer is "probably not".
It seems you're answering your own question. Why are there no other popular URL schemes? Because too many things don't support generic schemes so any new ones are DOA.
Here's an example. Suppose I want to do content-addressable storage. I could create a new URI scheme like hash://[content hash] and then make some client software to register that scheme with the OS, and in theory lots of applications using the operating system's URI fetch API could seamlessly pick up support for that URI scheme. But not if too many applications do the thing you recommend.
So instead I write software to use http://127.1.0.1/[content hash] and then run a webserver on 127.1.0.1 that will fetch the data using the content hash and return it via HTTP. But then we're +1 entire webserver full of attack surface.
> guarantee that the input is valid with a parser generator
OK, that works really well... until you learn how much non-RFC-specified behavior is built in to web browsers. Simply building a parser to the RFC will leave you wide open to all sorts of nastiness!
The is_safe_url() internal function in Django is a bit of a historical dive into things we've learned about how browsers interpret (or, arguably, misinterpret) various types of oddball URLs.
I never said anything about limiting the parser to what's defined in an RFC. The acceptable input to "quirks mode" is just another (non-RFC) grammar, which still needs to be defined and validated.
Then I do wish you luck, but I don't think you'll ever be able to produce a suitably complete grammar since parts of it will require knowledge of undocumented proprietary internals of Internet Explorer.
Hence we scrape along doing our best with what we can figure out from observing behavior and collecting bug reports. But even with that, is_safe_url() is one of the most prone-to-security-issues functions in Django's codebase.
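To give a flavor of the kind of oddball involved, here is a sketch using the standard urllib.parse. This is not Django's code, both function names are invented, and the "stricter" version is still nowhere near complete.

```python
from urllib.parse import urlsplit


def naive_is_local(url: str) -> bool:
    """Flawed check: treats any URL without a scheme as a same-site path."""
    return urlsplit(url).scheme == ""


def stricter_is_local(url: str) -> bool:
    """Somewhat better, still incomplete: also reject hosts and backslashes."""
    if "\\" in url:
        return False  # browsers may treat a backslash like a forward slash
    parts = urlsplit(url)
    return parts.scheme == "" and parts.netloc == "" and parts.path.startswith("/")
```

A scheme-relative URL such as "//evil.example/login" has no scheme, so the naive check accepts it as local, yet a browser following it as a redirect goes to evil.example. The stricter check catches that one case, and there are plenty more where it came from.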
Hopefully the URL spec (https://url.spec.whatwg.org) is helpful here in finding other potentially unsafe behaviours that browsers have. Given that much of it seems to deal with the fact that urllib.urlparse doesn't match what browsers do in many, many ways, though, it's probably of limited help. (Nobody really implements it yet; it's just an attempt at standardising rough intersection semantics of what browsers currently do. Eventually, however, it should suffice, once legacy browsers die.)
WHATWG standards are generally formed by starting from what the 4 major browsers (Chrome, Firefox, IE (Edge), Safari) do. Anything that is done in common by all of them gets implemented no problem. It's when they all differ that the editor(s) tries to come up with more reasoned algorithms.
> WHATWG standards are generally formed by starting from what the 4 major browsers (Chrome, Firefox, IE (Edge), Safari) do.
I thought WHATWG standards are formed by starting with what the four major browser vendors agree to do, not what they currently do (though usually at least one has an implementation before something gets proposed for standardization.)
I do wonder; is there any browser that is actually full-RFC-specced? I checked a few (the mainstream desktop ones, but also links2 etc.), but so far they all seem to have glue to fix historical behavior.
Pretty much no, because it'd be practically useless. And I don't think anyone has the willingness to spend time or money on something that will essentially just be a toy.
There's been plenty of work on moving the standards so that there are actually implementations of them, instead of them being practically useless at best and misleading at worst (doing input validation based on a spec that nobody actually implements is just outright dangerous). HTML 5 and much of CSS 2.1 are leading that charge, though CSS 2.1 still has massive black holes: notably, table layout remains largely undefined, though that is finally being worked on.
[1] http://www.ranum.com/security/computer_security/editorials/d...
[2] https://media.ccc.de/v/28c3-4763-en-the_science_of_insecurit...