You seem to be assuming a lot about a development environment that was never specified. This is about writing software that handles input from an external, potentially hostile source. Parsing URLs that were supplied by the user is one example of that.
> Your code has to do something when it gets a URI
Yes, that's exactly my point. You need to define what your code will do with any URL - actually, any input, including input that is malformed or malicious - which includes both known and all possible future schemes.
For this specific example, the correct thing to do is recognize that e.g. your software only handles http{,s} URLs, so every other scheme should not be included in the recognized grammar. Any input outside that is invalid and dropped while dispatching any necessary error handling.
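A minimal sketch of that approach in Python, assuming a hypothetical application that only handles http and https (the accepted set and the error handling are placeholders for whatever your software actually supports):

```python
from urllib.parse import urlsplit

# The only schemes this hypothetical application knows how to handle.
ACCEPTED_SCHEMES = {"http", "https"}

def accept_url(raw: str) -> str:
    """Return the URL if it falls within the recognized grammar, else raise."""
    parts = urlsplit(raw)
    if parts.scheme.lower() not in ACCEPTED_SCHEMES:
        # Anything outside the defined grammar is invalid input: drop it
        # and let the caller dispatch whatever error handling it needs.
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if not parts.netloc:
        raise ValueError("missing host")
    return raw
```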
> third party code
...is off topic. This is about handling input to any code you write. Any 3rd parties also need to define what they accept as input.
> it will be prohibitively difficult for a new URI scheme (or what have you) to gain traction.
That is a separate problem that will always exist. You're trying to prematurely optimize in an insecure way. Worrying about potential future problems doesn't justify writing bad code today that passes hostile data without verification.
If you know that a URL scheme - or collection of schemes - will be handled properly, then define it as valid and pass it along. If it isn't handled or you don't know if it will be handled properly, define it as invalid and drop it. Doing otherwise is choosing to add a security hole. The same goes for every other byte of data received from a hostile source.
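For instance, a dispatch table makes the "known to be handled properly" set explicit; anything without an entry is dropped. This is only a sketch, and the handler names are hypothetical:

```python
def handle_http(url: str) -> None:
    ...  # fetch over HTTP/HTTPS

def handle_ftp(url: str) -> None:
    ...  # transfer over FTP

# The whitelist *is* the set of keys: a scheme is valid only if we know
# exactly which code will handle it.
HANDLERS = {"http": handle_http, "https": handle_http, "ftp": handle_ftp}

def dispatch(url: str) -> None:
    scheme = url.split(":", 1)[0].lower()
    handler = HANDLERS.get(scheme)
    if handler is None:
        raise ValueError(f"dropping URL with unhandled scheme {scheme!r}")
    handler(url)
```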
> You seem to be assuming a lot about a development environment that was never specified.
The position you've staked out is "stop trying to enumerate badness." All I need is one good counterexample.
For example, Google Safe Browsing maintains a blacklist of malicious domains that clients can check. Are you suggesting that they should whitelist domains instead? What about subdomains? IP addresses?
How about email addresses for spam filtering?
You often don't have good (or any) information about whether a given instance of a thing is malicious or not. Blocking all such things also blocks the innocent things. In some contexts that's a cost you have to pay, but as a general rule it's not something you want.
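To make the Safe Browsing example concrete, here is a hypothetical sketch of a blacklist lookup; the subdomain question alone forces a design decision, and nothing here can catch a malicious domain the list doesn't know about yet:

```python
BLACKLIST = {"evil.example", "malware.test"}  # hypothetical feed contents

def is_blocked(host: str) -> bool:
    """Check the host and every parent domain against the blacklist."""
    labels = host.lower().rstrip(".").split(".")
    return any(".".join(labels[i:]) in BLACKLIST for i in range(len(labels)))

# is_blocked("cdn.evil.example") -> True, but the same host reached via
# a raw IP address or an alternate spelling sails straight through.
```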
> Yes, that's exactly my point. You need to define what your code will do with any URL - actually, any input, including input that is malformed or malicious - which includes both known and all possible future schemes.
You have to define what your code will do, but what it should do is the original question.
> For this specific example, the correct thing to do is recognize that e.g. your software only handles http{,s} URLs, so every other scheme should not be included in the recognized grammar.
That's just assuming the conclusion. You could also use a grammar that accepts any RFC3986-compliant URI that has a handler available for its scheme, and have the handler be responsible for malicious input.
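A sketch of that alternative: the front end does only a generic syntactic split (urlsplit is an approximation of RFC 3986 parsing, not a strict validator) and each installed handler owns the validation of its own scheme. The registry and handler below are hypothetical:

```python
from urllib.parse import urlsplit

SCHEME_HANDLERS = {}  # scheme -> callable, populated by whatever is installed

def register(scheme):
    def wrap(fn):
        SCHEME_HANDLERS[scheme] = fn
        return fn
    return wrap

@register("https")
def handle_https(parts):
    # This handler, not the front end, decides what a safe https URL is.
    if not parts.netloc:
        raise ValueError("https requires a host")
    ...

def dispatch(raw: str):
    parts = urlsplit(raw)  # generic split only; no scheme whitelist here
    handler = SCHEME_HANDLERS.get(parts.scheme.lower())
    if handler is None:
        raise ValueError(f"no handler installed for {parts.scheme!r}")
    return handler(parts)
```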
> ...is off topic. This is about handling input to any code you write.
It's about where to handle and validate input. Most data is going to be passed through multiple independent applications on separate machines, through networks with multiple middleboxes, etc.
A general premise that you should block anything you don't recognize is flawed. It would require everything to understand everything about everything, or discard it. An FTP client with a whitelist of files you can transfer is doing it wrong.
> A general premise that you should block anything you don't recognize is flawed.
Yes, it's imperfect. Sorry, but life is hard.
The alternative is not blocking some of the things you don't recognize. That's not merely attack surface, it's willfully giving attackers a window of opportunity.
"Hmm, this looks unusual. It doesn't look like anything I've seen before. We should let it pass."
> All I need is one good counterexample.
The caution against trying to enumerate badness is obviously not some sort of mathematical or logical law. It's a heuristic based on several decades of experience. I don't give a damn if you can find a few places where the heuristic doesn't apply; history shows what has worked and what hasn't.
> spam
Not a security concern. This is about properly handling input data, not the admin or user policy of what happens to properly formatted data after it is successfully recognized (i.e. as an email, possibly with an attachment).
The same goes for "safe browsing". Which site to visit is the domain of admin policy or user request. The parsing of the data should be whitelisted by a defined grammar (which may not be a W3C/WHATWG grammar).
> You often don't have good (or any) information about whether a given instance of a thing is malicious or not.
Correct. Which is why trying to maintain a blacklist of bad things ("enumerating badness") is destined to fail. Thank you for making my point for me.
Again, what we do know is what the software you're writing can handle. You seem to be advocating that we should accept data when it is known that it isn't handled properly. That's choosing to have at best a bug, at worst a security hole.
> have the handler be responsible for malicious input.
I'm really not concerned with your implementation details, though I do strongly recommend formally recognizing your input up front, because scattering the parsing around in different modules is extremely difficult to verify. It may be annoying to use a parser generator like yacc/bison, but they do allow you to prove that your input conforms to a well-defined grammar.
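The same idea works without a parser generator: anchor one recognizer over the entire input before any other code touches it. The pattern below is a deliberately tiny, hypothetical grammar, nowhere near full RFC 3986, but it shows recognizing the whole input in one place rather than scattering checks across modules:

```python
import re

# One anchored pattern is the whole accepted grammar; fullmatch() means
# unrecognized bytes cannot slip past the recognizer.
URL_GRAMMAR = re.compile(
    r"https?://"                 # scheme
    r"[A-Za-z0-9.-]+"            # host (simplified)
    r"(?::\d{1,5})?"             # optional port
    r"(?:/[A-Za-z0-9._~%/-]*)?"  # optional path (simplified)
)

def recognize(raw: str) -> str:
    if URL_GRAMMAR.fullmatch(raw) is None:
        raise ValueError("input is not in the accepted grammar")
    return raw
```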
If you want to pass the handling off to another module that may support other URL schemes - one that also properly rejects anything it cannot handle - then write that into your grammar. As I've said all along, this is about strongly defining what you accept. If your code accepts many different URL schemes, then define it that way and validate the input against that definition.
If you haven't, you should really watch the talk I linked to initially.
> Which is why trying to maintain a blacklist of bad things ("enumerating badness") is destined to fail.
Unfortunately it's also why "enumerating goodness" is destined to fail. It's like the instructions for securing a computer: Dig a deep hole, put the computer in the hole, throw a live grenade in the hole, and now the computer is secure.
It's not enough to be secure, it also has to do the thing the user wants it to do. If it doesn't then the users (and developers who are after the users' patronage) will figure out how to make it happen anyway, which means bypassing your validation one way or another.
The flawed assumption is that immediately denying something the user wants is more secure than not immediately denying something the user doesn't want. The second-order effects show why that's wrong.
If the thing you put in front of Alice to prevent Mallory from sending bad things also prevents Bob from sending good things, Alice and Bob are going to regard your validation as adversarial and get together to devise an alternate encoding for arbitrary data that will pass your validation. Information theory says they can always do that, at the cost of some space inefficiency. But as soon as Alice starts accepting unvalidated data using that encoding method, it allows Mallory to send malicious data to Alice that will pass your validation.
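A toy illustration of that tunneling, assuming a hypothetical filter that only passes alphanumeric strings:

```python
import base64

def strict_validator(msg: str) -> bool:
    # The hypothetical filter in front of Alice: letters and digits only.
    return msg.isalnum()

# Bob encodes arbitrary bytes into the allowed alphabet...
payload = b"\x00arbitrary bytes, benign or malicious\xff"
encoded = base64.b32encode(payload).decode().rstrip("=")
assert strict_validator(encoded)  # passes the filter

# ...and Alice decodes on the other side, restoring any stripped padding.
padded = encoded + "=" * (-len(encoded) % 8)
assert base64.b32decode(padded) == payload
```

Once Alice and Bob settle on this, the validator sees only well-formed alphanumeric strings while arbitrary data, Mallory's included, flows through it.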
The solution is to do validation as soon as possible but no sooner. If you don't know what something is but it could be valid, you have to let it go so that the thing downstream which actually does know can make that determination itself.
I mean I get how we got here. Some things that should be doing validation don't do it well or at all, and then people try to put validation in front of them to make up for it. But if you do that and reject something the endpoint wants (or you're e.g. stubbornly enforcing an older protocol version) then new endpoint code is going to pay the cost of encoding around you, which is expensive in efficiency and complexity and deprives you of the ability to do the validation you are equipped to do.
If the downstream code isn't doing validation correctly then it has to be fixed where it is.
> If you haven't, you should really watch the talk I linked to initially.
I don't think anything I'm saying is strongly in conflict with it. You can validate against a grammar and still treat part of the data as a black box. Obvious example is validating an IP packet without assuming anything about the payload structure.
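A sketch of exactly that, using a simplified IPv4 header check: the header fields are validated against the grammar this layer knows, and the payload stays opaque for whatever layer handles it next.

```python
import struct

def parse_ipv4(packet: bytes) -> tuple[dict, bytes]:
    """Validate the IPv4 header; return (header fields, opaque payload)."""
    if len(packet) < 20:
        raise ValueError("truncated header")
    (ver_ihl, _tos, total_len, _ident, _frag,
     ttl, proto, _cksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    version, ihl = ver_ihl >> 4, (ver_ihl & 0x0F) * 4
    if version != 4 or ihl < 20 or total_len < ihl or len(packet) < total_len:
        raise ValueError("not a well-formed IPv4 packet")
    header = {"ttl": ttl, "proto": proto, "src": src, "dst": dst}
    # The payload is a black box at this layer; whoever handles `proto`
    # downstream validates it against its own grammar.
    return header, packet[ihl:total_len]
```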