If your code accepts URIs as input, filter out “file://” (steve.fi)
370 points by stevekemp on Sept 12, 2016 | 157 comments



Wrong way around: only allow http:// and https:// (and generally filtering out anything that's not letters, numbers, slash or dot is probably a good idea). Remove any sequences of more than one slash or dot.


Exactly.

Whitelist only trusted schemes, do not wait to blacklist untrusted.

I wrote the Go HTML sanitizer: https://github.com/microcosm-cc/bluemonday and have a rule for user generated (untrusted) content that basically does whitelist just the things that one can trust: https://github.com/microcosm-cc/bluemonday/blob/master/helpe...

That states that URIs must be:

1. Parseable

2. Relative

3. Or one of: mailto http https

4. And that I will add rel="nofollow" to external links, and additionally I'll add rel="noopener" if the link has a target="_blank" attribute

Oh, and I do not trust Data URIs either.
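
For readers who want to see the shape of those rules, here is a minimal Python sketch. It is not bluemonday's code (which is Go); the names `ALLOWED_SCHEMES` and `is_allowed_link` are made up for illustration, and only the rule set (parseable, relative, or one of mailto/http/https) comes from the comment above.

```python
from urllib.parse import urlsplit

# Schemes named in the comment above; everything else (file, data, ...) is refused.
ALLOWED_SCHEMES = frozenset({"mailto", "http", "https"})


def is_allowed_link(uri: str) -> bool:
    """Return True only for relative URIs or URIs with an allowlisted scheme."""
    try:
        parts = urlsplit(uri.strip())
    except ValueError:
        # Rule 1: anything that does not parse is rejected.
        return False
    if parts.scheme == "":
        # Rule 2: no scheme means the reference is relative.
        # Note that this also lets protocol-relative "//host/path" links through.
        return True
    # Rule 3: urlsplit lowercases the scheme, so a membership test is enough.
    return parts.scheme in ALLOWED_SCHEMES
```

Because the check is an allowlist, `data:` and `file:` URIs are refused without ever being named.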


Might want to add tel to the whitelist. It works in roughly the same way as mailto but interfaces with telephone apps instead of email clients.


This is the default user-generated policy, others are able to tweak and adjust using policy rules, e.g.:

    p.AllowURLSchemes("tel")
I chose conservative and safe defaults, not everyone wishes to whitelist telephone links.


Can we please stop trying to enumerate badness[1]? When parsing input it is possible to define the set of valid input, not all possible invalid inputs.

Also, anybody accepting input from an untrusted source (such as anything from a network or the user) that isn't verifying the data with a formal recognizer is doing it wrong[2]. Instead of writing another weird machine, guarantee that the input is valid with a parser generator (or whatever): recognize the input and drop anything even slightly invalid.

[1] http://www.ranum.com/security/computer_security/editorials/d...

[2] https://media.ccc.de/v/28c3-4763-en-the_science_of_insecurit...
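
A rough idea of what "recognize, then drop anything invalid" could look like, using a regular expression as a stand-in for a generated parser. The grammar here is an assumption chosen for the example (lowercase http/https, ASCII host labels, optional port, a conservative path alphabet), and `recognize` is a hypothetical name.

```python
import re

# A deliberately small grammar: scheme, ASCII host labels, optional port, optional path.
# Input the grammar does not describe is rejected outright, never "repaired".
_LABEL = r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?"
_URL = re.compile(
    rf"https?://{_LABEL}(?:\.{_LABEL})*(?::[0-9]{{1,5}})?(?:/[A-Za-z0-9._~%/-]*)?"
)


def recognize(url: str) -> bool:
    """Return True only if the whole input matches the grammar."""
    # fullmatch, not match/search: trailing junk such as "\n" makes the input invalid.
    return _URL.fullmatch(url) is not None
```

All of the acceptance logic sits in one place, which is the property the comment above is asking for; a real deployment would likely need a richer grammar.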


> Can we please stop trying to enumerate badness[1]?

No. Because we don't know what goodness looks like.

The world can be separated into good, bad and unknown. If you classify anything unknown as bad then anything new is DOA. People aren't going to add new things to the whitelist before they're popular which means they can never become popular. It's stasis.

But people do that anyway, which makes the good guys have to adopt the MO of the bad guys and make the new thing look like the existing thing. So everything uses HTTP and everything looks the same.

Which means everything is more complicated than it needs to be, because it has to pretend to be something else, which creates more attack surface.

And which means the whitelist is no longer meaningful because allow-http becomes equivalent to allow-everything.

It's like buying a car that can only drive to home and work on the theory that it will be safer. It will be at first, except that you can no longer go anywhere but home and work. But when enough people do that then everything (including the bad stuff) has to move to where people are allowed to go. Which puts you right back where you started except that now you have two problems.


> Because we don't know what goodness looks like.

You're writing the parser, so you define the set of acceptable input.

> The world can be separated into good, bad and unknown

The data your software receives as input can be separated into valid input that your software will correctly interpret, or invalid input that is either an error or an attack.

There shouldn't ever be any "unknown" input, as that would imply you don't know how your software parses its input. As the ccc talk in my previous [2] explains, this may be true if recognition of input is scattered across your software and thus hard to understand as a complete grammar. Thus the recommendation to put it all in one place using a parser generator (or whatever).

> If you classify anything unknown as bad then anything new is DOA.

Anything unknown is by definition not properly supported by the software you're writing.


> Anything unknown is by definition not properly supported by the software you're writing.

This seems to be where you're going wrong. There is no god-mode where you can see the whole universe and perfectly predict everything that will happen in the future.

Your code has to do something when it gets a URI for a scheme that didn't exist when you wrote your code. The handler for that URI is third party code. Your code can either pass the URI to the registered handler or not.

And if the answer is "not" then it will be prohibitively difficult for a new URI scheme (or what have you) to gain traction. Which means every new thing has to be shoehorned into HTTP and HTTP becomes an ever larger and more complicated attack surface.


You seem to be assuming a lot about a development environment that was never specified. This is about writing software that handles input from an external, potentially hostile source. Parsing URLs that were supplied by the user is one example of that.

> Your code has to do something when it gets a URI

Yes, that's exactly my point. You need to define what your code will do with any URL - actually, any input, including input that is malformed or malicious - which includes both known and all possible future schemes.

For this specific example, the correct thing to do is recognize that e.g. your software only handles http{,s} URLs, so every other scheme should not be included in the recognized grammar. Any input outside that is invalid and dropped while dispatching any necessary error handling.

> third party code

...is off topic. This is about handling input to any code you write. Any 3rd parties also need to define what they accept as input.

> it will be prohibitively difficult for a new URI scheme (or what have you) to gain traction.

That is a separate problem that will always exist. You're trying to prematurely optimize in an insecure way. Worrying about potential future problems doesn't justify writing bad code today that passes hostile data without verification.

If you know that a URL scheme - or collection of schemes - will be handled properly, then define it as valid and pass it along. If it isn't handled or you don't know if it will be handled properly, define it as invalid and drop it. Doing otherwise is choosing to add a security hole. The same goes for every other byte of data received from a hostile source.


> You seem to be assuming a lot about a development environment that was never specified.

The position you've staked out is "stop trying to enumerate badness." All I need is one good counterexample.

For example, Google Safe Browsing maintains a blacklist of malicious domains that clients can check. Are you suggesting that they should whitelist domains instead? What about subdomains? IP addresses?

How about email addresses for spam filtering?

You often don't have good (or any) information about whether a given instance of a thing is malicious or not. Blocking all such things also blocks the innocent things. In some contexts that's a cost you have to pay, but as a general rule it's not something you want.

> Yes, that's exactly my point. You need to define what your code will do with any URL - actually, any input, including input that is malformed or malicious - which includes both known and all possible future schemes.

You have to define what your code will do, but what it should do is the original question.

> For this specific example, the correct thing to do is recognize that e.g. your software only handles http{,s} URLs, so every other scheme should not be included in the recognized grammar.

That's just assuming the conclusion. You could also use a grammar that accepts any RFC3986-compliant URI that has a handler available for its scheme, and have the handler be responsible for malicious input.
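
One way to picture the alternative described in the paragraph above: accept any URI whose scheme has a registered handler and leave the rest of the validation to that handler. This is only a sketch; `HANDLERS`, `register` and `dispatch` are hypothetical names, and a real platform would use its own registration mechanism.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit

# Hypothetical registry: scheme -> handler supplied by whoever owns that scheme.
HANDLERS: Dict[str, Callable[[str], str]] = {}


def register(scheme: str, handler: Callable[[str], str]) -> None:
    """Make a handler available for a scheme (stored in lowercase)."""
    HANDLERS[scheme.lower()] = handler


def dispatch(uri: str) -> str:
    """Pass the URI to the handler registered for its scheme, or refuse it."""
    scheme = urlsplit(uri).scheme  # already lowercased by urlsplit
    handler = HANDLERS.get(scheme)
    if handler is None:
        raise ValueError(f"no handler registered for scheme {scheme!r}")
    # Validating everything after the scheme is the handler's responsibility.
    return handler(uri)
```

Whether the handler can be trusted with hostile input is exactly what the two sides of this thread disagree about; the sketch only shows where the decision point sits.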

> ...is off topic. This is about handling input to any code you write.

It's about where to handle and validate input. Most data is going to be passed through multiple independent applications on separate machines, through networks with multiple middleboxes, etc.

A general premise that you should block anything you don't recognize is flawed. It requires that everything would have to understand everything about everything, or discard it. An FTP client with a whitelist of files you can transfer is doing it wrong.


> A general premise that you should block anything you don't recognize is flawed.

Yes, it's imperfect. Sorry, but life is hard.

The alternative is not blocking some of the things you don't recognize. That's not merely attack surface, it's willfully giving attackers a window of opportunity.

"Hmm, this looks unusual. It doesn't look like anything I've seen before. We should let it pass."

> All I need is one good counterexample.

The caution against trying to enumerate badness is obviously not some sort of mathematical or logical law. This is a heuristic based on several decades of experience. I don't give a damn if you can find a few places where the heuristic doesn't apply; history shows what has worked and what hasn't.

> spam

Not a security concern. This is about properly handling input data, not the admin or user policy of what happens to properly formatted data after it is successfully recognized (i.e. as an email, possibly with an attachment).

The same goes for "safe browsing". Which site to visit is the domain of admin policy or user request. The parsing of the data should be whitelisted by a defined grammar (which may not be a w3c/whatwg grammar).

> You often don't have good (or any) information about whether a given instance of a thing is malicious or not.

Correct. Which is why trying to maintain a blacklist of bad things ("enumerating badness") is destined to fail. Thank you for making my point for me.

Again, what we do know is what the software you're writing can handle. You seem to be advocating that we should accept data when it is known that it isn't handled properly. That's choosing to have at best a bug, at worst a security hole.

> have the handler be responsible for malicious input.

I'm really not concerned with your implementation details, though I do strongly recommend formally recognizing your input up front, because scattering the parsing around in different modules is extremely difficult to verify. It may be annoying to use a parser generator like yacc/bison, but they do allow you to prove that your input matches a defined grammar.

If you want to pass the handling off to another module that may support other URL schemes - that also properly rejects anything it cannot handle - then write that into your grammar. As I've said all along, this is about strongly defining what you accept. If your code accepts many different URL schemes, then define it that way and validate the input against that definition.

If you haven't, you should really watch the talk I linked to initially.


> Which is why trying to maintain a blacklist of bad things ("enumerating badness") is destined to fail.

Unfortunately it's also why "enumerating goodness" is destined to fail. It's like the instructions for securing a computer: Dig a deep hole, put the computer in the hole, throw a live grenade in the hole, and now the computer is secure.

It's not enough to be secure, it also has to do the thing the user wants it to do. If it doesn't then the users (and developers who are after the users' patronage) will figure out how to make it happen anyway, which means bypassing your validation one way or another.

The flaw is in the assumption that immediately denying something the user wants is more secure than not immediately denying something the user doesn't want; that assumption ignores the second order effects.

If the thing you put in front of Alice to prevent Mallory from sending bad things also prevents Bob from sending good things, Alice and Bob are going to regard your validation as adversarial and get together to devise an alternate encoding for arbitrary data that will pass your validation. Which information theory says they can always do at the cost of some space inefficiency. But as soon as Alice starts accepting unvalidated data using that encoding method, it allows Mallory to send malicious data to Alice that will pass your validation.
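
A toy version of the Alice/Bob/Mallory scenario, as a sketch. The validator rule (lowercase letters and digits only) and the function names are invented for the example; hex is just one encoding that happens to fit inside that alphabet.

```python
import re

# Hypothetical middlebox rule: only lowercase letters and digits may pass.
_ALLOWED = re.compile(r"[a-z0-9]+")


def passes_validation(message: str) -> bool:
    """The check sitting in front of Alice."""
    return _ALLOWED.fullmatch(message) is not None


def encode(payload: bytes) -> str:
    # Hex uses only 0-9 and a-f, so any non-empty payload passes, at twice the size.
    return payload.hex()


def decode(message: str) -> bytes:
    # What Alice runs after the validator has waved the message through.
    return bytes.fromhex(message)
```

The validator still "works", but once Alice decodes whatever arrives it no longer constrains what reaches her, and Mallory can use the same encoding.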

The solution is to do validation as soon as possible but no sooner. If you don't know what something is but it could be valid, you have to let it go so that the thing downstream which actually does know can make that determination itself.

I mean I get how we got here. Some things that should be doing validation don't do it well or at all, and then people try to put validation in front of them to make up for it. But if you do that and reject something the endpoint wants (or you're e.g. stubbornly enforcing an older protocol version) then new endpoint code is going to pay the cost of encoding around you, which is expensive in efficiency and complexity and deprives you of the ability to do the validation you are equipped to do.

If the downstream code isn't doing validation correctly then it has to be fixed where it is.

> If you haven't, you should really watch the talk I linked to initially.

I don't think anything I'm saying is strongly in conflict with it. You can validate against a grammar and still treat part of the data as a black box. Obvious example is validating an IP packet without assuming anything about the payload structure.
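
The IP packet example can be sketched like this: check the parts of the header that this layer understands and hand the payload on untouched. The function name is hypothetical, and the sketch skips the header checksum and IP options for brevity.

```python
import struct
from typing import Tuple


def split_ipv4_packet(packet: bytes) -> Tuple[int, bytes]:
    """Validate the IPv4 header and return (protocol number, opaque payload)."""
    if len(packet) < 20:
        raise ValueError("shorter than the minimum IPv4 header")
    # Byte 0: version and header length; byte 1: skipped; bytes 2-3: total length.
    version_ihl, total_length = struct.unpack_from("!BxH", packet, 0)
    version, ihl = version_ihl >> 4, version_ihl & 0x0F
    if version != 4:
        raise ValueError("not an IPv4 packet")
    header_len = ihl * 4
    if ihl < 5 or header_len > len(packet):
        raise ValueError("bad header length")
    if total_length < header_len or total_length > len(packet):
        raise ValueError("total length disagrees with the data received")
    protocol = packet[9]
    # The payload is returned as-is: its structure is some other layer's grammar.
    return protocol, packet[header_len:total_length]
```

The header is held to a strict grammar while the payload stays a black box for whatever sits downstream.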


If a new URL scheme shows up that actually makes sense to be used with sites like these, then these sites will have to be updated anyway to support that scheme, at which point you can easily whitelist it.


> If a new URL scheme shows up that actually makes sense to be used with sites like these, then these sites will have to be updated anyway to support that scheme, at which point you can easily whitelist it.

If you aren't using a whitelist, and the URL handling is relying on the underlying platform and not application code, then a new URL scheme takes no changes to the application code.


That's precisely the danger that the whitelist is supposed to guard against. Just because the underlying platform can handle a URL type doesn't mean that it's safe for your software to accept that URL type. Using a blacklist instead of a whitelist means that what should be a safe update of the OS your software runs on can suddenly cause a security vulnerability in your app, even if you properly blacklisted every potentially-vulnerable URL scheme at the time your software was written.
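
To make the failure mode concrete, here is a small sketch comparing the two checks when the platform learns a scheme the application author never heard of. The scheme name `newscheme` and both function names are made up for the example.

```python
from urllib.parse import urlsplit

BLOCKED = {"file"}           # everything known to be dangerous when the app shipped
ALLOWED = {"http", "https"}  # everything the app was actually tested with


def blocklist_accepts(uri: str) -> bool:
    """Accept anything whose scheme is not explicitly blocked."""
    return urlsplit(uri).scheme not in BLOCKED


def allowlist_accepts(uri: str) -> bool:
    """Accept only schemes that were explicitly approved."""
    return urlsplit(uri).scheme in ALLOWED


# Suppose an OS update teaches the underlying fetcher a new scheme.
uri = "newscheme://localhost/secret"
```

Here `blocklist_accepts(uri)` is true and `allowlist_accepts(uri)` is false, even though neither list changed after the update.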


> That's precisely the danger that the whitelist is supposed to guard against.

Be that as it may, the suggestion that there would be a need to update the code independent of the whitelist, and that the whitelist could be updated at the same time, is incorrect. The need to update is a cost of the choice to use a whitelist (maybe a justifiable cost, but certainly a cost.)


No, it's the cost of choosing to support a new URL scheme. You have to validate your app to make sure it makes sense to allow the use of the new URL scheme anyway; updating a whitelist should be pretty trivial. And you only pay the cost if a new URL scheme shows up that you actually want to support. Meanwhile the blacklist approach not only exposes you to security vulnerabilities, but imposes a cost every time the underlying platform adds support for a new URL type because now you have to update your blacklist to block it.


> You have to validate your app to make sure it makes sense to allow the use of the new URL scheme anyway

No, you don't, necessarily. A URL is a means of locating a resource; if your app makes sense for the kinds of resources and representations it handles independently of their origin, you don't need to validate anything about a URL scheme.

(The security problem with some file:// URLs is actually a completely different problem; it is not a question of whether the application makes sense with that scheme, which it does.)

> Meanwhile the blacklist approach not only exposes you to security vulnerabilities, but imposes a cost every time the underlying platform adds support for a new URL type because now you have to update your blacklist to block it.

No, you only have to update the blacklist if it should be blocked. In many applications. Whether this is a cost that is paid more often than whitelist driven updates depends on whether in the particular application it is more likely that a new URL scheme will be allowed or prohibited.


> No, you don't, necessarily. A URL is a means of locating a resource; if your app makes sense for the kinds of resources and representations it handles independently of their origin, you don't need to validate anything about a URL scheme.

Sure you do. You have to make sure the URL scheme doesn't allow access to data that should otherwise be prohibited. For example, I probably shouldn't be able to pass "ftp://localhost/etc/passwd" to your app. It's not just file:// that has the potential to be problematic.

> Whether this is a cost that is paid more often than whitelist driven updates depends on whether in the particular application it is more likely that a new URL scheme will be allowed or prohibited.

New URL schemes that become widely used on the internet are pretty rare. Usually new URL schemes are restricted to specific narrow use-cases, e.g. magnet: URIs being used for BitTorrent. But there are plenty of niche URL schemes that may or may not be supported by the underlying OS that don't really make sense for you to support (for example, does your markdown converter really want to handle dict: URIs?). The blacklist approach means you need to make sure you know of every single possible URL scheme that may possibly be supported, and evaluate every single one of them to determine if they should be blacklisted. The whitelist approach lets you only allow the schemes that you've determined are safe.


> The blacklist approach means you need to make sure you know of every single possible URL scheme that may possibly be supported, and evaluate every single one of them to determine if they should be blacklisted.

The whitelist approach requires the same thing, it's just that the consequences of getting it wrong are different.

If you don't blacklist something that you should then you could let through a security vulnerability.

If you don't whitelist something that you should then the developers of that software have to devise a way to disguise their software as something that is already whitelisted or be destroyed, which is even worse.

Because doing that is inefficient and complicated, which is the recipe for security vulnerabilities, and then you can't even blacklist it if you know you don't need it because it's specifically designed to parse as something on the whitelist.


You're really stretching here. If your markdown converter only accepts http and https, so what? That's all it was ever tested with, there's no reason to expect it to support some other niche URL scheme. In fact, in this entire discussion, I have yet to even think of another URL scheme that you would expect to be widely-supported by tools like this. With the whitelist approach, you don't need to consider all of the various URL schemes, you just need to say "is there anything besides http and https that I should support?", to which the easy answer is "probably not".


It seems you're answering your own question. Why are there no other popular URL schemes? Because too many things don't support generic schemes so any new ones are DOA.

Here's an example. Suppose I want to do content-addressable storage. I could create a new URI scheme like hash://[content hash] and then make some client software to register that scheme with the OS, and in theory lots of applications using the operating system's URI fetch API could seamlessly pick up support for that URI scheme. But not if too many applications do the thing you recommend.

So instead I write software to use http://127.1.0.1/[content hash] and then run a webserver on 127.1.0.1 that will fetch the data using the content hash and return it via HTTP. But then we're +1 entire webserver full of attack surface.


> guarantee that the input is valid with a parser generator

OK, that works really well... until you learn how much non-RFC-specified behavior is built in to web browsers. Simply building a parser to the RFC will leave you wide open to all sorts of nastiness!

The is_safe_url() internal function in Django is a bit of a historical dive into things we've learned about how browsers interpret (or, arguably, misinterpret) various types of oddball URLs:

https://github.com/django/django/blob/master/django/utils/ht...
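
To give a flavour of the kind of checks involved, here is a hedged Python sketch. It is not Django's implementation; `looks_safe_to_redirect` is a hypothetical helper, and the browser quirks mentioned in the comments are the sort of behaviour the linked function reportedly guards against, not an exhaustive list.

```python
from urllib.parse import urlsplit


def looks_safe_to_redirect(url: str, allowed_hosts: set) -> bool:
    """Accept only same-site paths or http(s) URLs on an allowed (lowercase) host."""
    url = url.strip()
    if not url:
        return False
    # Browsers tend to treat "\" like "/", so "/\evil.example" acts like "//evil.example".
    url = url.replace("\\", "/")
    # Three or more leading slashes may still be read as an absolute URL by a browser.
    if url.startswith("///"):
        return False
    # Control characters may be skipped or reinterpreted, so refuse them outright.
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in url):
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    if parts.scheme and parts.scheme not in ("http", "https"):
        return False
    if parts.scheme and not parts.netloc:
        # e.g. "http:///evil.example": a scheme with an empty authority.
        return False
    return parts.netloc == "" or parts.hostname in allowed_hosts
```

A parser written strictly to the RFC would classify several of the rejected inputs as harmless relative paths, which is the gap the comment above is pointing at.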


> non-RFC-specified behavior

I never said anything about limiting the parser to what's defined in an RFC. The acceptable input to "quirks mode" is just another (non-RFC) grammar, which still needs to be defined and validated.


Then I do wish you luck, but I don't think you'll ever be able to produce a suitably complete grammar since parts of it will require knowledge of undocumented proprietary internals of Internet Explorer.

Hence we scrape along doing our best with what we can figure out from observing behavior and collecting bug reports. But even with that, is_safe_url() is one of the most prone-to-security-issues functions in Django's codebase.


Hopefully the URL spec (https://url.spec.whatwg.org) is helpful here in finding other potentially unsafe behaviours that browsers have, though given much of it seems to be dealing with the fact that urllib.urlparse doesn't match what browsers do in many, many ways it's probably of limited help. (Nobody really implements it yet; it's just an attempt at standardising rough intersection semantics of what browsers currently do. Eventually, however, it should suffice, once legacy browsers eventually die.)


That URL spec is just "this is what chrome does, everyone repeat that".

They’re unwilling to modify anything, or standardize anything, but just want to cement the current piece of shit that URL parsing is for the future.


WHATWG standards are generally formed by starting from what the 4 major browsers (Chrome, Firefox, IE (Edge), Safari) do. Anything that is done in common by all of them gets implemented no problem. It's when they all differ that the editor(s) tries to come up with more reasoned algorithms.


> WHATWG standards are generally formed by starting from what the 4 major browsers (Chrome, Firefox, IE (Edge), Safari) do.

I thought WHATWG standards are formed by starting with what the four major browser vendors agree to do, not what they currently do (though usually at least one has an implementation before something gets proposed for standardization.)


Which is not really ideal.

Standards aren’t about documenting what is, but about defining what will be.


Given Chrome Canary currently fails a large number of tests, it seems like it's hardly just "this is what chrome does, everyone repeat that".


Because the standard was changed to clean up a bit of the stuff Google did.

But WHATWG only changes standards to include more, never to include less.


I do wonder; is there any browser that is actually full-RFC-specced? I checked a few (the mainstream desktop ones, but also links2 etc.), but so far they all seem to have glue to fix historical behavior.


Pretty much no, because it'd be practically useless. And I don't think anyone has the willingness to spend time or money on something that will essentially just be a toy.

There's been plenty of work on moving the standards so that there are actually implementations of them, instead of them being practically useless at best and misleading at worst (given doing input validation based on a spec that nobody actually implements is just outright dangerous), with HTML 5 and much of CSS 2.1 leading that charge (though CSS 2.1 still has massive blackholes, notably table layout remains largely undefined, though that is finally being worked on).


Exception: passwords. Do not enumerate goodness when accepting a new password.


Password checking could be so easy ...

if(password.size() < 24) "Your password sucks! Choose a longer one"

Update each year to stay ahead of faster computer speeds.


Substitute password with passphrase, mention hard limit (24 characters) and suggest using a sentence instead of word(s)
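
Spelled out as runnable Python, the suggestion might look like the sketch below. The 24-character minimum is the figure from the comments above, not a standard, and `passphrase_problem` is a hypothetical name.

```python
from typing import Optional

MIN_LENGTH = 24  # the figure suggested above; an assumption, not a standard


def passphrase_problem(passphrase: str) -> Optional[str]:
    """Return an error message, or None if the passphrase is long enough."""
    if len(passphrase) < MIN_LENGTH:
        return (
            f"Too short: use at least {MIN_LENGTH} characters, "
            "for example a full sentence."
        )
    # No allowlist of characters: anything the user can type is accepted.
    return None
```

The check enumerates one kind of badness (too short) and places no restriction on which characters are allowed.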


I agree with the approach. However, specific examples of different badnesses are useful for testing the final product.


I.e. move enumeration of badness from code into test-cases.
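
A sketch of that split: the production check only enumerates goodness, while the test data enumerates known badness. The function names and the particular list of bad inputs are illustrative assumptions.

```python
from urllib.parse import urlsplit


def accepts(uri: str) -> bool:
    """Production code: an allowlist and nothing else."""
    try:
        return urlsplit(uri).scheme in {"http", "https"}
    except ValueError:
        return False


# Test data: each entry is a known attack shape worth re-checking on every change.
KNOWN_BAD = [
    "file:///etc/passwd",
    "FILE:///etc/passwd",
    "ftp://localhost/etc/passwd",
    "javascript:alert(1)",
    "data:text/html,hello",
    "/etc/passwd",
]


def test_known_bad_inputs_are_rejected() -> None:
    for uri in KNOWN_BAD:
        assert not accepts(uri), uri
```

New attack shapes then become one more line in `KNOWN_BAD` rather than one more special case in the validator.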


> filtering out anything that's not letters, numbers, slash or dot is probably a good idea.

This is highly non-trivial once you realize that the world speaks more than ASCII and things like http://www.xn--n3h.net exist.


>This is highly non-trivial once you realize that the world speaks more than ASCII and things like http://www.xn--n3h.net exist.

I was under the impression that requests to and from the server still used ASCII?

That is, the server would see a host header as this:

  Host: www.xn--n3h.net
And not as this:

  Host: www.[snowman icon].net
Anything else is a question of URL-encoding, which if not used would raise interesting bugs with space characters, let alone anything more exotic like snowmen.

Edit for completeness: in my server logs, the GET request for a /[snowman icon] URL is url encoded to

  GET /%E2%98%83 HTTP/1.1


Right but how does the user submit it and what do you put in the href?


If a user copied the URL from the address bar, it will be correctly percent-encoded already.

You can put the same percent-encoded URL in the href attribute of a hyperlink. A properly encoded URL will not contain any character that requires escaping in an HTML context.

When a user clicks on that link, the browser will navigate to the percent-encoded URL but display the snowman icon in the address bar. If the user copies it, it will transparently turn back into the percent-encoded URL. All modern browsers do this.


I just tried doing that with a few domain names containing an umlaut (äöü) and every single time that letter was copied into the clipboard (even though behind the scenes at the request level it would have been encoded). This is what I expect as a regular user. They don't want to deal with encoded, unreadable URLs.


I tried with http://њњњ.срб , which Firefox copies correctly, but Chromium copies as http://xn--g2aaa.xn--90a3ac/ — not very useful.

This is a different mechanism to the path part, where both Firefox and Chromium give https://ru.wikipedia.org/wiki/%D0%A0%D0%BE%D1%81%D1%81%D0%B8... rather than the readable https://ru.wikipedia.org/wiki/Россия


The two methods are punycode [1] and percent encoding [2].

[1] https://en.wikipedia.org/wiki/Punycode

[2] https://en.wikipedia.org/wiki/Percent-encoding
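
Both mechanisms can be tried from Python's standard library, as in this small sketch. The built-in `idna` codec implements the older IDNA 2003 rules, so treat it as an illustration and not a complete IDN implementation; the snowman is written as the escape `\u2603`.

```python
from urllib.parse import quote, unquote

SNOWMAN = "\u2603"

# Host names: each non-ASCII label becomes an ASCII "xn--" label (Punycode).
host = ("www." + SNOWMAN + ".net").encode("idna").decode("ascii")
# host == "www.xn--n3h.net"

# Paths: characters are UTF-8 encoded, then each byte is written as %XX.
path = quote("/" + SNOWMAN)
# path == "/%E2%98%83"

# Percent-encoding is reversible, which is how browsers can show the readable form.
readable = unquote(path)
```

The two encodings apply to different parts of the URL, which is why the host and the path in the examples above look so different.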


If you select the entire URL bar, you get the encoded form. If you leave off the protocol, or just the h, you get it unencoded.


If I'm a user writing a URL, then I will write it as it appears in the URL bar. That means you must be able to accept URLs that contain Unicode.

Keep in mind that Unicode isn't just for emoji. Plenty of languages use characters that are not in ASCII.


I've had an international domain since 2006, and the sad truth is they still aren't widely supported 10 years later (the fuckyeahmarkdown website being a case in point). I don't think people are deliberately filtering out those characters - they just aren't aware that such names are even possible.

In the beginning I used to file bug reports whenever I encountered websites that couldn't handle my domain, but I eventually resigned myself to the fact that most people just don't care. Nowadays I don't even bother trying the unicode most of the time, and just use the punycode version instead.


>Wrong way around: only allow http:// and https://

For myself, it's a subtle change in developer thinking - "what should I allow" vs "what should I exclude" - that's paid off massively over the years.



It should be pointed out that while this was once accepted as gospel, it has been coming under a lot of fire lately. HTML, once arguably the flagship of this principle and its greatest success (I say "arguably" because you can also argue TCP), no longer works this way. HTML5 specifies how bad input should be handled, and if you accept "how to process nominally bad input" as the "real" standard, HTML is now strict in what it accepts. It's just that what it is strictly accepting appears quite flexible.

I'm not a big believer in it myself; "liberal in what you accept" and "comprehensible for security audits" are not quite directly opposed, but certainly work against each other fairly hard. There's a time and a place for Postel's principle, but I consider it more an exception for exceptional circumstances rather than the first thing you reach for.


> HTML5 specifies how bad input should be handled, and if you accept "how to process nominally bad input" as the "real" standard, HTML is now strict in what it accepts.

HTML5 is a shining example of "be liberal in what you accept", and its improved documentation of how to handle bad input (note that bad input is still permitted!) greatly expands HTML's "be conservative in what you send". I think HTML5 is a perfect example of the Robustness Principle.


The "bad input" is, arguably, no longer bad input. The standard has been redefined to strictly specify what to do with that "bad" input, and if you don't handle it exactly as the standard specifies, it won't do what you "want" it to do.

That's not "being liberal in what you accept". Being liberal in what you expect is what we had before HTML 5, where the standard specified the "happy case" and the browsers were all "liberal in what they expect", in different ways. I am not stretching any definitions here or making anything up, because "liberal in what you accept" behaviors in the real world demonstrably work this way; everybody is liberal in different ways. It can hardly be otherwise; it isn't "being liberal in what you accept" if you accept exactly what the standard permits, after all. When liberality is permitted, what happens in practice is that out-of-spec input is handled in whatever the most convenient way for the local handler is, in the absence of any other considerations (such as deliberately trying to be compatible with the quirky internal details of the competition). Browsers leaked a lot about their internal differences if you observed how they tended to handle out-of-spec input. Thus a standard like HTML5 that clearly specifies how to handle all cases now is fundamentally not "liberal in what it accepts" anymore.

Instead, it is a rare, if not unique, example of a standard that has been rigidly specified after a couple of decades of seeing exactly how humans messed up the original standard. It is, nevertheless, now quite precise about what to do about the HTML you encounter. You aren't allowed to be "liberal", you're told exactly what to do.


> The "bad input" is, arguably, no longer bad input.

What? Yes it is! Defined behavior for invalid markup doesn't make that markup valid.

HTML5 doesn't refuse to accept anything that HTML 4 accepted. Defining behavior for invalid markup does not even impact "be liberal in what you accept", the scope of what is accepted hasn't changed. It affects "be conservative in what you send", in particular it more closely matches that half of the principle.


> HTML5 doesn't refuse to accept anything that HTML 4 accepted.

It does. It doesn't accept NET syntax, i.e., `p/This is contents of a p elements/`. (No browser ever supported this, but because HTML 4 is defined to be an SGML application and its DTD allows NET syntax to be used, it is theoretically conforming HTML 4.)


Ah good point, thanks for the correction.


(There's also another load of SGML bits of syntax that browsers have never supported which HTML5 doesn't support. Indeed, HTML 4 has a whole section of such things: http://www.w3.org/TR/html4/appendix/notes.html#h-B.3.3)


I think web browsers are a better example of it. The HTML parsing/DOM tree system usually is pretty forgiving about missing/malformed tags, but still always returns a result rendered as if the HTML had been written to spec.


Didn't we decide in the aftermath of IE6 that this was a bad idea, and that we should be strict in both what we accept and what we emit?


No, we decided that what was important was interoperable implementations: it doesn't matter how you achieve that goal. What's needed is specs that define how to handle all input (it doesn't matter what the spec says: it can define how to handle every single last possible case as HTML5 does, or it can define a subset of inputs to trigger some fatal error handling as XML1.0 does) and sufficient test suites that implementers catch bugs in their code before it ships (and the web potentially starts relying on their quirks).

The problem with IE6 was the fact that it wasn't interoperable (in many cases, every implementation was conforming according to the spec, and there were frequently differences in behaviour in valid input that the spec didn't fully define) and the fact that it had lots of proprietary extensions (and being strict and disallowing any extensions makes it hard to extend formats in general in a non-proprietary way; one option is strict versioning but then you end up with a million if statements all over the implementation to alter behaviour depending on the version of the content).

Some of the worst issues with IE that took the longest for other browsers to match were things like table layout: IE quite closely matched NN4, having invested a lot in reverse-engineering that, as the web depended on the NN4 behaviour in places; Gecko had rewritten all the table layout code from NN and didn't match its behaviour, having been written according to the specs, which scarcely define how to lay out any table even today.


No, be strict, fail fast, and report the errors. Robustness is not achieved by muddling through on a misinterpretation, it is achieved by working toward correctness.


I think "fail fast" can work well in a closed and controlled system but when accepting input from many other parties it's not as practical or desirable.


>https://en.m.wikipedia.org/wiki/Robustness_principle

You need to be careful of where you place the emphasis on that, though:

Be -liberal- in what you accept.

vs

Be liberal in what you -accept-.


Right - you accept URIs. That's fairly liberal.

> Be conservative in what you do

However, you only handle specific schemes and ignore the rest.


Yeah, apologies, I was being pretty petty.

My hopefully-better-expressed point is that it's easy to interpret the robustness principle in different ways, some of which lead to better code, and some of which... don't.


No worries. I think the philosophy is rooted in the fact that you can't control what other parties will send you; you can only control what you send in response. So that's the main thing to keep in mind.

It's sort of like the good life advice you hear occasionally: you can't control other peoples' actions; only your own. Emotional maturity, etc.


Isn't that missing the point of the robustness principle, which is more related to say, networking, and accepting things that aren't strictly to RFC spec, but when sending things, you match the spec to the letter?


I've found it useful in software design in general.


Be liberal in a well-defined way in what you accept. Accepting a variety of input is fine, as long as it is formally defined and recognized. The robustness principle is not an excuse to be sloppy with the input.

(why? see [2] in my other post, "The Science of Insecurity")


https://en.wikipedia.org/wiki/End-to-end_principle

Most of the advice in this thread would accidentally disable ftp support.


Also file:// urls pointing to server shares.

(Anyway, only MSIE supports this from http(s) origin, and then people wonder, why MSIE is still being used).


Just as a concrete example of why this is the right approach, there is at least one enterprise CMS that installs a custom protocol handler that can be used to access any object stored in the CMS if you know or can guess/discover its URI. It's worth assuming there are others that you don't know about.

The other advantage of the whitelist approach here is that you know exactly which protocols you think you support and can design tests for them. For instance to support https, you'll want to check you have decent error handling and do not silently accept potential MitM certificates.


But if your code returns a URL, please don't do this. You should allow PRURLs (protocol-relative URLs). As in "//url".

Especially you, Hubspot. I should be able to set a protocol-relative thank you page URL on your forms. If my user reaches your embedded form on my page as http, you should give them http. If they do it on https, it should give them https.

Yes, I have an axe to grind.


Also make sure to fully resolve the DNS down to all possible IP addresses, and verify that they are all external to your network. And if you're on EC2, make sure nobody is hitting 169.254.169.254.

Really, there are so many gotchas around fetching user-supplied URLs that it's scary.
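To make the address check concrete, here is a rough Python sketch using only the standard library. The helper names `is_public_address` and `resolve_and_check` are made up for illustration, and it assumes that "external" can be approximated by `ipaddress`'s notion of a globally routable address. It does nothing about the resolver giving a different answer later, so the fetch would still need to connect to the address that was actually checked.

```python
import ipaddress
import socket

def is_public_address(addr: str) -> bool:
    """True only for globally routable addresses.

    Loopback, RFC 1918 ranges and link-local addresses such as the
    EC2 metadata endpoint 169.254.169.254 all come back False.
    """
    return ipaddress.ip_address(addr).is_global

def resolve_and_check(host: str) -> list:
    """Resolve host and return every address, raising if any is non-public.

    Hypothetical helper: a real fetcher would then connect to one of the
    returned addresses instead of resolving the name a second time.
    """
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    addrs = sorted({info[4][0] for info in infos})
    for addr in addrs:
        if not is_public_address(addr):
            raise ValueError("refusing non-public address: " + addr)
    return addrs
```

Rejecting the whole hostname when any single record is internal is deliberately strict, which matches the "all possible IP addresses" wording above.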


Importantly, fetching DNS twice (once to check, another to download) is an incomplete solution, since DNS responses can change (cf "DNS rebinding").


And be sure to check it again if there's a redirect. Don't let your URL handling library do this. Alternately send all your traffic through a proxy that can't talk into your network.


I remember this was exactly how a readability service (readability or instapaper or something similar, can't recall now) was attacked. The service allowed you to fetch internal urls and presented them formatted on your phone. A mixture of file:// and internal web urls allowed complete takeover.


Any chance you could dig up the details on that?


You seriously want to disallow gopher://? C'mon, man!


Also, fully decode the input string before doing this processing to make sure you really find all those sequences. A simple, seemingly obvious step that a surprising amount of software neglects to do.
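A small Python illustration of why the order matters, assuming the input is a percent-encoded path; both function names are invented for the example. The naive check looks at the raw string and misses an encoded traversal, while the second decodes first and then inspects path segments.

```python
from urllib.parse import unquote

def naive_check(raw_path: str) -> bool:
    """Flawed: looks for '../' in the still-encoded string."""
    return "../" in raw_path

def has_dot_dot_segment(raw_path: str) -> bool:
    """Percent-decode once, then look for a '..' path segment.

    Backslashes are treated as separators too, since some platforms
    accept them in paths.
    """
    decoded = unquote(raw_path)
    segments = decoded.replace("\\", "/").split("/")
    return ".." in segments
```

Whatever consumes the path afterwards should use the same decoded value that was checked; decoding again after validation reopens the hole.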


I believe URIs today accept all sorts of characters outside ASCII.


This is the correct answer and best practice. Be conservative in what you accept and liberal in what you produce.


(Also be weary of imagetragick-type bugs too, where the URL starts innocuously and then contains some shellcode, because you pass the URL to something that'll paste it into a system() call)


I certainly am weary of bug branding...


I only started seeing this weary/wary misspelling in recent years. They don't sound alike, and they don't really look alike. Did cell phone spellcheckers give rise to this one?


> /ˈwɪəɹi/

> /wɛəɹ.i/

I'm not a native English speaker and I would never have guessed that they were not pronounced the same. (I'm still not even sure how it is pronounced as ɪ doesn't seem to exist in French and I always considered the examples I find were just "i").

I know English pronunciation is generally weird, but seriously how can you expect wary to sound like wear while weary sounds like something else?


"wear" is the word that's doing things wrong here.

Would "geary" and "gary" avoid the same mistake?

> seriously how can you expect

I can also question how you would expect two words that differ by their vowel to sound the same.


'weary' means tired of something, 'wary' means cautious of something.


Because our language is a mashup of multiple other languages, and thus the rules are inconsistent.

As for pronunciation of those two:

weary is pronounced like "ear"

wary is pronounced like "air"


Right, and in the last year or so I've started to see people getting these terms confused.


I think that this one:

> I certainly am weary of bug branding...

means exactly what it says: the poster is tired of, not apprehensive about, it.

EDIT: Oops, sorry, I guess you meant your grandparent's

> Also be weary of imagetragick-type bugs

For what it's worth, I have seen this as a genuine confusion, not typo, of non-native English speakers. (Think, for example, of 'compose' versus 'comprise', and even of 'who' versus 'whom', which neither sound nor look alike, but which are frequently confused even by native speakers.)


I know, me too, but I didn't name it.

There's a section on the main page imagetragick.com talking about branding and how they got no traction without one.


"Despite reporting the problem to the author on Friday, and following up the report via Twitter this has not yet been fixed, but after four days I assume I'm not alone in spotting this."

Giving someone a weekend to fix something doesn't exactly sound like responsible disclosure. I understand if you get excited because you found a flaw but if you find something like this please be more responsible with publishing your findings.


Well, I found the same by pure chance before reading this article, and I suspect many more in the HN crowd did.

If already half a dozen people on HN report they’ve found it and emailed the person about it, it’s likely it’s too late for responsible disclosure.


Agreed! Better get that karma before someone else does /snark


The issue is more that if so many people have already found it, who else has?


Disclosing it publicly before it's been fixed only increases the number.


The vulnerability is happening because the web server is running as a user with elevated permissions. If it were running as a user who only had permissions to read files from the webserver's serving directory this would obviate the problem.

Not that you shouldn't also validate input (in a whitelist rather than blacklist as others have said): just goes to show that security is a multifaceted problem with lots of ways to be paranoid. :)


You don't need elevated privileges to read /etc/passwd


That is correct.

    sebboh@namagiri:~$ ls -al /etc/passwd
    -rw-r--r-- 1 root root 1450 Jun  7 09:08 /etc/passwd

So, there's this idea of running publicly facing services as a user which has fewer privileges than a normal interactive user. This user might be called 'nobody' or 'apache', etc.

However, on your average distribution, /etc/passwd is accessible to the 'nobody' user or the 'apache' user.

There are various workarounds:

+ run your service in a chroot or jail.

+ employ some kind of SELinux thing.

+ ...?

I've only ever used the first one.



That doesn't mean you should allow your app to access `file://` ... (Nor allow it out of a chroot.)

Simply knowing some usernames on your system could provide an attacker with clues...


Also be sure not to allow loopback connections (ie your own site or localhost) or you can cause a deadlock if the user requests the same URL to download from your site recursively. Choose an appropriate timeout too to prevent users tying up backend processes with a HTTP server that is slow to respond, and don't follow Location headers to avoid bypassing of your initial filters. A Range header should also be used to prevent users from telling your server to download multi-GB files and causing bandwidth waste / denial of service.


> Also be sure not to allow loopback connections (ie your own site or localhost) or you can cause a deadlock if the user requests the same URL to download from your site recursively.

Keep in mind that any arbitrary domain can point DNS to a loopback or LAN address. So the code that fetches a URL needs to include this filtering.

> don't follow Location headers to avoid bypassing of your initial filters

Often, you'll want to follow redirects, but re-apply your filters when doing so.

> A Range header should also be used to prevent users from telling your server to download multi-GB files and causing bandwidth waste / denial of service.

You can't count on support for Range, and the server might also just behave unexpectedly. Your fetching code needs to limit how much data it accepts from a server, and drop the connection after some upper bound.
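As a sketch of that upper bound in Python: the function below reads from any file-like object in chunks and gives up as soon as the total passes a cap. `read_capped` and `ResponseTooLarge` are names I made up; the stream could be an HTTP response object, but here it is demonstrated on an in-memory buffer.

```python
import io

class ResponseTooLarge(Exception):
    """Raised when the peer sends more than we are willing to accept."""

def read_capped(stream, max_bytes, chunk_size=65536):
    """Read a file-like object to the end, aborting once max_bytes is exceeded."""
    received = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return bytes(received)
        received += chunk
        if len(received) > max_bytes:
            # The caller is expected to close the connection at this point.
            raise ResponseTooLarge("body exceeded %d bytes" % max_bytes)

# Demonstration on an in-memory stream standing in for a response body.
body = read_capped(io.BytesIO(b"x" * 10), max_bytes=100)
```

Because the cap is enforced while reading, it holds whether or not the server honours Range or sends an honest Content-Length.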


> Your fetching code needs to limit how much data it accepts from a server, and drop the connection after some upper bound.

And if you accept compressed responses, remember that they can be a lot bigger when decompressed: e.g. gzip allows for factor 1000.
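One way to cap the decompressed size in Python, as a sketch: `zlib` decompression objects accept a `max_length` argument, so the payload can be inflated incrementally and abandoned once it passes a limit. The names `bounded_gunzip` and `DecompressedTooLarge` are hypothetical, and the whole compressed body is assumed to be in memory already.

```python
import gzip
import zlib

class DecompressedTooLarge(Exception):
    """Raised when the inflated payload would exceed the allowed size."""

def bounded_gunzip(data: bytes, limit: int) -> bytes:
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    out = bytearray()
    buf = data
    while buf and not d.eof:
        # Ask for at most one byte past the limit, so an overflow is
        # detectable without inflating the whole payload.
        out += d.decompress(buf, limit + 1 - len(out))
        if len(out) > limit:
            raise DecompressedTooLarge("more than %d bytes" % limit)
        buf = d.unconsumed_tail
    return bytes(out)

# A highly compressible payload: ten million zero bytes.
bomb = gzip.compress(b"\0" * 10_000_000)
```

Calling `bounded_gunzip(bomb, 1_000_000)` raises instead of allocating the full ten million bytes.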


Also be very wary of ../ or possibly ..\ in URIs.

Say you have http://example.org/download?file=release/software-1.0.zip so that you can log download statistics.

If you just fetch GET["file"] and return it, you're gonna have a bad time if they try http://example.org/download?file=../../../../../etc/hosts (browsers strip it out automatically, but it's easy enough to type such a request into a telnet session.)

And don't just assume your safeguard works. Act like a hacker and try it out on yourself to make sure it works. Add an assert(validate("../") == false); at startup on your server, so it won't even run otherwise. Forbid the use of fopen() and instead go through your own file::open() function that calls validate() internally, then #define fopen ERROR_DONT_USE after (or whatever equivalent for the language you use.)

It is a hostile world, you can never be too safe by adding multiple (even seemingly redundant) layers of protection.
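Here is roughly what that looks like in Python, as a sketch rather than a finished safeguard. `validate` is the hypothetical function from the comment above, `/srv/files` is an invented base directory, and paths are assumed to be POSIX-style. It normalizes textually and does not follow symlinks, so a real deployment would probably also want `os.path.realpath`.

```python
import posixpath

def validate(base: str, requested: str):
    """Return the normalized path if it stays under base, else None."""
    base = posixpath.normpath(base)
    # join() discards base entirely when requested is absolute, which the
    # containment check below then rejects.
    candidate = posixpath.normpath(posixpath.join(base, requested))
    if candidate == base or candidate.startswith(base + "/"):
        return candidate
    return None

# Startup self-test in the spirit of the assert suggested above:
# refuse to run at all if the safeguard ever stops working.
assert validate("/srv/files", "../") is None
assert validate("/srv/files", "../../../../../etc/hosts") is None
```

Checking the result against the base, instead of trying to clean up the input, means odd encodings of ".." either normalize away or fail the check.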


Also, you can't just find and replace ../

Consider: ....// which becomes ../
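This is easy to reproduce in Python. The sketch below contrasts a single-pass strip with simply rejecting any path that contains a ".." segment; both function names are invented for the example.

```python
def naive_strip(path: str) -> str:
    """Flawed: removes '../' in one pass, which can create a new '../'."""
    return path.replace("../", "")

def has_parent_segment(path: str) -> bool:
    """Reject instead of rewriting: is any segment exactly '..'?"""
    return ".." in path.split("/")

cleaned = naive_strip("....//etc/passwd")  # the "cleaned" value is "../etc/passwd"
```

Rejecting the request outright sidesteps the problem, since there is no rewritten string for an attacker to shape.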


I've done something similar in node to ensure URIs like those cannot walk up past a base directory:

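  var path = require('path')

  // Normalizing against '/' first collapses any leading '..' segments,
  // so the result can never climb above base.
  function saferesolve(base, target) {
    var targetPath = '.' + path.posix.normalize('/' + target)
    return path.posix.resolve(base, targetPath)
  }

  // resolve() returns an absolute path, so compare against the resolved base
  var root = path.posix.resolve("./datasource")
  saferesolve("./datasource", "a/b") === root + "/a/b"
  saferesolve("./datasource", "a/b/../c") === root + "/a/c"
  saferesolve("./datasource", "../..") === root
  saferesolve("./datasource", "../../a/b") === root + "/a/b"
  saferesolve("./datasource", "../../a/b/..") === root + "/a"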


I've done something similar in Java, though more focused on file paths https://github.com/aJanuary/basepath


Also make sure you don't follow 301/302, or someone can set up a http link which redirects to file:// .


For those using curl for this, it has flags to specify protocol filters for initial request and redirects

https://curl.haxx.se/libcurl/c/CURLOPT_PROTOCOLS.html


Or just, when following 301/302, call the same function again – which then validates the link completely again.


I'm not sure why you're getting downvoted, it's a little unfair for people to downvote a sensible suggestion without explaining why.

One possible downside is that someone could Redirect A -> B and redirect B -> A, which risks tying up your resources following links, but browsers limit how many redirects will be followed, so it ought to be possible to limit redirects.
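A sketch of both ideas together in Python: every hop goes back through the same scheme check, and a hop counter stops A -> B -> A loops. The transport is passed in as `fetch_once`, a hypothetical callable returning `(status, location, body)` with automatic redirect handling turned off, so the example makes no claims about any particular HTTP library. The limit of 5 is an arbitrary choice.

```python
from urllib.parse import urljoin, urlsplit

ALLOWED_SCHEMES = {"http", "https"}
MAX_REDIRECTS = 5  # arbitrary cap for this sketch

class UnsafeURL(ValueError):
    pass

class TooManyRedirects(ValueError):
    pass

def check_url(url: str) -> None:
    """The single validation function, applied to every hop."""
    if urlsplit(url).scheme.lower() not in ALLOWED_SCHEMES:
        raise UnsafeURL("scheme not allowed: " + url)

def fetch_with_checks(url: str, fetch_once):
    for _ in range(MAX_REDIRECTS + 1):
        check_url(url)
        status, location, body = fetch_once(url)
        if status in (301, 302, 303, 307, 308) and location:
            # Resolve relative Location values, then loop to re-validate.
            url = urljoin(url, location)
            continue
        return body
    raise TooManyRedirects("gave up after %d redirects" % MAX_REDIRECTS)
```

With this shape, a 301 pointing at file:///etc/passwd is rejected by `check_url` on the next iteration, before anything is opened.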


Wait, what? Really!? Can this be used to get shell access somehow? I'm having trouble figuring out how you go from Chrome opening a remote file to _bad thing happens_.


It's not necessarily browsers, they often mitigate this.

The problem is with services that consume other content. For example you might have a service which generates thumbnails of sites.

That service might GET https://attacker.example.org/301.html which itself might 301 back to file:///etc/passwd . If there is insufficient validation then a screenshot of the contents of /etc/passwd might be returned by the service.

All of that happens outside the context of browsers and sandboxing.

For more of that kind of thing, here's an interesting write up on some vulnerabilities found in Pocket. https://www.gnu.gl/blog/Posts/multiple-vulnerabilities-in-po...


The u= parameter in the OP's article also is vulnerable (even if http/https are whitelisted, file:// blacklisted, etc) to the #10 vulnerability on the OWASP Top Ten 2013 list, namely Unvalidated Redirects and Forwards. https://www.owasp.org/index.php/Top_10_2013-A10-Unvalidated_...

I can easily get you to click the link to drive-by malware, adult sites, pharma, phishing, etc. because the site doesn't ensure where the link is actually going to.
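For the redirect case specifically, a minimal Python sketch of validating the destination might look like the following. `is_safe_redirect` and the set of allowed hosts are assumptions for illustration, and the policy (same-site paths, or http/https to an explicit host list) is just one reasonable choice.

```python
from urllib.parse import urlsplit

def is_safe_redirect(target: str, allowed_hosts: set) -> bool:
    """Allow same-site absolute paths, or http(s) URLs to known hosts."""
    # Browsers may treat backslashes like slashes, and '//host' is
    # protocol-relative, so refuse both up front.
    if "\\" in target or target.startswith("//"):
        return False
    parts = urlsplit(target)
    if not parts.scheme and not parts.netloc:
        return target.startswith("/")
    return parts.scheme in ("http", "https") and parts.hostname in allowed_hosts
```

Anything that fails the check can be replaced with a fixed landing page instead of being followed.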


This is the confused deputy problem. The most general solution to this class of vulnerabilities, SELinux, has been largely ignored. Does SELinux need more work to "bring it to market", or is it just too complicated and needs to be simplified?


SELinux is more the stopgap measure for when everything else has failed already, or at least it should be, preventing the most harmful things like reading random stuff from /etc. It is not something where I'd say "I got SELinux, now I don't need to validate user input".

In the concrete example from the article, the process needs access to /etc/hosts to do name resolution, yet it should not send this information out to who knows who. How do you model that as a SELinux config? You cannot really. Unless you introduce a dedicated class of uncoupled, distinct, identifiable processes acting as agents for resolving hosts with the help of /etc/hosts and whitelist them in SELinux... Which adds a whole lot of complexity. And you still have to make sure your new fancy agents cannot be tricked into giving up sensitive information.

So at the end of the day, you should do defense in depth which of course should include user input validation and probably SELinux as well.


SELinux is a real burden for even motivated sysadmins.

However, if you have a single image that you are going to make millions of copies of then the effort vs reward might slant in SELinux's favour, e.g. Android does use SELinux.


SELinux is not a solution to this class of vulnerabilities; it's a backup plan. The right way is to not have stupid APIs that are easy to do dangerous things with. Compare PHP's fopen wrappers with the requests Python package, for example.


configuring SELinux is way too complicated for the average user.


And poorly documented. I learned what little I know about it from online tutorials, not the docs


Then use one of the many alternatives available nowadays: there is AppArmor, Tomoyo, and GRSec's RBAC, all performing the same MAC job.


Quick FYI. It's difficult to filter URIs with regex because their grammar is context-free, which is a larger class of languages than regular expressions can match.

Recursive descent parsers are the way to go, like the URI objects in most languages.

Ex., in Java: URI uri = URI.create(inputString); uri.getScheme();


The original link seems down, but it has been saved at archive.org: http://web.archive.org/web/20160912105232/https://blog.steve...


Looks fine for me (the author), but glad to see a cached copy; the content is pretty minimal.

Rate-limiting hasn't kicked in, and I'm seeing a steady stream of visitors.


It seems to be inaccessible for me here in Australia, but works when I access it via a VPN to Europe. Possibly traffic is blocked from certain networks/regions.


Also, if you're using ruby's open-uri: http://sakurity.com/blog/2015/02/28/openuri.html


This is a special case of a Server-Side Request Forget vulnerability. Validating schemes is part of the answer but not the whole answer because attackers can still forge requests to internal resources you have firewalled off from the internet.

These were recently released to help people deal with these issues since the details can be finicky: http://blog.includesecurity.com/2016/08/safeurl-server-side-...


Really. Surely if I tried to enter http://localhost:8983/solr/admin/cores?action=UNLOAD&core=co... the back service would be password protected.

Oh, wait a minute.


Should say "Server-Side Request Forgery", damn mobile.


file:// urls are awesome! People forget about them and nobody actually understands what to expect from these urls. Consider cross-origin policy implications. Is file:///home/john/bar.html on the same origin as file:///home/john/foo.html ?

Also, XHR is totally allowed to go to file://


> Actually the actual output all newlines had been stripped.

Not stripped, but replaced by spaces. Also, the linked image looks like /etc/passwd, not /etc/hosts.

> Weird.

Not weird. That's how whitespace in HTML works.


OP tested his hack on an online Markdown converter, so it probably has more to do with Markdown's treatment of whitespace than HTML's.

As it happens, Markdown ignores most non-consecutive newlines.


It's a converter taking HTML as input, converting it to Markdown.


Another fun one that's becoming more common with cloud hosted services: access to internal metadata services that issue credentials.

Blocking file:// doesn't prevent access to http://169.254.169.254/latest/meta-data/iam/security-credent....


I usually only do this, if I'm putting a user-submitted URL into file_get_contents: if(substr($_GET['url'], 0, 4) != 'http') { exit; }


> file_get_contents

FYI, if you ever need a quick performance boost, switch to using curl for these calls. The difference is noticeable to the naked eye.


If that's the only validation on calls to file_get_contents, that could very easily be bypassed. Entering something like just "/etc/passwd" for example.


/etc != http :)


Ah yes, you're right. Totally mis-read the code.

It'd still leave access to any files in the same (or sub) directory starting with http, which realistically would probably be none but still something to bear in mind.


Right, best to check the whole URL. Something like "http/../../../etc/passwd" might get through otherwise.


Like....

   $url = rawurldecode($_GET['url']);
   $url_without_protocol = str_replace(array('https://', 'http://'), '', $url);
   $protocol = (stristr($url, 'https://') ? 'https' : 'http');
   $page = file_get_contents($protocol . '://' . $url_without_protocol);


What about

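    from urllib.parse import urlparse  # Python 3; on Python 2 the module is urlparse

    def check_scheme(url):  # hypothetical handler helper
        u = urlparse(url)
        if u.scheme not in ['http', 'https']:
            return 400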


There are other things to consider when fetching data from untrusted HTTP resources.

A. Can it be used to DDoS

B. How much data do you read (download, inbound may cost you real money)


I think handling (rejecting) this in the libraries that handle URIs would be the wrong fix. While this is a vulnerability in a web app, it is a handy feature in an app that runs locally, and the library doesn't really know the context. It's up to the app to filter the URLs according to its isolation requirements.


This is no different than failing to validate input parameters, and letting an attacker take a dump of your database. A service connecting to a remote location should have checked the URI scheme.


I tried Python requests and Common Lisp drakma, and neither of them can handle the "file://" URL scheme. Which HTTP client libraries are actually vulnerable to this?


The most popular HTTP library is libcurl, and it supports file URLs by default.


Not to nitpick but it's not really a vulnerability.

It's including the correct protocols for URIs.


urllib/urllib2's urlopen for example. I would assume it would be a bug for the requests library to allow file:///


Python's urllib(2), Perl's LWP


Don't write server code that opens URIs that come in as input, period. If you take URIs as input, do it only to turn them around and spit them out into some Javascript sent back to the same session. Whatever can or cannot be accessed this way is the browser's problem.


This works as long as your business model doesn't have opening users' URIs at its core, e.g. a monitoring service.


A blacklist is generally not a good idea in any case. You should use a whitelist instead.


fuckyeahmarkdown fixed it just now.



