The term you probably want to look for in your web framework is "encoding".

I don't like "sanitization" personally, because it sounds like you're removing "bad stuff", but in general, "bad stuff" is not identifiable or removable because "bad stuff" is highly context-dependent, plus a lot of times the "bad stuff" is perfectly legitimate [1]. Apostrophes are "bad stuff", because they can break out of SQL queries and HTML tags, but they are also parts of people's names. Double-quotes are "bad things", but they are legitimately part of all sorts of real data. Any "sanitize(string)" function is by definition wrong because it has no place for a context to go, and it will do bad things to your data.
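To make the apostrophe point concrete, here is a minimal sketch using Python's standard `sqlite3` module. The table and the sample name are made up for illustration. The value is passed to the database separately from the SQL text, so nothing needs to be stripped from it and the apostrophe survives intact.

```python
import sqlite3

# In-memory database with a hypothetical one-column table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")

name = "Conan O'Brien"  # legitimate data that contains an apostrophe

# The ? placeholder binds the value as data; it is never spliced into the SQL text,
# so the apostrophe cannot terminate a string literal in the query.
conn.execute("INSERT INTO people (name) VALUES (?)", (name,))

stored = conn.execute("SELECT name FROM people").fetchone()[0]
print(stored)
```

The stored value is exactly what the user typed, which is what a context-free "sanitize" step could not guarantee.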

One of the items on my short checklist for examining an HTML templating language is "does the simplest possible way to dump out a string to the user at least do HTML encoding on the value?" That is,

    x = "<>"
    template = compileTemplate("{x}")
    template.Dump({x: x})

for whatever the simplest output of a value is, should output "&lt;&gt;"; if it outputs "<>", you've got a templating language that you ARE going to write XSS attacks in, no matter how careful you are. The time you want to dump out non-encoded text is the exception, not the rule.
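As a rough illustration of "encode by default, opt out explicitly", here is a toy renderer in Python. The names `render` and `Raw` are hypothetical and not taken from any real templating library; only `html.escape` and `re.sub` are real standard-library calls.

```python
import html
import re


class Raw(str):
    """Marker type: a string the caller explicitly declares to be trusted HTML."""


def render(template: str, values: dict) -> str:
    """Fill {name} placeholders, HTML-encoding every value unless it is Raw."""

    def substitute(match: re.Match) -> str:
        value = values[match.group(1)]
        if isinstance(value, Raw):
            # The exception, not the rule: the caller asked for no encoding.
            return str(value)
        return html.escape(str(value), quote=True)

    return re.sub(r"\{(\w+)\}", substitute, template)


print(render("{x}", {"x": "<>"}))
print(render("{x}", {"x": Raw("<b>hi</b>")}))
```

The simplest call path encodes, and skipping the encoding takes a deliberate extra step that is easy to spot in review.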

Bonus points for being even more aware of the context and correctly encoding things in JavaScript context vs. HTML context, etc. This isn't a magic wand that fixes everything, but in general, if it does default to a blind HTML-encode, it at least means that if you screw up the encoding, instead of a security failure you'll get the user seeing some ugly stuff on their screen like &lt; instead.
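A small sketch of why the context matters, in Python. Both function names are hypothetical. The JavaScript variant assumes the value ends up inside an inline script block, and it leans on `json.dumps` to produce the quoted literal before escaping the characters an HTML parser would care about.

```python
import html
import json


def encode_html_text(value: str) -> str:
    """Encode a value for an HTML text node or a quoted attribute value."""
    return html.escape(value, quote=True)


def encode_js_string_literal(value: str) -> str:
    """Encode a value as a JavaScript string literal for an inline <script> block."""
    # json.dumps yields a double-quoted literal with quotes and backslashes escaped.
    literal = json.dumps(value)
    # Also escape <, > and & so a value like "</script>" cannot end the script element.
    return (
        literal.replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )


value = '</script><script>alert("hi")</script>'
print(encode_html_text(value))
print(encode_js_string_literal(value))
```

The same input gets two different encodings, and using the HTML one inside a script block (or the reverse) would be wrong in both directions.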

[1]: Although, technically, I think it's acceptable for an HTML encoding function to just eliminate the ASCII control characters other than newline, carriage return, and tab, rather than encode them. Those are just asking for trouble, even if you encode them. Especially NUL. Even in 2019, best to keep NULs out of places they don't belong.
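A minimal sketch of that footnote in Python, assuming "ASCII control characters" means code points 0x00 through 0x1F plus DEL (0x7F). The function name `html_encode` is made up for illustration.

```python
import html

# Tab, newline and carriage return are kept; every other ASCII control
# character is deleted rather than encoded. Including DEL (0x7F) is an assumption.
_KEPT = {0x09, 0x0A, 0x0D}
_DROP_CONTROLS = {cp: None for cp in range(0x20) if cp not in _KEPT}
_DROP_CONTROLS[0x7F] = None


def html_encode(value: str) -> str:
    """Drop stray control characters (including NUL), then HTML-encode the rest."""
    return html.escape(value.translate(_DROP_CONTROLS), quote=True)


print(repr(html_encode("a\x00b<\n")))
```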




Exactly, sanitization is a misnomer. If you are concatenating plain text together with HTML then you have an app which is functionally broken when someone with an apostrophe in their name tries to use it -- it's not just a matter of security. The strings must be in the same format (i.e. both valid HTML fragments) before you concatenate them, or the result will be unparsable garbage.

And the idea of "sanitization at input" is especially ridiculous: how can you know what you will be concatenating that input with until you actually do it? I.e., is it being inserted into some HTML? Is it going in an attribute value or a text node? What about outputting JSON?


Right.

This is why we typically speak about defense in depth. Input sanitization works best when applied to inputs with a known, expected format, like a phone number or a date of birth.
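For fields like those, "sanitization" is really validation: parse the input into a known shape or reject it. A rough Python sketch follows. The accepted phone format (optional `+`, then 7 to 15 digits after separators are dropped) and the ISO `YYYY-MM-DD` date format are assumptions picked for the example, and the function names are hypothetical.

```python
import re
from datetime import date

# Assumed shape: optional leading +, then 7 to 15 digits.
_PHONE_RE = re.compile(r"\+?[0-9]{7,15}")


def parse_phone(raw: str) -> str:
    """Return the number in compact form, or raise ValueError."""
    # Drop common separators: whitespace, parentheses, dots and hyphens.
    compact = re.sub(r"[\s().-]", "", raw)
    if not _PHONE_RE.fullmatch(compact):
        raise ValueError("not a phone number")
    return compact


def parse_dob(raw: str) -> date:
    """Parse an ISO date of birth, rejecting dates in the future."""
    parsed = date.fromisoformat(raw.strip())  # raises ValueError on malformed input
    if parsed > date.today():
        raise ValueError("date of birth is in the future")
    return parsed


print(parse_phone("+1 (555) 010-9999"))
print(parse_dob("1990-02-11"))
```

This only narrows what gets stored. The values still need encoding for whatever context they are eventually written into.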

Output encoding is the real solution: at that point we know where the data is going to end up (how it will be displayed), so we can ensure that it's in the correct format and that the parser for that format won't interpret it as code instead of data. That is, HTML attribute, HTML text, JSON, JavaScript, etc.



