I am really not trying to be a troll. Genuinely don't understand this concept of safe strings.

How could software even look at text content and determine safeness? There are cases where string input might be limited to just letters or numbers, but often it's not. As soon as punctuation or Unicode (non-English users) is on the table, text is basically anything and there is no general defense against that.

Parsing and static types could have restrictions on string length, min or max value for numbers, how many items in an array, but it cannot make text safe generally-speaking by any meaning of safe. It has no awareness of how the content will be used.



We're not talking about some absolute, metaphysical "safe strings" that guard against every possible flaw, but rather about better supporting an already existing safety check.

If you never thought to write an escaping function in the first place, you can't write a SqlString safe type either, obviously. Equally obviously, if you can write an escaping function but you can't write a function that detects a DROP TABLE, then you can write a SqlString type but not a SelectQueryString type.

The idea being discussed here is simply that if you do write an escaping function, its signature should not be (String -> String) or (String -> Boolean) or, God forbid, (String -> void), but something like (String -> SqlString).

This ensures that whatever you feed to your database must have gone through such an escaping function, instead of expecting the programmer to simply remember it. Also prevents you from accidentally escaping a string twice.

(Obligatory pedantic disclaimer: if you're working with modern databases, please don't escape your own strings and just use parameters instead.)
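To make the signature point concrete, here is a minimal sketch in Python. The names SqlString, escape_sql and run_query are made up for illustration, and the quote-doubling is a toy escaping rule, not something to ship; as the disclaimer says, parameters are the better tool. Python only checks the annotations with an external type checker such as mypy, so the sketch also checks at runtime.

```python
class SqlString:
    """Text that has already passed through escape_sql (hypothetical type)."""

    def __init__(self, text: str) -> None:
        self.text = text


def escape_sql(raw: str) -> SqlString:
    # Toy escaping rule for illustration: quote the value and double any
    # single quotes inside it. A real project should prefer parameters.
    if not isinstance(raw, str):
        # A SqlString is not a str, so escaping twice fails here.
        raise TypeError("escape_sql expects a plain str")
    return SqlString("'" + raw.replace("'", "''") + "'")


def run_query(name: SqlString) -> str:
    # Stand-in for a database call: it only accepts escaped text and
    # returns the SQL it would have sent.
    if not isinstance(name, SqlString):
        raise TypeError("run_query expects a SqlString")
    return "SELECT * FROM users WHERE name = " + name.text


sent = run_query(escape_sql("O'Brien"))
```

Passing a plain str to run_query, or calling escape_sql on something that is already a SqlString, raises TypeError in this sketch; with a static checker the same mistakes would be reported before the code runs.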


I agree with you. The concept of a safe string in isolation is too abstract to be meaningful. A correct API, such as interacting with a database only using explicit parameters (instead of string-concatenating to build up a query), is always safe, irrespective of the provenance of the input. The input could be a virus or a DB command and this would still be 100% safe.
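As a small illustration of the parameter approach, here is a sketch using Python's standard sqlite3 module with an in-memory database. The table and the hostile input are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

# Input that would be dangerous if concatenated into the SQL text.
hostile = "Robert'); DROP TABLE users;--"

# The value travels separately from the SQL text, so it is only ever data.
conn.execute("INSERT INTO users (name) VALUES (?)", (hostile,))
rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()
```

The hostile text is stored and retrieved verbatim, and the table is still there afterwards.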

What people mean by safe string in more specific contexts however, is meaningful, but the word "safe" is an unfortunate choice. Instead, think "SqlEscapedString" or "HtmlEscapedString" or "UriEscapedString". These are much more meaningful, and their use-case should be obvious. You can convert an arbitrary input "String" type into a "SqlEscapedString" and then safely use simple string concatenation to build up a query. This is useful in situations where non-parameter parts of the query are dependent upon the input in ways that are not safely exposed in the DB query API. For example, building up complicated WHERE clauses or using dynamic table names.

So you can write something like the following (in pseudo code):

    String tableName = ParseFromUntrustedPacket( packet );
    SqlEscapedString sqlTableName = new SqlEscapedString( tableName );
    SqlEscapedString query = SqlEscapedString.Unsafe( "SELECT * FROM " ) & 
        sqlTableName & 
        SqlEscapedString.Unsafe( " WHERE Foo is NOT NULL" );
    var result = connection.Execute( query );

The benefit of this kind of approach is that if that last function call has the signature of "Execute( SqlEscapedString q )", then it is basically impossible to accidentally pass an unescaped (unsafe) input string into it. At every step, the developer is forced to make a decision to either pass in a potentially dangerous query snippet using "Unsafe(...)" or to make input strings safe by escaping them.

Similarly, this method converts Strings into a different type when escaping them, making it (almost) impossible to accidentally double-escape inputs, which is an issue commonly seen in some environments such as complex shell scripts.
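One way the pseudo code might look in runnable form, assuming Python and sqlite3. The class, its unsafe() and identifier() constructors, and the execute() helper are all hypothetical names; the identifier quoting follows SQLite's rule of wrapping the name in double quotes and doubling any double quotes inside it, so treat it as a sketch for that one database.

```python
import sqlite3


class SqlEscapedString:
    """SQL text that is either escaped input or an explicitly trusted snippet."""

    def __init__(self, sql: str) -> None:
        self._sql = sql

    @classmethod
    def unsafe(cls, trusted_sql: str) -> "SqlEscapedString":
        # The developer vouches for this snippet; nothing is escaped.
        return cls(trusted_sql)

    @classmethod
    def identifier(cls, name: str) -> "SqlEscapedString":
        # Only plain str is accepted, so escaped text cannot be escaped again.
        if not isinstance(name, str):
            raise TypeError("identifier expects a plain str")
        return cls('"' + name.replace('"', '""') + '"')

    def __add__(self, other: object) -> "SqlEscapedString":
        # Concatenation only works between two SqlEscapedString values.
        if not isinstance(other, SqlEscapedString):
            return NotImplemented
        return SqlEscapedString(self._sql + other._sql)

    def __str__(self) -> str:
        return self._sql


def execute(conn: sqlite3.Connection, query: SqlEscapedString) -> list:
    if not isinstance(query, SqlEscapedString):
        raise TypeError("execute expects a SqlEscapedString")
    return conn.execute(str(query)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (Foo TEXT)")
conn.execute("INSERT INTO items VALUES ('bar'), (NULL)")

table_name = "items"  # imagine this was parsed from an untrusted packet
query = (
    SqlEscapedString.unsafe("SELECT * FROM ")
    + SqlEscapedString.identifier(table_name)
    + SqlEscapedString.unsafe(" WHERE Foo IS NOT NULL")
)
result = execute(conn, query)
```

Mixing a plain str into the concatenation, or handing a plain str to execute(), raises TypeError here, which is the runtime equivalent of the compile error the typed version would give.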

ASP.NET for example does something similar with IHtmlString.


Oh, then you're reading too much into "safe" and assuming it means "can never do any bad if used in any situation, must need an AI".

It's the same way software can look at a number that's going to control a water heater and determine whether it's a safe temperature for a human body or not. You the programmer chose some limits. When the user enters a number, it's an unsafe value by default, because you haven't validated it.

After you validate it, you have something which is 'safe' to pass around to anywhere in your code, like a security checkpoint says that random people are unsafe, and when they enter a building their details are checked, and then they are OK to enter and go anywhere inside the building.

You, the programmer, choose what things you consider safe and unsafe and those words mean validated or unvalidated, verified or unverified, checked or unchecked, approved or unapproved, known or unknown, outside or inside, or any other pair.
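The water heater case could be sketched like this in Python. The limits and all names (SafeTemperature, parse_temperature, set_heater) are assumptions picked for the example, not recommendations for a real heater.

```python
from dataclasses import dataclass

# Hypothetical limits chosen by the programmer.
MIN_CELSIUS = 10.0
MAX_CELSIUS = 45.0


@dataclass(frozen=True)
class SafeTemperature:
    """A temperature that has passed the range check."""

    celsius: float

    def __post_init__(self) -> None:
        # Runs on every construction, so an out-of-range value never exists.
        if not (MIN_CELSIUS <= self.celsius <= MAX_CELSIUS):
            raise ValueError(f"{self.celsius} C is outside the allowed range")


def parse_temperature(user_input: str) -> SafeTemperature:
    # The checkpoint: raw user input goes in, a validated value comes out.
    return SafeTemperature(float(user_input))


def set_heater(target: SafeTemperature) -> str:
    # Code inside the building only ever deals with validated values.
    if not isinstance(target, SafeTemperature):
        raise TypeError("set_heater expects a SafeTemperature")
    return f"heater set to {target.celsius:.1f} C"


message = set_heater(parse_temperature("38"))
```

Here "safe" means nothing more than "passed the check the programmer chose", which is all the word is meant to carry in this thread.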

> it cannot make text safe generally-speaking by any meaning of safe

If something can't be done, ever, in any situation, that probably isn't what people are talking about doing.


The point that's being made here is if you make safe and unsafe strings separate types, in a strongly-typed system, it is impossible to use an unsafe string where a safe string is expected or vice versa. When you have a boundary function that turns an unsafe string into a safe string (e.g., escaping), or that rejects strings that are not safe, you can have a system where all the inputs are unsafe and are forced to go through such a mechanism exactly once to guarantee freedom from double-escaping issues.


I think the above definition of "gets turned into safe strings early" isn't necessarily a clear one.

The general idea is to separate strings into different types, with different rules. E.g. an HTML templating engine will always escape strings unless they're of a specific type (e.g. in Python the popular MarkupSafe library calls the type "Markup") that says it's ok to include as raw HTML (e.g. because it's the output of a sanitizer), an SQL query builder will only accept specially tagged strings as non-parameters into queries, ..., which reduces the likelihood of the programmer accidentally using a string in a place where it isn't correct to use. Username field doesn't have any special rules attached? All code will reject unsafe use as far as possible.
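A standard-library-only sketch of the templating rule, in Python. SafeHtml, render and paragraph are hypothetical names used to show the idea; real template engines do considerably more than this.

```python
import html


class SafeHtml(str):
    """Marks text as OK to include as raw HTML (hypothetical marker type)."""


def render(value: str) -> SafeHtml:
    # Already-marked text passes through untouched, so nothing is
    # escaped twice; everything else is escaped.
    if isinstance(value, SafeHtml):
        return value
    return SafeHtml(html.escape(value))


def paragraph(content: str) -> SafeHtml:
    # The template escapes by default and marks its own output as safe.
    return SafeHtml("<p>" + render(content) + "</p>")


from_user = paragraph("<script>alert(1)</script>")
from_sanitizer = paragraph(SafeHtml("<b>hi</b>"))
```

The user-supplied text ends up escaped, while the text tagged as SafeHtml is included as-is, and the only way to get the second behavior is to construct the marker type on purpose.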


Safe in the context of HTML means semantically significant characters are escaped correctly, including <, >, " and '.


Depends on context: for content, only < and & need to be escaped; within a tag (but not an attribute), > needs to be escaped; within an attribute, quotes of the same kind that started the attribute value (if any) must be escaped. Then there are legitimate cases of richly formatted user input/markup where you want to restrict script or block-level elements, or elements that can reach out to a container element such as a paragraph or section. I could go on here, but the point is to use HTML-aware template engines and markup processors, not rely on magic escaping routines.
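To see the context dependence in a small case, here is a sketch with Python's standard html.escape, whose quote flag controls whether quote characters are escaped. The sample text is invented, and this only illustrates the difference between two contexts; it is not a substitute for an HTML-aware template engine.

```python
import html

user_text = 'Tom & "Jerry" <3'

# Element content: quotes can stay as they are.
as_content = html.escape(user_text, quote=False)

# Inside a quoted attribute value: quotes must be escaped as well.
as_attribute = html.escape(user_text, quote=True)

snippet = f'<p title="{as_attribute}">{as_content}</p>'
```

The same input yields two different escaped forms, which is why a single "make it safe" routine cannot be correct for every position in a document.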



