Hacker News

Consider that it's

1. completely backwards-compatible with ASCII

2. you can always tell whether a byte is the start of a character or a continuation of a multibyte sequence, so if a message was cut at some random point you can tell exactly how many bytes to drop before reaching the start of the next character

3. endian-agnostic due to being specified as a byte stream

4. contains no null bytes, so it can fit in any normal C string

Point 1 is absolutely vital for backwards compatibility, and point 2 makes it better than a lot of other multibyte encodings. Consider, for instance, chopping one byte off the start of a UCS-2-encoded string - you'll get complete garbage. And point 3 means you don't get strange endianness-related errors.

It's a robust, resilient encoding that's a drop-in replacement for ASCII and needs no special support from many utilities - tail and head, for instance, can work with UTF-8 text as if it were plain ASCII, just looking for a \n byte and splitting the input into lines.

So yes, I do think it's elegant. It may not be the simplest possible encoding for unicode, but it's extremely practical.




> 4. contains no null bytes, so it can fit in any normal C string

This is false; the encoding of codepoint 0 is 0x00 per the standard. Modified UTF-8[1] makes 0 a special case with an over-long encoding: 0xC0 0x80.

[1]: http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8


Have you ever seen codepoint 0 used for anything other than terminating a string?


Personally, no, but you have to handle 0 correctly (well, mainly, consistently), otherwise you end up with an attack vector: one part of the application stops at the 0 (e.g. a string-equality check), while a later step assumes the earlier check holds but continues past the 0, reading bytes that should have been examined earlier (but weren't).


Exactly, and moving from ASCII to UTF-8 means you get to keep that consistency: 0x00 means 'End of string' in ASCII and it means 'End of string' in UTF-8. No change. Never a miscommunication. No possibility of old software getting confused on this issue. Any code which had its last buffer overrun flushed out in 1983 is still free of buffer overruns in 2013.

And, if you really need to represent codepoint 0 in strings, you can use Java's Modified UTF-8, where codepoint 0 is represented by the byte sequence 0xC0, 0x80. (This isn't valid UTF-8 because in straight UTF-8, every codepoint must be represented by its shortest possible representation.)

http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8


> it means 'End of string' in UTF-8

No, it doesn't, unless you are saying that one should treat it like that. But null termination is as dangerous[1][2] with UTF-8 as it is with ASCII and should be avoided as much as possible anyway. Also, ASCII doesn't mandate that \0 is end-of-string; that's just a convention from C.

(Did you notice that my original comment actually included the exact modified UTF-8 link you provided?)

[1]: http://cwe.mitre.org/data/definitions/170.html

[2]: http://projects.webappsec.org/w/page/13246949/Null%20Byte%20...


Linuxant [1] released drivers that bypass GPLONLY controls like so:

  MODULE_LICENSE("GPL\0for files in the \"GPL\"
  directory; for others, only LICENSE file applies");

I'm not sure whether this counts as something other than terminating a string.

[1] https://en.wikipedia.org/wiki/Loadable_kernel_module#Linuxan...


> This is false; the encoding of codepoint 0 is 0x00 per the standard

Well, yes. That's the point: 0x00 is only ever used to encode codepoint 0. It never shows up anywhere else. That's precisely what the text you replied to means.


But using 0 to terminate strings means you can never have a string that actually contains codepoint 0, as the terminator isn't part of the contents of the string.


That's a problem for those using 0 to terminate a string; if you don't, there's no issue. You can read DMR's historical perspective on choosing 0 as a string terminator here: http://cm.bell-labs.com/who/dmr/chist.html (critique section)

Even the C guys have moved on, though, as Go (co-designed by Ken Thompson) illustrates.


> But using 0 to terminate strings means you can never have a string that actually contains codepoint 0, as the terminator isn't part of the contents of the string.

This is indeed a tradeoff. The only alternatives are either to store the length of the string explicitly somewhere (the Pascal solution is to prefix the string with a fixed-size length field, which can be as awkward as it sounds when you get to really long strings) or to do something nasty with the string's actual contents, such as saying that the last byte of a string has its high bit set.

I think the C solution is the most reasonable when storage is really tight, and more flexible than the Pascal method in general, but I agree that it's potentially dangerous and it's annoying to have a byte which you can never represent in a string.


> more flexible than the Pascal method in general

Why? C strings and Pascal strings can store exactly the same contents, except Pascal strings can store a literal \0, and are faster to manipulate (in many ways), at the cost of (sizeof(word) - 1) extra bytes.


Because you can have extremely long strings without worrying about overflowing a fixed-size length prefix.


A computer has a finite address space (e.g. 2^32 bytes on x86, 2^48 on x86-64) and always has a fixed-size data type large enough to address all of it; an in-memory string cannot usefully be larger than this, so using that finite-size type as the length prefix is perfect and optimal.

C strings are useful in extremely constrained environments, where the extra few bytes of a length prefix versus the trailing \0 byte are too much to pay, but they are essentially just a security risk in any other situation.





