Hacker News

Consider that it's

1. completely backwards-compatible with ASCII

2. you can always tell whether a byte is the start of a character or a continuation of a multibyte sequence, so if a message was cut at some random point you can tell exactly how many bytes to drop before reaching the start of the next character

3. endian-agnostic due to being specified as a byte stream

4. contains no null bytes, so it can fit in any normal C string

Point 1 is absolutely vital for backwards compatibility, and point 2 makes it better than a lot of other multibyte encodings. Consider, for instance, chopping one byte off the start of a UCS-2-encoded string - you'll get complete garbage. And point 3 means you don't get strange endianness-related errors.

It's a robust, resilient encoding that's a drop-in replacement for ASCII and needs no special support from many utilities - tail and head, for instance, can work with UTF-8 text as if it were plain ASCII, just looking for a \n byte and splitting the input into lines.

So yes, I do think it's elegant. It may not be the simplest possible encoding for unicode, but it's extremely practical.




> 4. contains no null bytes, so it can fit in any normal C string

This is false; the encoding of codepoint 0 is 0x00 per the standard. Modified UTF-8[1] makes 0 a special case with an over-long encoding: 0xC0 0x80.

[1]: http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8


Have you ever seen codepoint 0 used for anything other than terminating a string?


Personally, no, but you have to handle 0 correctly (well, mainly, consistently), otherwise you end up with an attack vector: one part of the application stops at the 0 (e.g. a string-equality check), while a later step assumes the earlier check holds but continues past the 0, reading bytes that should have been examined earlier (but weren't).


Exactly, and moving from ASCII to UTF-8 means you get to keep that consistency: 0x00 means 'End of string' in ASCII and it means 'End of string' in UTF-8. No change. Never a miscommunication. No possibility of old software getting confused on this issue. Any code which had its last buffer overrun flushed out in 1983 is still free of buffer overruns in 2013.

And, if you really need to represent codepoint 0 in strings, you can use Java's Modified UTF-8, where codepoint 0 is represented by the byte sequence 0xC0, 0x80. (This isn't valid UTF-8 because in straight UTF-8, every codepoint must be represented by its shortest possible representation.)

http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8


> it means 'End of string' in UTF-8

No, it doesn't, unless you are saying that one should treat it like that. But null termination is as dangerous[1][2] with UTF-8 as it is with ASCII and should be avoided as much as possible anyway. Also, ASCII doesn't mandate that \0 is end-of-string; that's just a convention from C.

(Did you notice that my original comment actually included the exact modified UTF-8 link you provided?)

[1]: http://cwe.mitre.org/data/definitions/170.html

[2]: http://projects.webappsec.org/w/page/13246949/Null%20Byte%20...


Linuxant [1] released drivers that bypass GPLONLY controls like so:

  MODULE_LICENSE("GPL\0for files in the \"GPL\"
  directory; for others, only LICENSE file applies");

I'm not sure whether this counts as something other than terminating a string.

[1] https://en.wikipedia.org/wiki/Loadable_kernel_module#Linuxan...


> This is false; the encoding of codepoint 0 is 0x00 per the standard

Well, yes. That's the point: 0x00 is only ever used to encode codepoint 0. It never shows up anywhere else. That's precisely what the text you replied to means.


But using 0 to terminate strings means you can never have a string that actually contains codepoint 0, as the terminator isn't part of the contents of the string.


That's a problem for those using 0 to terminate a string; if you don't, there's no issue. You can read DMR's historical perspective on choosing 0 as a string terminator here: http://cm.bell-labs.com/who/dmr/chist.html (critique section)

Even the C guys have moved on, though, as Go (co-designed by Ken Thompson) illustrates.


> But using 0 to terminate strings means you can never have a string that actually contains codepoint 0, as the terminator isn't part of the contents of the string.

This is indeed a tradeoff. The only alternatives are either to store the length of the string explicitly somewhere (the Pascal solution is to prefix the string with a fixed-size length field, which can be as awkward as it sounds when you get to really long strings) or to do something nasty with the string's actual contents, such as saying that the last byte of a string has its high bit set.

I think the C solution is the most reasonable when storage is really tight, and more flexible than the Pascal method in general, but I agree that it's potentially dangerous and it's annoying to have a byte which you can never represent in a string.


> more flexible than the Pascal method in general

Why? C strings and Pascal strings can store exactly the same contents, except Pascal strings can store a literal \0, and are faster to manipulate (in many ways), at the cost of (sizeof(word) - 1) extra bytes.


Because you can have extremely long strings without worrying about overflowing a fixed-size length prefix.


A computer has a finite address space (e.g. 2^32 bytes on x86, 2^48 on x86-64) and always has a fixed-size data type large enough to address all of it; an in-memory string cannot usefully be larger than this, so using that finite-size type as the length prefix is perfect and optimal.

C strings are useful in extremely constrained environments, where the extra few bytes of a length prefix versus the trailing \0 byte are too much to pay, but they are essentially just a security risk in any other situation.





