
Seems the article needs some updates?

>> In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

In Wikipedia:

>> UTF-8 encodes each of the 1,112,064[7] code points in the Unicode character set using one to four 8-bit bytes

Reference: http://en.wikipedia.org/wiki/UTF-8




I don't get this, actually.

Say a UTF-8 string is ae 31 c1 12.

Now how do we decide whether it contains the characters "31", "c1", "ae", "12", or the characters "ae 31" and "c1 12", or even "ae", "31 c1" and "12"?

EDIT: Never mind!..found my answer here http://stackoverflow.com/questions/1543613/how-does-utf-8-va...


The tl;dr is that UTF-8 is a prefix code: no valid character's encoding is a prefix of any other's.

http://en.wikipedia.org/wiki/Prefix_code
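
To make that concrete, here's a rough Go sketch over the byte values from the question above; the leading bits of each byte tell you whether it starts a character and, if so, how many bytes follow:

    package main

    import "fmt"

    func main() {
        // The byte sequence from the question above.
        b := []byte{0xae, 0x31, 0xc1, 0x12}

        for _, c := range b {
            switch {
            case c&0x80 == 0x00:
                fmt.Printf("%#02x: 0xxxxxxx - complete 1-byte character\n", c)
            case c&0xe0 == 0xc0:
                fmt.Printf("%#02x: 110xxxxx - lead byte of a 2-byte sequence\n", c)
            case c&0xf0 == 0xe0:
                fmt.Printf("%#02x: 1110xxxx - lead byte of a 3-byte sequence\n", c)
            case c&0xf8 == 0xf0:
                fmt.Printf("%#02x: 11110xxx - lead byte of a 4-byte sequence\n", c)
            default:
                fmt.Printf("%#02x: 10xxxxxx - continuation byte, never starts a character\n", c)
            }
        }
    }

(As it happens, ae is a continuation byte with no lead byte and c1 isn't followed by a continuation byte, so "ae 31 c1 12" isn't valid UTF-8 at all, but these same bit tests are how a decoder finds character boundaries.)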


I think both of those things are true. I'm guessing there are currently only ~1.1M code points defined, and these fit in 4 bytes. However, there could be higher, currently-unallocated code points which would occupy the remaining 2 bytes that can be used with UTF-8.
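
(For what it's worth, the 1,112,064 figure in the Wikipedia quote appears to be the size of the whole Unicode code space minus the UTF-16 surrogates, rather than a count of characters assigned so far:

    17 planes x 65,536 = 1,114,112 code points
    1,114,112 - 2,048 surrogates = 1,112,064

so it reads more like a hard cap than a count of what's defined today.)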


In the past it may have been 6, but now 4 seems to be the correct maximum.

e.g. http://golang.org/pkg/utf8/

>> UTFMax = 4 // maximum number of bytes of a UTF-8 encoded Unicode character.

4 is also the maximum value used by the MySQL server.
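
A quick sketch against Go's standard unicode/utf8 package (which is where the utf8 package linked above lives these days) shows the 1-to-4-byte range directly:

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        fmt.Println(utf8.UTFMax) // prints 4

        // Encoded length for a sample code point from each size class.
        for _, r := range []rune{'A', 'é', '€', '😀'} {
            fmt.Printf("U+%04X encodes to %d byte(s)\n", r, utf8.RuneLen(r))
        }
    }

Output is 1, 2, 3 and 4 bytes respectively; there's no way to produce a 5- or 6-byte sequence through this API.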


Yes, the original UTF-8 spec was much more forgiving. The most recent spec, RFC 3629, restricted the range of code points and made the decoding of invalid sequences a MUST NOT requirement: http://tools.ietf.org/html/rfc3629#section-12
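
One concrete consequence is that modern decoders refuse byte patterns older, laxer decoders often accepted. A small Go sketch, again using the standard unicode/utf8 package:

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        // 0xC1 0x81 is an "overlong" two-byte encoding of 'A' (U+0041).
        // RFC 3629 makes rejecting sequences like this mandatory.
        bad := []byte{0xc1, 0x81}

        fmt.Println(utf8.Valid(bad)) // false

        r, size := utf8.DecodeRune(bad)
        fmt.Println(r == utf8.RuneError, size) // true 1 - rejected, one byte consumed
    }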


It appears you are correct, at least based on some discussion here: http://stackoverflow.com/questions/1543613/how-does-utf-8-va....



