
Seems the article needs some updates?

>> In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

In Wikipedia:

>> UTF-8 encodes each of the 1,112,064[7] code points in the Unicode character set using one to four 8-bit bytes

Reference: http://en.wikipedia.org/wiki/UTF-8




I don't get this, actually.

Say a UTF-8 string is ae 31 c1 12.

Now how do we decide whether it contains the characters "31", "c1", "ae", "12", or the characters "ae 31" and "c1 12", or even "ae", "31 c1" and "12"?

EDIT: Never mind!..found my answer here http://stackoverflow.com/questions/1543613/how-does-utf-8-va...


The tl;dr is that UTF-8 is a prefix code: no valid character's encoding is a prefix of any other's.

http://en.wikipedia.org/wiki/Prefix_code
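
To make that concrete, here's a rough Go sketch over the byte values from the question above; the leading bits of each byte tell you whether it starts a character and, if so, how many bytes follow:

    package main

    import "fmt"

    func main() {
        // The byte sequence from the question above.
        b := []byte{0xae, 0x31, 0xc1, 0x12}

        for _, c := range b {
            switch {
            case c&0x80 == 0x00:
                fmt.Printf("%#02x: 0xxxxxxx - complete 1-byte character\n", c)
            case c&0xe0 == 0xc0:
                fmt.Printf("%#02x: 110xxxxx - lead byte of a 2-byte sequence\n", c)
            case c&0xf0 == 0xe0:
                fmt.Printf("%#02x: 1110xxxx - lead byte of a 3-byte sequence\n", c)
            case c&0xf8 == 0xf0:
                fmt.Printf("%#02x: 11110xxx - lead byte of a 4-byte sequence\n", c)
            default:
                fmt.Printf("%#02x: 10xxxxxx - continuation byte, never starts a character\n", c)
            }
        }
    }

(As it happens, ae is a continuation byte with no lead byte and c1 isn't followed by a continuation byte, so "ae 31 c1 12" isn't valid UTF-8 at all, but these same bit tests are how a decoder finds character boundaries.)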


I think both of those things are true. I'm guessing there are currently only ~1.1M code points defined, and these fit in 4 bytes. However, there could be higher, currently-unallocated code points which would occupy the remaining 2 bytes that can be used with UTF-8.
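
(For what it's worth, the 1,112,064 figure in the Wikipedia quote appears to be the size of the whole Unicode code space minus the UTF-16 surrogates, rather than a count of characters assigned so far:

    17 planes x 65,536 = 1,114,112 code points
    1,114,112 - 2,048 surrogates = 1,112,064

so it reads more like a hard cap than a count of what's defined today.)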


In the past it may have been 6, but now 4 seems to be the correct maximum.

e.g. http://golang.org/pkg/utf8/

>> UTFMax = 4 // maximum number of bytes of a UTF-8 encoded Unicode character.

4 is also the maximum value used by the MySQL server.
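
A quick sketch against Go's standard unicode/utf8 package (which is where the utf8 package linked above lives these days) shows the 1-to-4-byte range directly:

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        fmt.Println(utf8.UTFMax) // prints 4

        // Encoded length for a sample code point from each size class.
        for _, r := range []rune{'A', 'é', '€', '😀'} {
            fmt.Printf("U+%04X encodes to %d byte(s)\n", r, utf8.RuneLen(r))
        }
    }

Output is 1, 2, 3 and 4 bytes respectively; there's no way to produce a 5- or 6-byte sequence through this API.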


Yes, the original UTF-8 spec was much more forgiving. The most recent spec, RFC 3629, restricted the range of code points and made the decoding of invalid sequences a MUST NOT requirement: http://tools.ietf.org/html/rfc3629#section-12
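
One concrete consequence is that modern decoders refuse byte patterns older, laxer decoders often accepted. A small Go sketch, again using the standard unicode/utf8 package:

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        // 0xC1 0x81 is an "overlong" two-byte encoding of 'A' (U+0041).
        // RFC 3629 makes rejecting sequences like this mandatory.
        bad := []byte{0xc1, 0x81}

        fmt.Println(utf8.Valid(bad)) // false

        r, size := utf8.DecodeRune(bad)
        fmt.Println(r == utf8.RuneError, size) // true 1 - rejected, one byte consumed
    }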


It appears you are correct, at least based on some discussion here: http://stackoverflow.com/questions/1543613/how-does-utf-8-va....



