Of course, nobody "refused" to fix a "bug". Instead, a non-conformant behavior w...

ak217 · on Jan 12, 2022

That's a terrible excuse. MySQL should have fixed this in a major release. I had to work with a production system that had all kinds of issues because of this bug (engineers assumed, with good reason, that UTF-8 meant UTF-8, when it did not).

This kind of reasoning is how we end up with vulnerabilities like the recent one in Log4j. Just because a behavior made sense in the past, or an unfortunate bug made it into production, is no excuse to let it inflict damage in perpetuity.

Arch-TK · on Jan 12, 2022

I don't see how introducing a new major release would fix this. People would keep using the old version (because of the breaking change) for a while, and you might even end up in a python2/3 situation.

CRConrad · on Jan 21, 2022

To the extent that it's "a python2/3 situation"... Isn't that exactly what it is now, too, then? If you have a big change to do and postpone it "because of the installed base", then that just makes it worse the longer you wait. (Unless of course you're counting on your user base to shrink over time.)

CountSessine · on Jan 12, 2022

> Three bytes are enough to fit nearly any of the chars in use in any language, including Chinese and Japanese

With only 3 bytes you'll completely miss plane 2, the "Supplementary Ideographic Plane" which includes tons of Chinese-Japanese-Korean Han characters.

I wish people would stop saying the supplementary characters are just for "emoji". Asian unification was very controversial and ultimately unsuccessful. Plane 1 and Plane 2 are important, especially if you're going to sell software or products in China or Japan where they are mandated.
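For anyone who wants to check this, here is a minimal Python sketch showing that anything outside plane 0 needs four bytes in UTF-8. The helper name `plane` and the sample characters are my own picks for illustration, not anything from MySQL.

```python
def plane(ch: str) -> int:
    """Return the Unicode plane (0-16) of a single character."""
    return ord(ch) >> 16


# Illustrative samples: one BMP character and two supplementary ones.
samples = {
    "\u4e2d": "U+4E2D, a common Han character in plane 0 (the BMP)",
    "\U0001D11E": "U+1D11E, musical symbol G clef, plane 1",
    "\U00020000": "U+20000, first CJK Extension B ideograph, plane 2",
}

for ch, note in samples.items():
    encoded = ch.encode("utf-8")
    print(f"plane {plane(ch)}, {len(encoded)} bytes in UTF-8: {note}")
```

A column limited to three bytes per character can hold the first sample but neither of the other two.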

_moof · on Jan 12, 2022

Even that doesn't make any sense, because refusing to encode characters that require four bytes doesn't save any space; it just makes it impossible to encode those characters. Nothing about the other encoding lengths changes.

The only thing I can figure is that something somewhere is using a 16-bit quantity for decoded codepoints. Four-byte encodings are for codepoints above FFFF. (Which I guess is still someone's idea of "saving space.")

Edit: Apparently the max encoding length used to be six bytes, so there's literally no plausible explanation for this that doesn't end with "thank god I stopped having to deal with MySQL over a decade ago."

wruza · on Jan 12, 2022

Then they should have considered implementing utf-8mb3le surrogate pairs. What a missed opportunity!

butlerm · on Jan 13, 2022

> Three bytes are enough to fit nearly any of the chars in use in any language

This is not a fixed length encoding scheme, it is a subset of UTF-8, which is a variable length encoding designed so that the most common characters only take one byte of storage. As a consequence you do not get 2^24 possible characters in three bytes, you get much less than that (less than 2^16 actually), and the benefit is compatibility and compression.

To represent the full range, UTF-8 requires up to four bytes, even though 21 bits or three bytes would do for a fixed length encoding. UTF-8 is far more efficient in storage and transfer bandwidth than a fixed length encoding would be without further compression, to the point where using a fixed UCS-4/UTF-32 style encoding is an effective way to nearly quadruple the memory requirements of a large class of programs.
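A rough way to see the size difference, as a sketch only: the helper `encoded_sizes` and both sample strings are made up for illustration, and the actual ratio depends entirely on the text being stored.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-32 size) in bytes, both without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))


# Pure ASCII: every character is 1 byte in UTF-8 and 4 bytes in UTF-32.
ascii_text = "SELECT name FROM users"
# Mixed text: accented Latin letters take 2 bytes, Han characters take 3.
mixed_text = "na\u00efve caf\u00e9 \u4e2d\u6587"

for label, text in (("ascii", ascii_text), ("mixed", mixed_text)):
    utf8_size, utf32_size = encoded_sizes(text)
    print(f"{label}: {utf8_size} bytes as UTF-8, {utf32_size} bytes as UTF-32")
```

For ASCII-heavy data the fixed-width form is exactly four times larger; the gap narrows as more multi-byte characters appear.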

cblconfederate · on Jan 12, 2022

Couldn't emoji fit in 3 bytes as well? I don't think there are that many ...

rakoo · on Jan 12, 2022

UTF-8 is an encoding for Unicode codepoints. Those codepoints are spread on a space that is extremely vast (up to ~4 billion), which can be represented with up to 4 bytes. It turns out emojis are positioned in a place where the first byte will never be 0, so even if there were only one emoji, it would still require the full 4 bytes to encode.

the_mitsuhiko · on Jan 12, 2022

> Those codepoints are spread on a space that is extremely vast (up to ~4 billion)

You are off by a lot. The maximum code point is 0x10FFFF, which fits in 21 bits. The space is only about 1.1 million code points.
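The arithmetic is easy to verify. A quick sketch in Python, where the constant names are my own:

```python
MAX_CODE_POINT = 0x10FFFF          # highest code point Unicode allows
TOTAL_CODE_POINTS = MAX_CODE_POINT + 1

print(TOTAL_CODE_POINTS)                 # size of the code space
print(MAX_CODE_POINT.bit_length())       # bits needed for the largest value
print(TOTAL_CODE_POINTS // 0x10000)      # number of 65536-entry planes
print(2**32 // TOTAL_CODE_POINTS)        # how many times it fits into 32 bits
```

So the code space is 17 planes of 65,536 code points each, several thousand times smaller than a full 32-bit range.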

ninkendo · on Jan 12, 2022

Related: time to explain UTF-8! (In case anybody is curious. I personally think it's extremely clever and worth understanding:)

Characters <128 are encoded with a single byte: 0xxxxxxx

Characters >=128 are encoded with multiple bytes.

A two-byte character looks like:

110xxxxx 10xxxxxx (11 useful bits, representing code points 128-2047)

A three-byte character looks like:

1110xxxx 10xxxxxx 10xxxxxx (16 useful bits, representing code points 2048-65535)

A four-byte character looks like:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 useful bits, representing code points 65536-2097151)
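The bit patterns above translate almost directly into code. This is a hand-rolled sketch for illustration only (the function name `utf8_encode` is made up, and in practice you would just call `str.encode("utf-8")`). It assumes the modern limits: code points above U+10FFFF and the surrogate range U+D800 to U+DFFF are rejected.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point using the UTF-8 bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Cross-check against Python's built-in encoder on one sample per length.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
```

Each continuation byte carries six payload bits, which is why the shifts step by 6 and the mask is always `0x3F`.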

Now, technically this scheme could expand to 6-byte characters without getting confused with things like the BOM, however any code points above U+10FFFF wouldn't be representable in UTF-16, which has its own set of constraints. This means the Unicode Consortium has limited itself to about 1.1 million possible code points, which is why UTF-8 doesn't need to go beyond 4 bytes. (I wonder if a future unicode version will require a larger limit and would thus create a new "utf8mb6" scheme, and drop UTF-16 altogether?)

tialaramex · on Jan 13, 2022

Unicode specifically limited itself to the range zero to U+10FFFF

Obviously nothing in the laws of nature forbids "a future Unicode version" from disavowing this limit, but we could say the same for whether "a future United States of America" could disavow the status of independent Indian Tribes it has previously recognised.

cesarb · on Jan 13, 2022

> (I wonder if a future unicode version will require a larger limit and would thus create a new "utf8mb6" scheme, and drop UTF-16 altogether?)

On a thread a couple of years ago (https://news.ycombinator.com/item?id=20600873) it was mentioned that the UTF-8 encoding scheme can be cleanly extended to 36 bits, so even "utf8mb7" would be a possibility.
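The bit counts can be worked out from the byte patterns. A small sketch, with the caveat that only lengths 1 to 4 are valid UTF-8 today (RFC 3629); lengths 5 and 6 come from the older definition, and the 7-byte row is purely hypothetical, assuming a lead byte of 0xFE that carries no payload bits. The function name `payload_bits` is made up.

```python
def payload_bits(n_bytes: int) -> int:
    """Payload bits in an n-byte sequence of the (extended) UTF-8 pattern."""
    if not 1 <= n_bytes <= 7:
        raise ValueError("this sketch only covers 1 to 7 bytes")
    if n_bytes == 1:
        return 7                      # 0xxxxxxx
    # The lead byte spends n_bytes one-bits plus a zero bit on the length
    # marker, leaving 7 - n_bytes payload bits (zero for the 7-byte form).
    lead_bits = 7 - n_bytes
    # Every continuation byte (10xxxxxx) carries 6 payload bits.
    return lead_bits + 6 * (n_bytes - 1)


for n in range(1, 8):
    bits = payload_bits(n)
    print(f"{n} byte(s): {bits} bits, max value {(1 << bits) - 1:#x}")
```

The 6-byte row gives 31 bits, matching the old U+7FFFFFFF ceiling, and the hypothetical 7-byte row gives the 36 bits mentioned above.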

capitainenemo · on Jan 12, 2022

It could be he's thinking of the historical definition, which allowed sequences of up to 6 bytes with a maximum codepoint of U+7FFFFFFF, or ~2 billion.

https://en.wikipedia.org/wiki/UTF-8#History

That was restricted, I believe, primarily for compatibility with more limited encodings like UTF-16.

I guess it's possible that at some future point in human history when UTF-16 has been purged from memory, the 5- and 6-byte sequences might be allowed again. :)

_moof · on Jan 12, 2022

Almost all emoji are four bytes in UTF-8.
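A quick illustration of the "almost": a few older symbols that render as emoji live in the BMP and take three bytes, while the main emoji blocks sit in plane 1 and take four. The sample characters are my own picks and `utf8_len` is a made-up helper.

```python
import unicodedata


def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Two BMP symbols often shown as emoji, then two from plane 1.
samples = ["\u263A", "\u2764", "\U0001F600", "\U0001F4A9"]

for ch in samples:
    name = unicodedata.name(ch)
    print(f"U+{ord(ch):04X} {name}: {utf8_len(ch)} bytes")
```

Only the first two would survive a three-byte limit.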

cblconfederate · on Jan 12, 2022

Were all the other positions taken?