Of course, nobody "refused" to fix a "bug". Instead, a non-conformant behavior was already relied upon by legacy systems out in the wild and the "fix" was added in a backwards-compatible way.
Edit: Three bytes are enough to fit nearly any of the chars in use in any language, including Chinese and Japanese, so I can only assume someone "smart" in the MySQL dev team decided to "save space" (before emoji were a thing).
That's a terrible excuse. MySQL should have fixed this in a major release. I had to work with a production system that had all kinds of issues because of this bug (engineers assumed, with good reason, that UTF-8 meant UTF-8, when it did not).
This kind of reasoning is how we end up with vulnerabilities like the recent one in Log4j. Just because a behavior made sense in the past, or an unfortunate bug made it into production, is no excuse to let it inflict damage in perpetuity.
I don't see how introducing a new major release would fix this. People would still use the old version for a while (because of the breaking change), and you might even end up in a Python 2/3 situation.
To the extent that it's "a python2/3 situation"... Isn't that exactly what it is now, too, then? If you have a big change to do and postpone it "because of the installed base", then that just makes it worse the longer you wait. (Unless of course you're counting on your user base to shrink over time.)
> Three bytes are enough to fit nearly any of the chars in use in any language, including Chinese and Japanese
With only 3 bytes you'll completely miss plane 2, the "Supplementary Ideographic Plane" which includes tons of Chinese-Japanese-Korean Han characters.
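To make that concrete, here is a minimal Python sketch that checks how many UTF-8 bytes a few characters need. The sample characters are my own picks for illustration, not anything specific to MySQL.

```python
# Sample characters chosen for illustration (an assumption, not from the thread).
samples = {
    "Latin letter 'a'": "\u0061",
    "CJK ideograph in the BMP": "\u4e2d",
    "CJK ideograph in Plane 2": "\U00020000",
    "emoji in Plane 1": "\U0001F600",
}


def utf8_len(ch: str) -> int:
    """Number of bytes the single character `ch` occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def plane(ch: str) -> int:
    """Unicode plane number: the code point divided by 0x10000."""
    return ord(ch) >> 16


for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: plane {plane(ch)}, {utf8_len(ch)} byte(s)")
```

Anything in plane 1 or above comes out at four bytes, so a three-byte cap cuts off Plane 2 ideographs just as it cuts off emoji.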
I wish people would stop saying the supplementary characters are just for "emoji". Han unification was very controversial and ultimately unsuccessful. Plane 1 and Plane 2 are important, especially if you're going to sell software or products in China or Japan, where they are mandated.
Even that doesn't make any sense, because refusing to encode characters that require four bytes doesn't save any space; it just makes it impossible to encode those characters. Nothing about the other encoding lengths changes.
The only thing I can figure is that something somewhere is using a 16-bit quantity for decoded codepoints. Four-byte encodings are for codepoints above FFFF. (Which I guess is still someone's idea of "saving space.")
Edit: Apparently the max encoding length used to be six bytes, so there's literally no plausible explanation for this that doesn't end with "thank god I stopped having to deal with MySQL over a decade ago."
> Three bytes are enough to fit nearly any of the chars in use in any language
This is not a fixed-length encoding scheme; it is a subset of UTF-8, which is a variable-length encoding designed so that the most common characters take only one byte of storage. As a consequence, you do not get 2^24 possible characters in three bytes; you get far fewer than that (fewer than 2^16, actually), and the benefit is compatibility and compression.
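A rough sketch of that arithmetic in Python, assuming the standard UTF-8 bit layout (the names `PAYLOAD_BITS` and `max_code_point` are just for this example):

```python
# Payload bits left over after the UTF-8 marker bits, per sequence length:
#   1 byte : 0xxxxxxx                             ->  7 bits
#   2 bytes: 110xxxxx 10xxxxxx                    -> 11 bits
#   3 bytes: 1110xxxx 10xxxxxx 10xxxxxx           -> 16 bits
#   4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  -> 21 bits
PAYLOAD_BITS = {1: 7, 2: 11, 3: 16, 4: 21}


def max_code_point(n_bytes: int) -> int:
    """Largest code point that fits in a UTF-8 sequence of n_bytes."""
    return (1 << PAYLOAD_BITS[n_bytes]) - 1


# Surrogates U+D800..U+DFFF are not valid scalar values, so they don't count.
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1

# Code points encodable in at most three bytes.
three_byte_capacity = max_code_point(3) + 1 - SURROGATE_COUNT

print(hex(max_code_point(3)))   # 0xffff
print(three_byte_capacity)      # 63488
print(2 ** 24)                  # 16777216, what a fixed 3-byte field could hold
```

So a three-byte limit stops exactly at U+FFFF, the end of the Basic Multilingual Plane.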
To represent the full range, UTF-8 requires up to four bytes, even though 21 bits or three bytes would do for a fixed length encoding. UTF-8 is far more efficient in storage and transfer bandwidth than a fixed length encoding would be without further compression, to the point where using a fixed UCS-4/UTF-32 style encoding is an effective way to nearly quadruple the memory requirements of a large class of programs.
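For a feel of the size difference, a small Python sketch comparing the two encodings on a couple of made-up strings (the sample text is an assumption, picked only for illustration):

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-32 size) in bytes for `text`.

    "utf-32-le" is used so that no byte order mark is added to the count.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))


ascii_text = "SELECT name FROM users WHERE id = 42;"
mixed_text = "name = \u4e2d\u6587"  # ASCII plus two BMP ideographs

for sample in (ascii_text, mixed_text):
    utf8_size, utf32_size = encoded_sizes(sample)
    print(f"{sample!r}: utf-8={utf8_size} utf-32={utf32_size} "
          f"ratio={utf32_size / utf8_size:.2f}")
```

Pure ASCII hits the full 4x; text with more non-ASCII characters lands somewhere below that.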
UTF-8 is an encoding for Unicode code points. Those code points are spread over an extremely vast space (up to ~4 billion), which can be represented with up to 4 bytes. It turns out emoji are positioned in a place where the first byte will never be 0, so even if there were only one emoji, it would require the full 4 bytes to encode.
Now, technically this scheme could expand to 6-byte characters without getting confused with things like the BOM. However, any code points larger than 2^21 wouldn't be representable in UTF-16, which has its own set of constraints. This means the Unicode Consortium has basically limited itself to two million or so possible code points, which is why UTF-8 doesn't need to go beyond 4 bytes. (I wonder if a future Unicode version will require a larger limit and would thus create a new "utf8mb6" scheme, and drop UTF-16 altogether?)
Unicode specifically limited itself to the range U+0000 to U+10FFFF.
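That ceiling is exactly what UTF-16 surrogate pairs can reach. A minimal sketch of the arithmetic in Python (the helper name `decode_surrogate_pair` is made up for this example):

```python
HIGH_SURROGATES = range(0xD800, 0xDC00)  # 1024 possible lead units
LOW_SURROGATES = range(0xDC00, 0xE000)   # 1024 possible trail units


def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single code point."""
    if high not in HIGH_SURROGATES or low not in LOW_SURROGATES:
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest pair gives the largest code point UTF-16 can express.
max_utf16_code_point = decode_surrogate_pair(0xDBFF, 0xDFFF)
print(hex(max_utf16_code_point))  # 0x10ffff

# Supplementary code points reachable through pairs: 1024 * 1024 = 2^20.
print(len(HIGH_SURROGATES) * len(LOW_SURROGATES))  # 1048576
```

Python's own `chr()` enforces the same limit and raises `ValueError` for anything above `0x10FFFF`.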
Obviously nothing in the laws of nature forbids "a future Unicode version" from disavowing this limit, but we could say the same for whether "a future United States of America" could disavow the status of independent Indian Tribes it has previously recognised.
> (I wonder if a future Unicode version will require a larger limit and would thus create a new "utf8mb6" scheme, and drop UTF-16 altogether?)
On a thread a couple of years ago (https://news.ycombinator.com/item?id=20600873) it was mentioned that the UTF-8 encoding scheme can be cleanly extended to 36 bits, so even "utf8mb7" would be a possibility.
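The 36-bit figure falls out of the lead-byte pattern if you keep extending it. A sketch in Python, with the caveat that only lengths 1 to 4 are valid UTF-8 today; 5 and 6 were in the older definition, and the 7-byte form is purely hypothetical:

```python
def payload_bits(n_bytes: int) -> int:
    """Payload bits in an n-byte sequence using the UTF-8 lead-byte pattern.

    A multi-byte lead byte is n one-bits followed by a zero bit, which leaves
    7 - n payload bits; each continuation byte (10xxxxxx) carries 6 more.
    """
    if n_bytes == 1:
        return 7  # 0xxxxxxx
    if not 2 <= n_bytes <= 7:
        raise ValueError("the lead-byte pattern only covers 1 to 7 bytes")
    return (7 - n_bytes) + 6 * (n_bytes - 1)


for n in range(1, 8):
    print(n, payload_bits(n))
```

At seven bytes the lead byte is all marker bits (0xFE), and the six continuation bytes carry 6 * 6 = 36 bits.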
It could be he's thinking of the historical definition, which allowed sequences of up to 6 bytes with a maximum code point of U+7FFFFFFF, or ~2 billion.
That was restricted, I believe, primarily for compatibility with more limited encodings like UTF-16.
I guess it's possible that at some future point in human history, when UTF-16 has been purged from memory, the longer encodings might be allowed again. :)