Your explanation makes it sound like an incredibly stupid decision. I imagine what you're getting at is that 3 bytes were/are sufficient for the basic multilingual plane, which is incidentally also what can be represented in a single utf-16 byte pair. So they imposed the same limitation as utf-16 had on utf-8. This would have seemed logical in a world where utf-16 was the default and utf-8 was some annoying exception they had to get out of the way.
OK, but that makes perfect sense given utf-16 was actually quite widespread in 2003! For example, Windows APIs, MS SQL Server, JavaScript (off the top of my head)... these all still primarily use utf-16 today even. And MySQL also supports utf-16 among many other charsets.
There wasn't a clear winner in utf-8 at the time, especially given its 6-byte-max representation back then. Memory and storage were a lot more limited.
And yes while 6 bytes was the maximum, a bunch of critical paths (e.g. sorting logic) in old MySQL required allocating a worst-case buffer size, so this would have been prohibitively expensive.
This still makes no sense. The UTF-8 standard was adopted really in 1998-ish and the standard was already variable using 1 to 4 bytes. MySQL 4.1, which introduced the utf8 charset, was released in 2004.
Even if there were no codepoints in the 4-byte range yet, they could and should have implemented it anyway. It literally does not take any more storage because it is a variable width encoding.
> The UTF-8 standard was adopted really in 1998-ish and the standard was already variable using 1 to 4 bytes.
No, it was 1 to 6 bytes until RFC 3629 (Nov 2003). AFAIK development of MySQL 4.1 began prior to that, despite the release not happening until afterwards.
Again, they absolutely should have addressed it sooner. But people make mistakes, especially as we're talking about a venture-funded startup in the years right after the dot-com crash.
> It literally does not take any more storage because it is a variable width encoding.
I already addressed that in my previous comment: in old versions of MySQL, a number of critical code paths required allocating worst-case buffer sizes, or accounting for worst-case value lengths in indexes, etc. So if a charset allows 6 bytes per character, that means multiplying max length by 6, in order to handle the pathological case.