"Why didn't we just use their FSS/UTF? As I remember, it was because in that first phone call I sang out a list of desiderata for any such encoding, and FSS/UTF was lacking at least one - the ability to synchronize a byte stream picked up mid-run, with less than one character being consumed before synchronization."
I'm of two minds about this. On the one hand, such an ability is pretty useless in modern systems filled with checksums at every stage, and reduces the bit efficiency of UTF-8. On the other hand, this was the only valid-y reason at the time to rewrite a UTF implementation from scratch.
If not for that requirement, we would have just had UTF-8 implemented as regular VLQs.
Edit: Actually, now that I think about it, VLQ already does satisfy the synchronization requirement. Just scan for the next cleared high bit. At most 1 character consumed, and far less bit wastage.
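For reference, a sketch of the MIDI-style VLQ this is referring to (continuation flag in the high bit, terminal byte has it clear); the function names here are mine, not from any library:

```python
def vlq_encode(cp: int) -> bytes:
    # MIDI-style VLQ: 7 payload bits per byte; the high bit is set
    # on every byte except the last one of the sequence.
    out = [cp & 0x7F]
    cp >>= 7
    while cp:
        out.append(0x80 | (cp & 0x7F))
        cp >>= 7
    return bytes(reversed(out))

def vlq_resync(stream: bytes) -> int:
    # Picking up mid-run: skip to just past the next terminal byte
    # (high bit clear). At most one character is consumed.
    for i, b in enumerate(stream):
        if b < 0x80:
            return i + 1
    return len(stream)

# U+20AC (the euro sign) encodes in two bytes: 0xC1 0x2C.
assert vlq_encode(0x20AC) == bytes([0xC1, 0x2C])
# A stream picked up at its second byte resyncs after one byte.
assert vlq_resync(bytes([0x2C, 0x41])) == 1
```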
> Actually, now that I think about it, VLQ already does satisfy the synchronization requirement. Just scan for the next cleared high bit. At most 1 character consumed, and far less bit wastage.
There are other useful properties of UTF-8 encoding. For one, it is easy to identify invalid UTF-8 sequences, and valid UTF-8 sequences are unlikely to appear in written text encoded as e.g. ISO 8859-1. This must have been more useful in the past when fixed 8-bit encodings were more common. With a simple VLQ your guess whether a stream complies with your encoding won't be nearly as informed, and the means to provide a fallback is diminished. Allowing graceful transition from ASCII and extended ASCII encodings was a very important property for the sake of adoption.
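That detection heuristic can be sketched with a simple decode attempt (the `looks_like_utf8` name is mine):

```python
def looks_like_utf8(data: bytes) -> bool:
    # Valid UTF-8 has strict structure (a lead byte followed by the
    # right number of 10xxxxxx continuation bytes), so a decode
    # attempt is a strong heuristic; Latin-1 text with accented
    # letters almost never passes it by accident.
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

assert looks_like_utf8('héllo'.encode('utf-8'))
# In Latin-1, 'é' is the lone byte 0xE9, which UTF-8 rejects
# because it is not followed by continuation bytes.
assert not looks_like_utf8('héllo'.encode('latin-1'))
```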
It's also useful that any non-ASCII sequence has the msbit set through the whole sequence. You can easily discard all sequences that cannot be displayed as ASCII by throwing away bytes >= 0x80, or, for example, replace every contiguous run of bytes >= 0x80 with a question mark in an ASCII-only display system and still display all ASCII-compatible sequences perfectly.
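A minimal sketch of that second trick (the `ascii_display` helper is hypothetical, not from any particular library):

```python
import re

def ascii_display(data: bytes) -> str:
    # Replace each contiguous run of bytes >= 0x80 (in UTF-8,
    # that's every multibyte sequence) with a single question mark;
    # everything below 0x80 is plain ASCII and passes through.
    return re.sub(rb'[\x80-\xff]+', b'?', data).decode('ascii')

assert ascii_display('héllo'.encode('utf-8')) == 'h?llo'
assert ascii_display('日本'.encode('utf-8')) == '?'
```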
EDIT: It's also very much not the case that the days of starting read mid-stream are over.
The ability to pick up byte streams mid-run is not useless. It's required for example to jump to arbitrary locations in a file and make sense of the data you find there.
Imagine a text editor displaying a large CSV file. Wouldn't you mind having to read everything between two locations if you jump forward from one to the other? Or read everything from the start if you jump backwards? Even if the text editor stores its own synchronization points, it has to read the file completely at least once, which can be annoying for very large files.
Also, many of the simpler text tools that only look for ASCII bytes wouldn't work. For example, printing the last lines.
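A sketch of how such a tool resynchronizes after seeking to an arbitrary byte offset: UTF-8 continuation bytes all match 0b10xxxxxx, so you just skip them (function name is mine):

```python
def utf8_resync(buf: bytes, pos: int) -> int:
    # After seeking to an arbitrary offset, skip forward past
    # continuation bytes (0b10xxxxxx) to the start of the next
    # character. In valid UTF-8 at most 3 bytes are skipped.
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

# '€' is 3 bytes (E2 82 AC); landing inside it resyncs to 'a'.
assert utf8_resync('€abc'.encode('utf-8'), 1) == 3
# Landing on an ASCII byte requires no skipping at all.
assert utf8_resync(b'abc', 0) == 0
```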
Also, I don't know about that other system, but what about robustness if there's one bad byte somewhere in the file?
> If not for that requirement, we would have just had UTF-8 implemented as regular VLQs.
There is a requirement that 7-bit ASCII character codes would not appear as a part of non-ASCII character encoding, so NULL-terminated strings, slash path separation and other similar issues could be handled by existing charset-independent code. That would not work with regular VLQs.
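This property is easy to demonstrate concretely; the `vlq_encode` sketch below is a plain MIDI-style VLQ (names mine), not any shipped encoding:

```python
def vlq_encode(cp: int) -> bytes:
    # MIDI-style VLQ: 7 payload bits per byte, high bit marks
    # continuation bytes.
    out = [cp & 0x7F]
    cp >>= 7
    while cp:
        out.append(0x80 | (cp & 0x7F))
        cp >>= 7
    return bytes(reversed(out))

# Every byte of a UTF-8 multibyte sequence is >= 0x80, so ASCII
# bytes like NUL or the path separator '/' (0x2F) can never
# appear inside one.
assert all(b >= 0x80 for b in '€'.encode('utf-8'))

# A plain VLQ has no such guarantee: U+17AF happens to end in the
# 7-bit group 0x2F, so its encoding contains a literal '/' byte,
# which would confuse charset-independent path-handling code.
assert b'/' in vlq_encode(0x17AF)
```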
... when it could have been done much more simply by putting the continuation bit in position 6 and keeping bit 7 set across the entire multibyte sequence:
> 4) The first byte should indicate the number of bytes to follow in a multibyte sequence.
This is actually a pretty smart requirement for efficiency.
This way the parser can know if the input bytes it has constitute a UTF-8 sequence by only looking at the first byte. That saves a lot of unnecessary processing.
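A sketch of that first-byte dispatch (the count of leading one bits gives the sequence length; function name is mine):

```python
def utf8_seq_len(first: int) -> int:
    # The number of leading one bits in the first byte gives the
    # sequence length; zero leading ones means plain ASCII.
    if first < 0x80:
        return 1
    if first < 0xC0:
        raise ValueError("continuation byte, not a sequence start")
    if first < 0xE0:
        return 2
    if first < 0xF0:
        return 3
    return 4

assert utf8_seq_len(ord('a')) == 1
assert utf8_seq_len('é'.encode('utf-8')[0]) == 2
assert utf8_seq_len('€'.encode('utf-8')[0]) == 3
```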
> if the next n bytes constitute an incomplete, but so far valid, multibyte character. Nothing is written to
The "so far valid" forces implementations to still look at the other bytes, which undoes the optimization in UTF-8.
I suspect somebody was asleep at the wheel there.
Regarding your question about the wasted bits: I don't think it matters much. It certainly does not for English text like we exchange here, where the important case is the 0-0x7f case which UTF-8 handles optimally.
Your maximum-compactness variant means that if you start in the middle of a sequence, you can't tell that you're in the middle rather than at the start. UTF-8 does this well, in my opinion: you can seek anywhere in a file and then move forwards or backwards to the beginning of a sequence without fear of misparsing the middle of a correct sequence as a different sequence.
Yes, the ultra-compact encoding will not be self-synchronizing. But the bit-6-continuation variant yields more payload bits per byte, which would give better compression in many languages. Regarding efficiency, you still have to read every byte either way, the difference being one check per character (every 1, 2, 3, or 4 bytes) in a more complex algorithm vs. a check on every byte in a simpler algorithm (I haven't checked which beats the other in performance, but they look pretty similar).
This feels a lot like mixing transport layer metadata into the data format, potentially giving a small processing performance benefit at the cost of huge data wastage when certain languages are encoded.
> Is there a language that consistently uses codepoints with more than 2 bytes?
There are definitely (small) communities using scripts that lie entirely in the SMP. For example, Mru, Adlam, Takri, Pracalit, Miao, Wancho, etc. Most of these are either historic scripts that have mostly been supplanted by unified ones (esp. Devanagari) but retain usage in some areas, or languages that did not have a pre-colonial writing system that are attempting to reclaim cultural identity with a new script.
But yes, I don't think there are major communities that consistently do so. My anecdata from a few Mandarin- and Japanese-speaking friends is that SIP characters rarely occur.
Really if anything, emoji obsessives, mathematicians using bold/fraktur characters, and historical linguists/anthropologists would have the biggest savings.
Yeah, it's fine that the encoding is not infinitely extendable. But if it didn't encode the length into the first byte, you'd have 1 extra bit for 2-byte sequences, 2 extra bits for 3-byte sequences, etc. That means that you can double the number of possible glyphs per 2-byte sequence (for an extra 2048 glyphs in the 2-byte range). For 3 bytes, it's 2 bits for an extra almost 200,000 glyphs before having to jump to 4 bytes. This would be a huge boon to East Asian languages like Chinese and Japanese.
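The arithmetic checks out; as a sketch, assuming the variant carries 6 payload bits in each byte after the first (one extra bit per trailing byte vs. UTF-8's length-prefixed form):

```python
# UTF-8's 2-byte form (110xxxxx 10xxxxxx) carries 11 payload bits;
# dropping the length prefix frees one more bit.
utf8_2byte    = 1 << 11            # 2048 codepoints
variant_2byte = 1 << 12            # 4096: the extra bit doubles it
assert variant_2byte - utf8_2byte == 2048

# The 3-byte form (1110xxxx 10xxxxxx 10xxxxxx) carries 16 bits;
# the variant frees two more.
utf8_3byte    = 1 << 16            # 65536 codepoints
variant_3byte = 1 << 18            # 262144
assert variant_3byte - utf8_3byte == 196608   # "almost 200,000"
```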
If there is no length encoding, start, middle, end and ASCII byte encodings must not overlap to support correct decoding of any subslice of a document.
If there is a length encoding on the start, then start, middle and ASCII must not overlap.
Because then, when you jump into the middle of a stream and see a 11xxxxxx byte, you don't know whether it's the beginning of a valid multibyte character that you have to keep, or part of a multibyte sequence that you have to discard.
Same with the second encoding: if it so happens that one 1xxxxxxx byte takes the value 11010101, how would you know whether it's a multibyte start or a continuation?
Basically, in the current solution, if you read the head of a multibyte character you know it's a valid head; if you read the head of a multibyte character in the proposed encoding, you can't know whether it's valid.
Yes, the second one is not self-synchronizing, so it's out. However, I don't see much utility in getting first-character detection from the data format itself. You're very unlikely to miss the beginning of a stream of characters in any modern system, and in the event that your medium has no error detection, it would be trivial to add a zero synchronization byte to the beginning of the field.
My point is that embedding transport level metadata into the data format seems like a poor tradeoff because of the sheer inflation potential of the data encoding (potentially 10%), when a single guard byte per field would solve the problem of first character truncation detection.
> I don't see much utility in getting first-character detection ..
> .. in the event that your medium has no error detection ... add a zero synchronization byte
How would such encoding deal with non-utf8-safe editors, copy-pasting, programs truncating, then inserting previously broken sequences, etc?
Encoding obviously can't fix all errors, but it is quite useful if broken sequences are obviously broken and non-broken sequences remain valid when handling text in non-aware/non-safe applications.
I think in UTF8 two splices can generate a random character, but in a characters + splice combination, the character remains recognizable in any order and combination and a lone splice is also recognizable as an error.
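The UTF-8 half of that claim is easy to demonstrate: a truncated lead byte spliced onto a stray continuation byte can decode as a perfectly valid, unrelated character, while a lone continuation byte is always detectably broken.

```python
# Splice the first byte of 'é' (C3 A9) onto the second byte of
# '€' (E2 82 AC): the result C3 82 is valid UTF-8 for 'Â'.
lead = 'é'.encode('utf-8')[:1]       # b'\xc3'
tail = '€'.encode('utf-8')[1:2]      # b'\x82', a continuation byte
assert (lead + tail).decode('utf-8') == '\u00c2'   # decodes cleanly

# A lone continuation byte, by contrast, always raises an error.
try:
    tail.decode('utf-8')
    broken_detected = False
except UnicodeDecodeError:
    broken_detected = True
assert broken_detected
```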
There are now encodings even more efficient than VLQ, but they blur the line between encodings and compression algorithms. Most propose to trade the efficiency of encoding 7-bit chars for the ability to squeeze a few thousand common Chinese characters into 16 bits.
The idea I heard was to always encode in 4-byte blocks and use some form of delta encoding. Some variations allow for better than O(log N) character-position search. And given that you can feed 32-bit-wide data into NEON/SSE, and blocks are always 32-bit aligned, you can have that working faster than UTF-8.
> If not for that requirement, we would have just had UTF-8 implemented as regular VLQs.
While not technically equivalent, a synchronized byte stream allows backward scanning, which is beneficial for many Unicode-related algorithms. In fact, synchronization is the simplest way to do that.
Provided that you know where the string boundary is, so you don't scan off the front into differently encoded data. So you can't scan backward through, say, some miscellaneous binary-encoded number preceding the string.
Ah, good point. I keep confusing which one is which. I was specifically thinking of UTF-1, which is definitely not synchronized nor backward-scannable.