"Why didn't we just use their FSS/UTF? As I remember, it was because in that first phone call I sang out a list of desiderata for any such encoding, and FSS/UTF was lacking at least one - the ability to synchronize a byte stream picked up mid-run, with less than one character being consumed before synchronization."
I'm of two minds about this. On the one hand, such an ability is pretty useless in modern systems filled with checksums at every stage, and reduces the bit efficiency of UTF-8. On the other hand, this was the only valid-y reason at the time to rewrite a UTF implementation from scratch.
If not for that requirement, we would have just had UTF-8 implemented as regular VLQs.
Edit: Actually, now that I think about it, VLQ already does satisfy the synchronization requirement. Just scan for the next cleared high bit. At most 1 character consumed, and far less bit wastage.
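For reference, a sketch of the MIDI-style VLQ this is referring to (continuation flag in the high bit, terminal byte has it clear); the function names here are mine, not from any library:

```python
def vlq_encode(cp: int) -> bytes:
    # MIDI-style VLQ: 7 payload bits per byte; the high bit is set
    # on every byte except the last one of the sequence.
    out = [cp & 0x7F]
    cp >>= 7
    while cp:
        out.append(0x80 | (cp & 0x7F))
        cp >>= 7
    return bytes(reversed(out))

def vlq_resync(stream: bytes) -> int:
    # Picking up mid-run: skip to just past the next terminal byte
    # (high bit clear). At most one character is consumed.
    for i, b in enumerate(stream):
        if b < 0x80:
            return i + 1
    return len(stream)

# U+20AC (the euro sign) encodes in two bytes: 0xC1 0x2C.
assert vlq_encode(0x20AC) == bytes([0xC1, 0x2C])
# A stream picked up at its second byte resyncs after one byte.
assert vlq_resync(bytes([0x2C, 0x41])) == 1
```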
> Actually, now that I think about it, VLQ already does satisfy the synchronization requirement. Just scan for the next cleared high bit. At most 1 character consumed, and far less bit wastage.
There are other useful properties of UTF-8 encoding. For one, it is easy to identify invalid UTF-8 sequences, and valid UTF-8 sequences are unlikely to appear in written text encoded as e.g. ISO 8859-1. This must have been more useful in the past when fixed 8-bit encodings were more common. With a simple VLQ your guess whether a stream complies with your encoding won't be nearly as informed, and the means to provide a fallback is diminished. Allowing graceful transition from ASCII and extended ASCII encodings was a very important property for the sake of adoption.
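That detection heuristic can be sketched with a simple decode attempt (the `looks_like_utf8` name is mine):

```python
def looks_like_utf8(data: bytes) -> bool:
    # Valid UTF-8 has strict structure (a lead byte followed by the
    # right number of 10xxxxxx continuation bytes), so a decode
    # attempt is a strong heuristic; Latin-1 text with accented
    # letters almost never passes it by accident.
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

assert looks_like_utf8('héllo'.encode('utf-8'))
# In Latin-1, 'é' is the lone byte 0xE9, which UTF-8 rejects
# because it is not followed by continuation bytes.
assert not looks_like_utf8('héllo'.encode('latin-1'))
```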
It's also useful that any non-ASCII sequence has the msbit set through the whole sequence. You can easily discard all sequences that cannot be displayed as ASCII by throwing away bytes >= 0x80, or, for example, replace every contiguous run of bytes >= 0x80 with a question mark in an ASCII-only display system and still display all ASCII-compatible sequences perfectly.
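A minimal sketch of that second trick (the `ascii_display` helper is hypothetical, not from any particular library):

```python
import re

def ascii_display(data: bytes) -> str:
    # Replace each contiguous run of bytes >= 0x80 (in UTF-8,
    # that's every multibyte sequence) with a single question mark;
    # everything below 0x80 is plain ASCII and passes through.
    return re.sub(rb'[\x80-\xff]+', b'?', data).decode('ascii')

assert ascii_display('héllo'.encode('utf-8')) == 'h?llo'
assert ascii_display('日本'.encode('utf-8')) == '?'
```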
EDIT: It's also very much not the case that the days of starting read mid-stream are over.
The ability to pick up byte streams mid-run is not useless. It's required for example to jump to arbitrary locations in a file and make sense of the data you find there.
Imagine a text editor displaying a large CSV file. Wouldn't you mind having to read everything between two locations if you jump forward from one to the other? Or read everything from the start if you jump backwards? Even if the text editor stores its own synchronization points, it has to read the file completely at least once, which can be annoying for very large files.
Also, many of the simpler text tools that only look for ASCII bytes wouldn't work. For example, printing the last lines.
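A sketch of how such a tool resynchronizes after seeking to an arbitrary byte offset: UTF-8 continuation bytes all match 0b10xxxxxx, so you just skip them (function name is mine):

```python
def utf8_resync(buf: bytes, pos: int) -> int:
    # After seeking to an arbitrary offset, skip forward past
    # continuation bytes (0b10xxxxxx) to the start of the next
    # character. In valid UTF-8 at most 3 bytes are skipped.
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

# '€' is 3 bytes (E2 82 AC); landing inside it resyncs to 'a'.
assert utf8_resync('€abc'.encode('utf-8'), 1) == 3
# Landing on an ASCII byte requires no skipping at all.
assert utf8_resync(b'abc', 0) == 0
```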
Also, I don't know about that other system, but what about robustness if there's one bad byte somewhere in the file?
> If not for that requirement, we would have just had UTF-8 implemented as regular VLQs.
There is a requirement that 7-bit ASCII character codes would not appear as a part of non-ASCII character encoding, so NULL-terminated strings, slash path separation and other similar issues could be handled by existing charset-independent code. That would not work with regular VLQs.
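This property is easy to demonstrate concretely; the `vlq_encode` sketch below is a plain MIDI-style VLQ (names mine), not any shipped encoding:

```python
def vlq_encode(cp: int) -> bytes:
    # MIDI-style VLQ: 7 payload bits per byte, high bit marks
    # continuation bytes.
    out = [cp & 0x7F]
    cp >>= 7
    while cp:
        out.append(0x80 | (cp & 0x7F))
        cp >>= 7
    return bytes(reversed(out))

# Every byte of a UTF-8 multibyte sequence is >= 0x80, so ASCII
# bytes like NUL or the path separator '/' (0x2F) can never
# appear inside one.
assert all(b >= 0x80 for b in '€'.encode('utf-8'))

# A plain VLQ has no such guarantee: U+17AF happens to end in the
# 7-bit group 0x2F, so its encoding contains a literal '/' byte,
# which would confuse charset-independent path-handling code.
assert b'/' in vlq_encode(0x17AF)
```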
... when it could have been done much more simply by putting the continuation bit in position 6 and keeping bit 7 set across the entire multibyte sequence:
> 4) The first byte should indicate the number of bytes to follow in a multibyte sequence.
This is actually a pretty smart requirement for efficiency.
This way the parser can know if the input bytes it has constitute a UTF-8 sequence by only looking at the first byte. That saves a lot of unnecessary processing.
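A sketch of that first-byte dispatch (the count of leading one bits gives the sequence length; function name is mine):

```python
def utf8_seq_len(first: int) -> int:
    # The number of leading one bits in the first byte gives the
    # sequence length; zero leading ones means plain ASCII.
    if first < 0x80:
        return 1
    if first < 0xC0:
        raise ValueError("continuation byte, not a sequence start")
    if first < 0xE0:
        return 2
    if first < 0xF0:
        return 3
    return 4

assert utf8_seq_len(ord('a')) == 1
assert utf8_seq_len('é'.encode('utf-8')[0]) == 2
assert utf8_seq_len('€'.encode('utf-8')[0]) == 3
```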
> if the next n bytes constitute an incomplete, but so far valid, multibyte character. Nothing is written to
The "so far valid" forces implementations to still look at the other bytes, which undoes the optimization in UTF-8.
I suspect somebody was asleep at the wheel there.
Regarding your question about the wasted bits: I don't think it matters much. It certainly does not for English text like we exchange here, where the important case is the 0-0x7f case which UTF-8 handles optimally.
Your maximum-compactness variant means that if you start in the middle of a sequence, you can't tell that you're in the middle rather than at the start. UTF-8 does this well, in my opinion: you can seek anywhere in a file and then move forwards or backwards to the beginning of a sequence without fear of misparsing the middle of a correct sequence as a different sequence.
Yes, the ultra-compact encoding will not be self-synchronizing. But the bit-6-continuation variant yields more payload bits per byte, which would give better compression in many languages. Regarding efficiency, you still have to read every byte either way, the difference being one check per character (every 1, 2, 3, or 4 bytes) in a more complex algorithm vs. a check on every byte in a simpler algorithm (I haven't checked which beats the other in performance, but they look pretty similar).
This feels a lot like mixing transport layer metadata into the data format, potentially giving a small processing performance benefit at the cost of huge data wastage when certain languages are encoded.
> Is there a language that consistently uses codepoints with more than 2 bytes?
There are definitely (small) communities using scripts that lie entirely in the SMP. For example, Mru, Adlam, Takri, Pracalit, Miao, Wancho, etc. Most of these are either historic scripts that have mostly been supplanted by unified ones (esp. Devanagari) but retain usage in some areas, or languages that did not have a pre-colonial writing system that are attempting to reclaim cultural identity with a new script.
But yes, I don't think there are major communities that consistently do so. My anecdata from a few Mandarin- and Japanese-speaking friends is that SIP characters rarely occur.
Really if anything, emoji obsessives, mathematicians using bold/fraktur characters, and historical linguists/anthropologists would have the biggest savings.
Yeah, it's fine that the encoding is not infinitely extendable. But if it didn't encode the length into the first byte, you'd have 1 extra bit for 2-byte sequences, 2 extra bits for 3-byte sequences, etc. That means that you can double the number of possible glyphs per 2-byte sequence (for an extra 2048 glyphs in the 2-byte range). For 3 bytes, it's 2 bits for an extra almost 200,000 glyphs before having to jump to 4 bytes. This would be a huge boon to East Asian languages like Chinese and Japanese.
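The arithmetic checks out; as a sketch, assuming the variant carries 6 payload bits in each byte after the first (one extra bit per trailing byte vs. UTF-8's length-prefixed form):

```python
# UTF-8's 2-byte form (110xxxxx 10xxxxxx) carries 11 payload bits;
# dropping the length prefix frees one more bit.
utf8_2byte    = 1 << 11            # 2048 codepoints
variant_2byte = 1 << 12            # 4096: the extra bit doubles it
assert variant_2byte - utf8_2byte == 2048

# The 3-byte form (1110xxxx 10xxxxxx 10xxxxxx) carries 16 bits;
# the variant frees two more.
utf8_3byte    = 1 << 16            # 65536 codepoints
variant_3byte = 1 << 18            # 262144
assert variant_3byte - utf8_3byte == 196608   # "almost 200,000"
```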
If there is no length encoding, start, middle, end and ASCII byte encodings must not overlap to support correct decoding of any subslice of a document.
If there is a length encoding on the start, then start, middle and ASCII must not overlap.
Because then, when you jump into the middle of a stream and see a 11xxxxxx byte, you don't know whether it's the beginning of a valid multibyte character that you have to keep, or part of a multibyte sequence that you have to discard.
Same with the second encoding: if it so happens that one 1xxxxxxx byte takes the value 11010101, how would you know whether it's a multibyte start or a continuation?
Basically, in the current solution, if you read the head of a multibyte character you know it's a valid head; if you read the head of a multibyte character in the proposed encoding, you can't know whether it's valid.
Yes, the second one is not self-synchronizing, so it's out. However, I don't see much utility in getting first-character detection from the data format itself. You're very unlikely to miss the beginning of a stream of characters in any modern system, and in the event that your medium has no error detection, it would be trivial to add a zero synchronization byte to the beginning of the field.
My point is that embedding transport level metadata into the data format seems like a poor tradeoff because of the sheer inflation potential of the data encoding (potentially 10%), when a single guard byte per field would solve the problem of first character truncation detection.
> I don't see much utility in getting first-character detection ..
> .. in the event that your medium has no error detection ... add a zero synchronization byte
How would such encoding deal with non-utf8-safe editors, copy-pasting, programs truncating, then inserting previously broken sequences, etc?
Encoding obviously can't fix all errors, but it is quite useful if broken sequences are obviously broken and non-broken sequences remain valid when handling text in non-aware/non-safe applications.
I think in UTF8 two splices can generate a random character, but in a characters + splice combination, the character remains recognizable in any order and combination and a lone splice is also recognizable as an error.
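The UTF-8 half of that claim is easy to demonstrate: a truncated lead byte spliced onto a stray continuation byte can decode as a perfectly valid, unrelated character, while a lone continuation byte is always detectably broken.

```python
# Splice the first byte of 'é' (C3 A9) onto the second byte of
# '€' (E2 82 AC): the result C3 82 is valid UTF-8 for 'Â'.
lead = 'é'.encode('utf-8')[:1]       # b'\xc3'
tail = '€'.encode('utf-8')[1:2]      # b'\x82', a continuation byte
assert (lead + tail).decode('utf-8') == '\u00c2'   # decodes cleanly

# A lone continuation byte, by contrast, always raises an error.
try:
    tail.decode('utf-8')
    broken_detected = False
except UnicodeDecodeError:
    broken_detected = True
assert broken_detected
```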
There are now encodings even more efficient than VLQ, but they blur the line between encodings and compression algorithms. Most propose to trade the efficiency of encoding 7-bit chars for the ability to squeeze a few thousand common Chinese characters into 16 bits.
The idea I heard was to always encode in 4-byte blocks and use some form of delta encoding. Some variations allow for better than O(log N) character-position search. And given that you can feed 32-bit-wide data into NEON/SSE, and blocks are always 32-bit aligned, you can have that working faster than UTF-8.
> If not for that requirement, we would have just had UTF-8 implemented as regular VLQs.
While not technically equivalent, a synchronized byte stream allows backward scanning, which is beneficial for many Unicode-related algorithms. In fact, synchronization is the simplest way to do that.
Provided that you know where the string boundary is, so you don't scan off the front into differently encoded data. So you can't scan backward through, say, some miscellaneous binary-encoded number preceding the string.
Ah, good point. I keep confusing which one is which. I was specifically thinking of UTF-1, which is definitely not synchronized nor backward-scannable.