Randomly truncating words can have the same effect in any language. It's outrigh...

asabil · 2024-06-05T17:36:38 1717608998

Yes, but you don’t end up with different glyphs. Arabic script has letter shaping, which means a letter can have up to 4 shapes depending on its position within the word. If you chop off the last letter, the previous one, which used to have a “middle” position shape, suddenly changes into a “terminal” position shape.
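
To make the "up to 4 shapes" point concrete, here is a small sketch that looks up the positional forms Unicode records for one Arabic letter (BEH, U+0628) in the Arabic Presentation Forms-B block. It only illustrates that the forms exist. In ordinary text the string stores just U+0628 and the renderer picks the shape, so truncation changes what is drawn without changing the remaining code points. The choice of BEH and the scanned range are my own picks for illustration.

```python
import unicodedata

BEH = "0628"  # ARABIC LETTER BEH, chosen arbitrarily for illustration

# Scan Arabic Presentation Forms-B and collect the code points whose
# compatibility decomposition is a single positional form of BEH,
# e.g. "<initial> 0628".
forms = {}
for cp in range(0xFE70, 0xFF00):
    parts = unicodedata.decomposition(chr(cp)).split()
    if len(parts) == 2 and parts[1] == BEH:
        forms[parts[0].strip("<>")] = chr(cp)

for position, glyph in forms.items():
    print(f"{position:9} U+{ord(glyph):04X} {unicodedata.name(glyph)}")
```

These presentation-form code points exist mainly for compatibility; the sketch uses them only as a convenient list of the shapes a shaping engine chooses between.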

CRConrad · 2024-06-06T12:00:16 1717675216

I'm thinking even bog-standard European umlauts, cedillas, etc. go multi-byte in UTF-8? (Take a string of ÅÄÖåäöÜü and chop it off at various byte limits and see.)
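
A minimal sketch of that experiment, assuming the string is in precomposed (NFC) form and encoded as UTF-8. The helper name `chop` is made up for this example. Every letter in the sample takes two bytes, so any odd byte limit cuts a character in half.

```python
import unicodedata

# Normalize to NFC so each letter is a single precomposed code point,
# regardless of how this source file happens to store the literal.
s = unicodedata.normalize("NFC", "ÅÄÖåäöÜü")
data = s.encode("utf-8")

def chop(raw: bytes, limit: int) -> str:
    """Cut at a byte limit and decode, marking broken sequences with U+FFFD."""
    return raw[:limit].decode("utf-8", errors="replace")

print(len(s), "code points,", len(data), "bytes")
for limit in range(1, 6):
    print(limit, repr(chop(data, limit)))
```

Odd limits end in the replacement character; a strict decode (`errors="strict"`) would raise `UnicodeDecodeError` instead.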

gmueckl · 2024-06-06T16:50:59 1717692659

This is just the general behavior of truncating strings by code point when they contain decomposed characters (a base letter followed by combining marks). It can also affect accents, etc.
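
A short sketch of the code-point case, using "Zürich" as an arbitrary example word and a hypothetical helper `truncate_code_points`. In decomposed (NFD) form the "ü" is stored as "u" plus U+0308 COMBINING DIAERESIS, so cutting between the two silently drops the accent.

```python
import unicodedata

word = "Zürich"  # arbitrary example word
composed = unicodedata.normalize("NFC", word)    # "ü" is one code point
decomposed = unicodedata.normalize("NFD", word)  # "u" followed by U+0308

def truncate_code_points(text: str, n: int) -> str:
    """Keep the first n code points (Python str slicing works per code point)."""
    return text[:n]

print(len(composed), len(decomposed))
print(repr(truncate_code_points(composed, 2)))    # accent kept
print(repr(truncate_code_points(decomposed, 2)))  # accent lost
```

Avoiding this in general means truncating on grapheme cluster boundaries rather than code points, which the Python standard library does not do on its own.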

panzi · 2024-06-06T12:56:29 1717678589

I don't remember the details, only that it was a bigger deal than with umlauts. I'll see if I can find the talk again.