
Randomly truncating words can have the same effect in any language. It's outright trivial to find examples in English or German. I don't understand why one has to invoke Arabic script for a good example.



Yes, but you don’t end up with different glyphs. Arabic script has letter shaping: a letter can have up to four shapes depending on its position within the word. If you chop off the last letter, the previous one, which used to have a “middle” position shape, suddenly changes to its “terminal” position shape.


I'm thinking even bog-standard European umlauts, cedillas, etc. go multi-byte in UTF-8? (Take a string of ÅÄÖåäöÜü and chop it off at various byte limits and see.)
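
A rough sketch of that experiment in Python, assuming the string is in precomposed (NFC) form and encoded as UTF-8. The helper names `truncate_bytes` and `truncate_bytes_safe` are made up for illustration:

```python
import unicodedata

# Force precomposed form so each letter is a single code point (2 bytes in UTF-8).
text = unicodedata.normalize("NFC", "ÅÄÖåäöÜü")

def truncate_bytes(s: str, limit: int) -> str:
    """Naive cut at a byte limit; a split character shows up as U+FFFD."""
    return s.encode("utf-8")[:limit].decode("utf-8", errors="replace")

def truncate_bytes_safe(s: str, limit: int) -> str:
    """Same cut, but silently drop an incomplete trailing character."""
    return s.encode("utf-8")[:limit].decode("utf-8", errors="ignore")

if __name__ == "__main__":
    for limit in range(1, 6):
        print(limit, repr(truncate_bytes(text, limit)), repr(truncate_bytes_safe(text, limit)))
```

Every odd byte limit lands in the middle of a two-byte character here, so the naive version ends in a replacement character while the "safe" one just comes out one letter shorter.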


This is just the general behavior of truncating strings by code point when they contain decomposed characters (a base letter followed by combining marks). This can also affect accents, etc.
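
To illustrate the code point case, here is a minimal Python sketch, assuming the input is in decomposed (NFD) form. `truncate_keep_marks` is a hypothetical helper, and it only handles combining marks; it is not full grapheme cluster segmentation:

```python
import unicodedata

# "Müller" in decomposed form: 'u' followed by U+0308 COMBINING DIAERESIS.
word = unicodedata.normalize("NFD", "M\u00fcller")

def truncate_keep_marks(s: str, limit: int) -> str:
    """Cut at a code point limit without separating a base letter from its marks."""
    end = min(limit, len(s))
    # If the first dropped code point is a combining mark, back up until the
    # cut falls before the base character, so the whole letter is dropped.
    while 0 < end < len(s) and unicodedata.combining(s[end]) != 0:
        end -= 1
    return s[:end]

if __name__ == "__main__":
    print(len(word))                           # 7 code points, not 6
    print(repr(word[:2]))                      # naive cut: diaeresis is lost
    print(repr(truncate_keep_marks(word, 2)))  # drops the whole letter instead
```

The naive slice turns "Mü" into "Mu", which is a different letter rather than a shorter word. Normalizing to NFC before truncating avoids this for letters that have a precomposed form, but not for arbitrary base-plus-mark combinations.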


I don't remember the details, only that it was a bigger deal than with umlauts. I'll see if I can find the talk again.



