Hacker News new | past | comments | ask | show | jobs | submit login

A nit - it's not UTF-8 or "multibyte" characters that's the main problem. The UTF-8 issue can be trivially resolved by decoding it into unicode code points. As long as you're fine with the truncated length not always corresponding to what you'd expect for Latin based alphabets it should be fine. (FWIW, if you are concerned with the displayed length, you'd need a font and a text layout engine to calculate the display length of displayed text)

The main issue with naïve truncation is that not every code point is a character (and I guess not every character is a glyph?). If you truncate the Unicode code point array at some unfortunate places like https://en.wikipedia.org/wiki/Ideographic_Description_Charac... , you'd just get gibberish or potentially very unintended results. (especially if you joined the truncated string with some other string)






Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: