You don't even need emoji; e.g. U+63,U+300 (c̀) is one character but two code points, and U+1C4 (DŽ) is two characters but one code point. There's also U+1F3,U+308 (dz̈), which is two characters in two code points, but segments incorrectly if you split on code points instead of characters.
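A quick sketch of that mismatch in Python, where `len()` and slicing operate on code points (same examples as above):

```python
# Python strings are sequences of code points, so len() and slicing
# count/split code points, not user-perceived characters.

s1 = "c\u0300"        # c + combining grave accent: renders as one character (c̀)
assert len(s1) == 2   # ...but it's two code points

s2 = "\u01C4"         # DŽ: reads as two letters, but is a single code point
assert len(s2) == 1

s3 = "\u01F3\u0308"   # dz digraph + combining diaeresis (dz̈)
# Splitting on code points strands the combining mark away from its base:
assert s3[:1] == "\u01F3"   # the diaeresis is lost from this slice
```

(Proper grapheme-cluster segmentation isn't in the Python standard library; you'd need a third-party library for that.)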
It's ambiguous how to encode latin-small-a-with-ring-above (U+E5 vs U+61,U+30A). Decoding is also ambiguous (most infamously Han grapheme clusters), but I'm not fluent enough in any of the affected languages to have a ready example.
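This encoding ambiguity is what Unicode normalization forms address; a small sketch using Python's standard `unicodedata` module (the precomposed form of a-with-ring-above is U+E5):

```python
import unicodedata

precomposed = "\u00E5"    # å as one code point (latin small letter a with ring above)
decomposed = "a\u030A"    # a + combining ring above: same character, different encoding

assert precomposed != decomposed                                # raw comparison fails
assert unicodedata.normalize("NFC", decomposed) == precomposed  # compose first...
assert unicodedata.normalize("NFD", precomposed) == decomposed  # ...or decompose first
```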
Your example is a little confusing because of the use of the word "characters". I think glyphs would be clearer (U+1C4 is two glyphs). Though it might not actually be two glyphs; it's dependent on how the font implements it.
At the end of the day, an OpenType font's substitutions can do far crazier things than these "double glyph" examples. I once made a font that took a 3-letter acronym, substituted it for a single hidden empty glyph in the font, then substituted that hidden glyph into 20 glyphs. You were left with something like your U+1C4 in software: you could only highlight all 20 glyphs or none of them. And this was happening on text where every input code point was under 127. People often don't realize how much logic and complexity can be put into a font, or how much the font is responsible for doing.
"Squiggles" is my preferred word here. There are a bunch of technical terms like glyph and code point and grapheme, but I find squiggles are often what somebody wanted when they used something that works on "characters" and were disappointed with the results.
Advice elsewhere is correct. You almost certainly don't want anything in your high-level language other than strings. No "array of characters", no slicing; all that stuff is probably a smell. They're like the APIs in many crypto libraries for doing ECB symmetric encryption: 99% of the people using them would have been better off if the Stack Overflow answer they're about to read were explaining what they should be doing instead.
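As one illustration of why code-point slicing is a smell, the classic "reverse a string" one-liner silently detaches combining marks (a sketch in Python):

```python
s = "cafe\u0301"   # "café" with é stored as e + combining acute accent
r = s[::-1]        # naive "reverse the characters" via code-point slicing

# The combining accent now precedes its base letter, so it has nothing
# valid to attach to and the rendered output is garbled:
assert r == "\u0301efac"
```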
No, "characters" is the correct term; U+1C4 is two characters: latin-capital-d followed by latin-capital-z-with-caron (or whatever you want to call the little v thing). As you note, this means that non-buggy fonts will generally use two glyphs to render it, but that's an implementation detail; a font could process the entire word "the" as one glyph, or render "m" as dotless-i + right-half-of-n + right-half-of-n, but that wouldn't affect how many characters either string has.
Using "characters" is confusing because I don't know if you mean before or after shaping. U+1C4 is unquestionably a single Unicode code point. I've heard people call this one logical character. Other people might say how many characters it requires when encoded in UTF-8 or UTF-16. After shaping, some people might say it is 1 or 2 "shaped characters". It's all horribly confusing. I find the term code point more precise.
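To make those counts concrete, here's how U+1C4 measures differently depending on what you count (Python, where `len()` on a string counts code points):

```python
ch = "\u01C4"   # DŽ, a single code point

assert len(ch) == 1                       # 1 code point ("logical character")
assert len(ch.encode("utf-8")) == 2       # 2 bytes in UTF-8
assert len(ch.encode("utf-16-le")) == 2   # 1 UTF-16 code unit (2 bytes)
```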
There's no shaping involved; I'm not talking about the implementation details of the rendering algorithm. There is a D, followed by a Ž. This only seems confusing because Unicode (and - to be fair - other, earlier character encodings) willfully misinterprets the term "character" for self-serving purposes.
Also, that's seven code points, not five.