I've lightly advocated for a while that emoji shouldn't be part of the Unicode s...

philplckthun · on Oct 29, 2019

I think this works great for apps like Slack, in user-land so to speak, but isn't realistic for the Unicode standard, not only because these entry formats are in English.

Modifiers and combinators aren't exclusive to emojis, but apply to all kinds of glyphs in other languages and writing systems as well. Arabic script even has some common ligatures for common expressions.

A lot of complexity simply doesn't stem from emoji in Unicode, a lot of the complexity comes from all the writing systems that Unicode supports. Admittedly, emoji are kind of an oddball addition to Unicode, but they're by far not the most complex part of it.

derefr · on Oct 29, 2019

And even in Slack/IM apps, custom emoji codels only “work” because people aren’t often trying to 1. interoperate with external services using, or 2. parse archived logs of, arbitrary message text.

If either of these were common (e.g. slack bots that tried to parse semantic meaning from regular text rather than responding to commands; or Slack logs of OSS communities being public-access on the web) then you’d see a lot of people up-in-arms around the fact that these custom codels are used.

But since text in these group-chat systems is private, ephemeral, and mostly a closed garden, it never bubbles up into becoming an issue anyone else has to deal with.

(Though, on a personal note, I wrote my own VoIP-SMSC-to-Slack forwarder because Slack is a much better SMS client than any of the ones built into VoIP softphone apps, and I’m irritated every day that Slack auto-translates even Unicode-codepoint-encoded emoji from a source postMessage call, into its own codels in the canonical message stream. I don’t want to send my SMS contact “:thumbs_up:”, I want to send them U+1F44D!)

danShumway · on Oct 29, 2019

Does Unicode make this any better?

Let's say Matrix supports custom emoji, and it wants to interoperate with Discord.

Does representing those user-added emoji as custom Unicode entries instead of plain text make it easier or harder to interop with Discord?

derefr · on Oct 29, 2019

Think of Unicode like HTML. What’s better for interoperation and machine-readability: a custom SGML entity (like you could use up through HTML4); a custom HTML tag; or a normal HTML tag with an id/class attribute that applies custom CSS styling?

One way to encode a ‘custom emoji’ would be encoding it as a variation of some existing emoji. Use an as-yet-unused variation-selector on top of an existing emoji codepoint, and then “render” that codepoint-sequence on receipt by the client back to an image (but in a way where, if you copy-and-paste, you get the codepoint-sequence, not the image. In HTML, you’d use a <span> with inline-block styling, a background-image, and invisible content.) This is pretty much what Slack was doing with the flesh-tone variation-selectors, before Unicode standardized those. But you can do it for more than just “sub”-emoji of a “parent” emoji; you can do it to create “relatives” of an emoji too, as long as it’d be semantically fine in context to potentially discard the variation selector and just render the base emoji.

Or, if your emoji could be described as a graphical (or more graphical) depiction of an existing character codepoint, you could just use the “as an emoji” variation-selector on that codepoint.

Or, rather than a variation-selector, if you have a whole range of “things to combine with” (i.e. the possibilities are N^2), you could come up with your own private emoji combining character for use with existing base characters. The “cat grinning” emoji U+1F639 could totally have been (IMHO should have been) just a novel “face on a cat head” combining-character codepoint, tacked onto the regular “face grinning” emoji codepoint. Then you could have one such combining-character for any “head” you like! (And this would also have finally allowed clients to explicitly encode “face floating in the void” emoji vs. “face on a solid featureless sphere” emoji, where currently OSes decide this feature arbitrarily based on the design language of their emoji font.)

And, I guess, if all else fails, you could do what Unicode did for flags (ahem, “region selectors”), and reserve some private-use space for an alphabet of combiner-characters to spell out your emoji in. That way, it’s at least clear to the program, at the text-parsing level, that all those codepoints make up one semantic glyph, and that they are “some kinda emoji.” Custom-emoji-aware programs (like your own client) could look up which one in a table of some kind; while unaware programs would just render a single unknown-glyph glyph.

I don’t suggest this approach, though—and there’s a reason the Unicode standards body hasn’t already added it: it’d be much better to just take your set of emoji that you’re about to have millions of people using (and thus millions of archivable text documents containing!) and just send them to the Unicode standards body for inclusion as codepoints. Reserving emoji codepoints is very quick, because the Unicode folks know that the alternative is vendors getting impatient and doing their own proprietary thing. Sure, OSes won’t catch up and add your codepoint to their emoji fonts for a while—but the goal isn’t to have a default rendering for that character, the point is to encode your emoji using the “correct” codepoint, such that text-renderers 100 years from now will be able to know what it was.

So, please, just get your novel emoji registered, then polyfill your client renderer to display them until OSes catch up. Ensure your glyph is getting sent over the network, and copy-pasted into other apps, as the new Unicode codepoint. Those documents will be correct, even if the character doesn’t render as anything right now; if the OS manufacturers think the character is common (i.e. if it ever gets used in text on the web or in mobile chat apps), they’ll provide a glyph for it soon enough. And, even if the OS makers never bother, and you’re stuck polyfilling those codepoints forever, there’ll still be entries in the Unicode standard describing the registered codepoints, for any future Internet archaeologists trying to figure out what the heck the texts in your app were trying to communicate, and for any future engineers trying to build a compatible renderer. (Consider what Pidgin’s developers went through to render ICQ/AIM emoji codels. You don’t want to put engineers through that.)

paulddraper · on Oct 29, 2019

> So, please, just get your novel emoji registered

Yeah I'm not gonna register my company icon.

WorldMaker · on Oct 29, 2019

> A lot of complexity simply doesn't stem from emoji in Unicode, a lot of the complexity comes from all the writing systems that Unicode supports.

Yes, it's not that emoji are doing anything odd compared to lots of real world languages, it's that emoji are just latin script writers' "first"/"only"/"most likely" interaction with that sort of stuff. The fascinating bit that that if it weren't for emoji a lot of these problems would still go unfixed in a lot of real languages, but because emoji are fun and everyone wants to use them we've seen a lot of Unicode fixes brought about by emoji that's a rising tide to lift other Unicode boats.

frandroid · on Oct 29, 2019

Some subcontinental scripts have ligatures between vowels and consonants in almost every word so that ship has sailed anyway.

wodenokoto · on Oct 29, 2019

> Unicode was designed to provide code-point-by-code-point round-trip format conversion to and from any preexisting character encodings, so that text files in older character sets can be converted to Unicode and then back and get back the same file, without employing context-dependent interpretation [1]

The problem is that emoji are part of a major text-encoding format, so unicode needs to adopt those. Once those were added, everyone and their mother suddenly wanted their emoji in unicode as well.

> It feels like extra complexity for no benefit other than, "we need a standard".

That sums up the whole standard for better and for worse.

[1] https://en.wikipedia.org/wiki/Unicode#Mapping_to_legacy_char...

derefr · on Oct 29, 2019

> we need a standard

That’s exactly the benefit. Think of Unicode as “Archive.org for the semantics of text codels.” Every time someone invents a text codel (like those examples you gave, where Slack invented their own text codels), Unicode takes the semantics behind that codel and standardizes their own codepoint equivalent to it, so that Unicode documents will have a way of encoding that text codel at the Unicode-text level.

If Unicode doesn’t do this, then people have to use other encodings on top of Unicode to specify their text; they come up with incompatible encodings of the same semantic characters; and suddenly we’re back to having to create code-pages to specify what set of incompatible codels each text stream is using. It also means that we go back to having to create OS-specific, or GUI-toolkit-specific rendering encodings to translate those code-pages into a text-layout-system specific “normalized” encoding; and thus, we also go back to having OS/GUI-toolkit specific fonts.

> Emoji aren’t programmatically generated.

No, not usually, but they can totally be programmatically consumed. They’re machine-readable! The modifiers allow emoji to be “structured text” in the ML sense. Since there’s one codepoint that always means “sad face”, an algorithm can attach a meaning to that codepoint apart from its modifiers, to do e.g. sentiment analysis. It’s much harder to learn when you have 100 different “sad face” codepoints; let alone when different documents use different incompatible encodings with different codels to refer to the same “sad face.”

> better support for custom emoji

That’d be like saying a dictionary should better support private in-jokes. Unicode, like a dictionary, watches for when things become common or important to define, and then defines them. Until then, it considers them “someone’s attempt at injecting a gibberish in-joke into language.” In-jokes don’t need an entry; but as soon as an in-joke becomes a word, it does. Because people don’t look up in-jokes, but they do look up slang (= ascended in-jokes), so if you want your dictionary to be useful, you’d better give them a definition for slang terms.

Actually, that’s a very good equivalence: emoji are to Unicode as slang terms are to dictionaries. In both cases, people think it’s ridiculous that the authors of the text would include them; but in both cases, the usefulness of the work would be hampered if they didn’t.

MauranKilom · on Oct 29, 2019

> having emoji just be a purely clientside rendering feature

Please, please, PLEASE don't do that.

Have you ever tried pasting any kind of code (especially C++) into Skype? It always turns into a fiasco of stupid emojis. And even if you could disable it on your client side (which you can't even do), it would turn up as emoji garbage on the receiving end - and you might not even know/realize!

Just thinking about "let's show everything that's might be an emoji as an emoji" makes me shudder...

I'd also note that "you can alias it for different languages" is bound not to work. The same word means different things in different languages - which ones do you accept as aliases? How are native speakers supposed to know when their word for something can't be used because it means something else in any other language? I mean, look up the German adjective that means "fat"...

munmaek · on Oct 29, 2019

I don't think emoji is the worst of unicode.

The major issue I find with it is there aren't limits to how many are added. Emoji are growing at a fantastical pace, much like a tumor, because "if X got added, why didn't Y?", and so on.

As the other child reply mentioned, this is latin-centric. Totally useless for people that can't or don't want to speak English, and having to duplicate it for every language defeats the entire purpose.

danShumway · on Oct 29, 2019

> and having to duplicate it for every language defeats the entire purpose.

Does it?

If you want accessible emoji, you need a display label for each emoji in every language that the client supports. Whether or not the emoji is represented in Unicode doesn't change that.

Also when users are entering emoji into a client application, they need a way to quickly filter and get to the emoji that they want -- that requires having a label in their native language, and putting emoji in Unicode doesn't fix that either.

In any practical setting, accessibility/input means you need multiple labels for different languages anyway. So why are we trying so hard to avoid them in the final text representation? If :cat: :gato: :貓: were part of the emoji definition in a big list somewhere, it would only make it easier to support multiple languages, since I wouldn't need to compile my own list of translations/labels.

munmaek · on Oct 29, 2019

> Also when users are entering emoji into a client application, they need a way to quickly filter and get to the emoji that they want -- that requires having a label in their native language, and putting emoji in Unicode doesn't fix that either.

This is different from -forcing- people to memorize the specific identifier in order to input it.

How do I find a specific face emoji if I don't know the name? I use the system's emoji picker tool and simply scroll through it. On OSX, it shows me recently used ones, which suffices. I don't really have to think about the names at all.

danShumway · on Oct 29, 2019

> use the system's emoji picker tool and simply scroll through it.

I don't see how this would change. You'd pick the cat emoji exactly the same way you do right now, and since your system language is set to English, iOS would insert :cat:, then immediately render it to an image representation.

Users wouldn't need to memorize a label any more than they currently need to memorize the Unicode positions.

paulddraper · on Oct 29, 2019

Emojis are a pictogram language. [1]

Regardless however, they aren't really the hard thing for text editing.

Half the world use CJK.

[1] https://en.wikipedia.org/wiki/Pictogram