What's new in Unicode 9.0 (babelstone.blogspot.com)
101 points by ingve on Jan 4, 2016 | 76 comments



We managed to get Power Symbols into Unicode (http://unicodepowersymbol.com/) and it was all sparked by a conversation on HN https://news.ycombinator.com/item?id=6828102


That's really awesome. Thanks a lot for this, that is actually a need I came across when working on some desktop environment features.


Does anyone know if there are plans to incorporate "powerline" symbols, a la https://github.com/powerline/powerline? Or are those already Unicode compliant, and simply missing from most font libraries?


Powerline symbols are private use codepoints.

I think it might be better this way. You need a font that supports them anyway, and they're not intended to convey anything in the absence of such a font, so it doesn't matter that they're private use.

If the symbols got official Unicode codepoints, switching to the official codepoints would break all existing Powerline-compatible fonts. And if something better than Powerline came along it would have to happen all over again.


I understand that it would be backwards-incompatible, however it would be nice to make it official, considering it's been proven useful.

It wouldn't break existing Powerline fonts, however, because those private-use codepoints would still work as long as they're not assigned to anything else.


This is still exactly what the private use area is for: "I want a code that tells this particular font to show this particular symbol, and nobody will need to know what this code meant 20 years from now".
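A quick illustration: Python's unicodedata reports private-use codepoints as category "Co", and the glyphs commonly cited for Powerline (assuming the usual U+E0A0..U+E0A2 and U+E0B0..U+E0B3; forks may differ) all sit in the BMP Private Use Area, U+E000..U+F8FF:

    # Sketch: the codepoint list below is the commonly documented
    # Powerline set and may not match every fork.
    import unicodedata

    POWERLINE = [0xE0A0, 0xE0A1, 0xE0A2, 0xE0B0, 0xE0B1, 0xE0B2, 0xE0B3]

    for cp in POWERLINE:
        # Category "Co" means "Other, Private Use"; these have no official name.
        print(f"U+{cp:04X} -> {unicodedata.category(chr(cp))}")  # always "Co"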


Anyone scratching their head over the reasoning behind the ballooning number of emoji might enjoy reading this post by the same author: http://babelstone.blogspot.co.uk/2015/04/whats-new-in-unicod...


  The question is, now that there is a mechanism for defining skin tone
  colours for Unicode characters, will this be enough? Or will users
  demand a similar mechanism to specify hair colour and eye colour? And
  will users want to expand the concept of modifier characters to cover
  any colour and any Unicode character?
I find the thought absolutely horrifying.
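For reference, the existing mechanism is plain codepoint juxtaposition: a Fitzpatrick modifier (U+1F3FB..U+1F3FF) directly follows the base emoji. A minimal Python sketch, with THUMBS UP SIGN chosen arbitrarily as the base:

    # Each skin tone is its own codepoint appended after the base emoji;
    # renderers that understand the pair display a single tinted glyph.
    base = "\U0001F44D"  # THUMBS UP SIGN
    for cp in range(0x1F3FB, 0x1F400):  # FITZPATRICK TYPE-1-2 .. TYPE-6
        print(base + chr(cp))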


The beer glass emoji is filled with lager, I want to specify amber ale or stout color. We need beer modifiers!


Why?


Because it's encoding writing systems, not graphics systems. There is a case for emoji, but frankly, if you want to change the colours of a font, let the metadata of the document do it.


Add modifiers for proportions and it'll turn Unicode plaintext into a general-purpose drawing language.


> Add modifiers for proportions and it'll turn Unicode plaintext into a general-purpose drawing language.

Not even that: it's all special-purpose stuff with narrow use cases.


I think Unicode already has enough features to address any (x,y) point with some precision (think Zalgo). Now add colors and scales, and soon someone will start encoding pictures in plaintext (and I don't mean ASCII art)...
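For anyone who hasn't met Zalgo: combining marks stack on a single base character, so vertical extent already behaves like a crude second axis. A tiny Python example:

    # Four combining marks piled onto one base character; a renderer
    # stacks them all in the same column.
    base = "e"
    marks = "\u0300\u0301\u0302\u0303"  # combining grave, acute, circumflex, tilde
    print(base + marks)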


The most amusing quote in all of this is buried in the [Submitting Emoji Character Proposals][1] documentation linked to by that:

    There is a misperception that such petitions play a large role in selecting
    emoji. For example, the commercial petitions for TACO played no part in its
    selection, because there was no evidence of reliability.
"Enough with the Internet petitions already!!!"

[1]: http://www.unicode.org/emoji/selection.html


That was really interesting and gave me some good insight into how new characters even work. Thanks!


One upside is that after 9.0 is released he can update that post with U+1F926.


I'm currently refactoring LibreOffice code around font handling, and I have to say text is complicated. I've had to do a lot of reading about Windows, Unix and OS X font and text handling, and to be honest I really think so far that Apple has the cleanest platform for handling text.

I could be wrong about this, and I'm happy to be challenged (in fact I welcome it, with some reasoning, because I'm still getting my head around all the systems). Anyway, just an aside.


I'd just like to ping a thanks for the effort you put in.

LibreOffice is pretty popular in India as a teaching tool.

Just wanted to draw your attention to two points: "Libre" is impossible to pronounce in Asia, which is why people are not able to Google for you (how do I tell my friend to Google for X?).

It would be great if you could take Google Noto fonts into account. There are several vernacular language users in India who would love to switch over to Linux.

Emoji! In fact, it's not a bad idea at all for LibreOffice to have an independent font installer (the way the Atom editor does plugins).


The work I do is tiny compared to the other developers :-) I'm really getting more out of contributing to the LibreOffice code base than The Document Foundation gets from me! But thank you, it's always great to hear that someone appreciates our efforts.

It's interesting to hear that LibreOffice is being used in Indian education - in my opinion Indic scripts are by far the most advanced and complex writing systems, even more so than Arabic and East Asian scripts!

In terms of fonts, we use Graphite, Pango and HarfBuzz in an attempt to wrangle our font handling. We've got some serious text layout issues (look for the comments in the code around DXArray!) but the basic guts are there. It's a complex issue, and you can see how over 30 years there have been many, many stabs at getting international fonts and text handling right. I think I have so far counted 10 classes that just handle fonts... and have recently removed one (FontInfo) that didn't seem to actually do anything!

So refactoring such an old module in the project might take some time. Also bear in mind that I'm (sadly) monolingual, so at the moment I'm mainly trying to streamline class hierarchies, get some saner class interfaces, merge classes where appropriate, review the font mapping code and try to understand how we deal with three very different approaches to font and text handling on OS X, Windows and the rest of the Unix world...

In terms of Noto, I think it would be unwise of us to bundle a font installer. It's not a bad idea, but LibreOffice UI code is currently tightly coupled to Star's VCL, which was at one point a cross-platform visual component library that could stand on its own: that has now gone by the wayside, and the LibreOffice team is currently trying to wrangle it into a more stable and responsive framework. But it's quite hard to make another program out of it (sadly).

Another reason is that the distributions are actually better off bundling the fonts themselves via their package management systems; for Windows and OS X, if that's the concern, I'd suggest embedding the fonts you want to use in the documents themselves.

As for the name: unfortunately that won't change now. Blame that on Oracle, who decided not to give the TDF the trademark and instead gave it to Apache, where the code is currently bitrotting away :-(


The best thing imho would be having distros include the Noto fonts by default.


"What are letters?"

"Kinda like mediaglyphs except they're all black, and they're tiny, they don't move, they're old and boring and really hard to read."

-- The Diamond Age (Neal Stephenson)


Those emojis, including the taco, will be there for generation after generation of people to ponder and behold. The taco seems like a pretty transient thing to put into something as important as Unicode.


IMO the entire emoji-in-Unicode idea is ridiculous: it's a hack around the inability to use arbitrary markup across platforms. But we're stuck with it, because Apple said "we're doing this" after seeing Japanese phone makers do it with arbitrary encodings last decade, and the entire rest of the industry caved, completely without regard for how much worse it makes text rendering, processing and layout engines (to which they say "you can just not support it", and instead turn Twitter and the rest of the web into mojibake).

But that's just my opinion on the matter. I'm certain I'm wrong and they're actually just the best thing since sliced bread since apparently every woman I've ever met loves the damned things.


I'm torn on this. The best part is that the emoji are at least getting standardized. I remember the bad days of conveying this by relying on either images or fonts that might not be installed on the target system, which was especially a cross-platform problem. Now one may be relying on Unicode code points that may not be _supported_ by the target font, but that feels better to me than the old options.

What strikes me as odd is that they're adding so many. Sure, it makes sense that Unicode could be kind enough to support conveying happiness or sadness in a message, but "Person doing cartwheel"...? Any alphabet has boundaries for its letters; I think the emojis should have received a boundary too. They are a standardizing body, so why not... just standardize it? In advance. Treat emojis like a language of emotions. If not, we'll soon enough (if not already...) end up with the same problem as before: just because of the sheer size of the emoji space and the effort needed to design a font for it, we can no longer rely on them all being implemented correctly, which means they are no longer useful as a standardized set of symbols.

If you've been there from the start, like Apple, sure, then it's no big deal: just a few new symbols per year to implement. If you're designing a new font to support emojis, you risk needing a whole team of designers just for this silly subset of symbols alone. There's also a mounting user interface problem of how to even pick and organize them all.


I think it's actually pretty neat that they are regular Unicode characters, as they can then be applied in any text field, making for fun stuff like using them in address book contact names or calendar event names or even the terminal. That wouldn't work if markup was required.


I think that's exactly his point.

Where humans in the past could do anything in freeform (pen and paper), we now have an entire industry limited to representing that as text only, while our human need for adding extra "stuff" is still there.

If we had an industry-wide "standard" text-representation (with associated UI-widgets and controls) which had the ability to include more than just text, this wouldn't be a problem and we wouldn't need to standardize each new "symbol" we want to represent as "text" in our applications through Unicode.

Using Unicode for this certainly feels like inappropriate piggy-backing and a giant hack.


Creating a "standard" text-representation UI widget and markup that works universally sounds like a much larger engineering effort compared to adding some code points to an existing standard, with a much larger room for errors and differences in the implementations. Perfect is the enemy of... something that (already) works :)


What you said is pretty much the textbook definition of short-sightedness :). Perfect is the enemy of good, therefore in things that are actually important (like critical infrastructure) we should not stop when we reach "good enough" :). Or, worse is sometimes better, but usually it's just worse.


> Creating a "standard" text-representation UI widget and markup that works universally sounds like a much larger engineering effort compared to adding some code points to an existing standard

However, it would arguably be far more useful.

> Perfect is the enemy of... something that (already) works :)

...for creating cute text messages and not much else.


You can't read "anything in freeform." If I scribble a bunch of swirls on a page, that's not going to be communication.

And we can still do that today. Go to the store and buy a pen and some paper.


> That we have an entire industry where humans in the past could do anything in freeform (pen and paper) we are now limited to representing that as text only.

So, how do I get pen-and-paper notes from Montana to New Mexico in less than a minute? It's still within one country, no wars are going on in the region, should be easy, right?


Scan them and email them.


Fax?


Unicode includes ideograms used by only a few hundred people, and some that are no longer used except in academic circles. Why would you not include ideograms that are currently in use by millions if not billions of people?


That might be true for the heart, emoticons and several others, but I'm not sure if that statement applies to the taco.

Where does it end? I'm very curious how it's determined when something is worthy of inclusion. At what point is Unicode no longer an appropriate place to store graphics?


Where are you located? I'm in the southwestern US and it's extremely clear that the taco is not going away anytime soon. And ideograms in the Unicode set do not have to be universal to be included; e.g. CJK characters are not used in the western hemisphere.


Given that Unicode already encodes things like U+1F365 FISH CAKE WITH SWIRL DESIGN, I think we can squeeze in the taco as well.

https://codepoints.net/U+1F365?lang=en


Only if it goes through the relevant processes. Look at the nightmare they have around racial and gender lines with some of the human characters. Had the process been handled better, it wouldn't have been a problem.

Frankly, I'd like to know the names of the people in these ad hoc committees who approve changes to such an important standard. They need to be made accountable.


The Industry determines what is worthy of inclusion in this Industry Standard. If you want to interoperate with Facebook, you need all the Facebook codes. If you want to interoperate with Line, you need all the Line codes, and WeChat, WhatsApp, eBay, Alibaba and even GitHub. You want a Taco code point because the data you are being fed contains Tacos, and you want to know it is a Taco rather than a USER_CODEPOINT+XXXXX. And if you don't care about Tacos, you can ignore it, because you know it is a Taco and not something important or corruption. And you can just pass it on down the line to some other process that does care about Tacos.


I'd argue that emoji are going to be culturally short lived and don't record information in the same way that language characters do. It would be better to move them into a separate standard.


Why do you think they'll be short lived? Emoji can be more expressive than pure text, and they have a tendency to take on a meaning that can't quite be described in words. For example, a quick thumbs-up is a great way to show you acknowledged something.


A few basic emoticons - fine. But something like a third of emojis are only legible if you know their name (as rendered by the particular program you're using; and most of the icons in popular applications are simply ugly, but that's a topic for another rant). I can understand FACE WITH TEARS OF JOY when I see it, but INFORMATION DESK PERSON, TYPE-4? Is this going to be useful 20 years from now? And then you have stuff like FACE WITH LOOK OF TRIUMPH, where the picture bears little resemblance to the official meaning, which leads one to wonder what your interlocutor meant and how the icon is displayed on their machine.
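Those are the official Unicode character names, and Python can look them up, which makes the legibility problem easy to demo:

    # The glyph you see varies by platform, but the official name is fixed.
    import unicodedata
    for ch in ("\U0001F602", "\U0001F481", "\U0001F624"):
        print(f"U+{ord(ch):X} {unicodedata.name(ch)}")
    # U+1F602 FACE WITH TEARS OF JOY
    # U+1F481 INFORMATION DESK PERSON
    # U+1F624 FACE WITH LOOK OF TRIUMPH
    # ("TYPE-4" is the Fitzpatrick modifier U+1F3FD appended after the base.)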


Wikipedia says that the taco is at least 500 years old. I'm not sure those were hard-shell tacos, but the description in Unicode just says 'taco'.


I've been dying for a modern pentathlon emoji. My life is complete.


The emojis here make me sad on multiple levels.

Most of all it makes me sad that they seem to be prioritised above some of the oldest written languages on the planet.


> Most of all it makes me sad that they seem to be prioritised above some of the oldest written languages on the planet.

1. is there any evidence for that claim or are you just making things up as you go?

2. which scripts are specifically blocked because of emoji "being prioritised"?

Keep in mind, Unicode 9 includes 74 new emoji and 7227 non-emoji codepoints including 4 new scripts.


I cannot see how they are "prioritised". I mean, does accepting them somehow delay accepting others?

Or do you mean that you would expect something to already be in Unicode that is not there yet, while PILE_OF_POO is? It seems to me it's mostly a matter of championing said languages, and no one has done it yet.

(I do not know or understand the reason for adding more emojis, but it's not like we'll run out of codepoints soon)


That makes me happy. Codifying tools that a wide range of people find useful for communication today should be a higher priority than serving the extremely specialized needs of a few historians.


Emoji are easy. Codifying languages is hard.


I'd been bracing for the inclusion of fistbump, selfie, and "talk to the hand" emoji for some time now. Now I can finally text in peace.


I wonder what's the expected lifetime of SELFIE though. Will it even be a thing in 20 years?


> Will it even be a thing in 20 years?

I'd expect emojis to go out of style before selfies. Selfies are just a quick, easy, and ubiquitous way of getting a photo of yourself (which people have wanted since the beginning of photography). They're just taking off now because the technology that allows them (front-facing phone cameras) has become widespread.


But you can't send the new emoji over basic SMS, because SMS uses a variant of UTF-16 from the era when people thought 16 bits was big enough. (So do Java and Windows, although there are hacks in both to get past 2 bytes.) The new emoji are all up in the astral planes, beyond 2 bytes.


> because SMS, uses a variant of UTF-16 from the era when people thought 16 bits was big enough

SMS uses 7-bit by default. https://en.wikipedia.org/wiki/GSM_03.38#GSM_7-bit_default_al...

Hacks? They're called UTF-16 surrogate pairs, not hacks. Officially it was UCS-2, but UCS-2 has been legacy for almost 20 years now. Modern phones select the encoding dynamically and use UTF-16 if the message cannot be encoded otherwise. You'll get up to 160 7-bit characters (some symbols take two characters) or 70 UTF-16 code units in Unicode mode.

I just wish everyone started to use UTF-8 already and dropped all the other nonsense.
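A rough sketch of that selection logic in Python; the alphabet below approximates the GSM 03.38 basic set and ignores the extension table, where symbols like '€' cost two septets:

    # If every character fits the 7-bit alphabet, a segment holds 160 chars;
    # otherwise the whole message falls back to UCS-2/UTF-16 and holds 70
    # code units (an astral emoji consumes two of them).
    GSM_BASIC = set(
        "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
        "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ¿abcdefghijklmnopqrstuvwxyzäöñüà"
    )

    def sms_capacity(text):
        if all(ch in GSM_BASIC for ch in text):
            return "GSM 7-bit", 160
        return "UCS-2/UTF-16", 70

    print(sms_capacity("hello"))             # ('GSM 7-bit', 160)
    print(sms_capacity("hello \U0001F600"))  # ('UCS-2/UTF-16', 70)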


Emoji have always been beyond 2 bytes. The Unicode spec also includes "surrogate pairs", which allow a higher-plane code point to be represented in UTF-16 as 4 bytes.
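The arithmetic is simple enough to show, taking U+1F926 FACE PALM from the article as the example:

    # Derive the UTF-16 surrogate pair for an astral codepoint by hand,
    # then cross-check against Python's own encoder.
    cp = 0x1F926
    v = cp - 0x10000              # 20-bit offset above the BMP
    high = 0xD800 + (v >> 10)     # lead surrogate
    low = 0xDC00 + (v & 0x3FF)    # trail surrogate
    print(f"U+{cp:X} -> U+{high:X} U+{low:X}")  # U+1F926 -> U+D83E U+DD26
    assert chr(cp).encode("utf-16-be") == bytes.fromhex("D83EDD26")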


There are a few 2-byte emoji:

0x2639 Frowning face

️0x263a Smiling face

(Hacker News doesn't speak much Unicode; the Unicode symbols won't pass through.)


So what happens if I put a standard smiley in a message? (Grinning face Unicode: U+1F600, UTF-8: F0 9F 98 80)

Edit: I see, it disappears.


Hmm. Does HN use MySQL with "utf8" encoding as backend storage? ;)


For those not getting the joke: MySQL has a character set called "utf8" which is not in fact UTF-8 and will (depending on settings) either truncate text when it meets a 4-byte character or raise an error.

More recent versions also support real UTF-8, calling it "utf8mb4".
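The underlying issue is easy to see from byte lengths: MySQL's "utf8" caps storage at 3 bytes per character, and astral-plane emoji need 4 in real UTF-8.

    # ASCII, a BMP symbol, and an astral emoji under real UTF-8.
    for ch in ("a", "\u263A", "\U0001F600"):
        print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")
    # U+0061 -> 1, U+263A -> 3, U+1F600 -> 4 (too wide for MySQL "utf8")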


This made filing a bug about Thunderbird not sizing astral plane code points correctly slightly more hilarious than it should have been (Mozilla's Bugzilla instance runs on MySQL)...


Everywhere I've used mysql, the default has been to silently truncate data that doesn't fit in the "utf8" encoding! :(


If that were the case, astral characters would be removed (e.g. Shavian or the emoji block) but BMP symbols like U+263F MERCURY or U+262C ADI SHAKTI would be left alone, and they're not: "", ""

Since U+A420 YI SYLLABLE JJUOX goes through (ꐠ), HN likely strips most symbols explicitly (but not all: U+00B6 PILCROW SIGN (¶) passes unmolested).


I'm curious about the real value of supporting dead languages. The number of texts in said languages is presumably no longer growing.

There's no harm in it, particularly in the further-out planes; it's just that my knee-jerk reaction is "why?".


I can't understand how you don't see the value in supporting ancient and dead languages with precise full text search, enabling lossless reproduction and alternative representations through printing, web sites, etc. Why wouldn't we want to try to keep our pre-digital records alive by immortalizing them digitally?


As mentioned, one of the big benefits here is reproduction and text storage to allow searching, etc. Another biggie not mentioned, however, is newly found texts. Yes, those languages are probably not growing, and maybe nothing new is even being produced in the languages we bring into Unicode, but that still leaves the possibility of finding texts previously undiscovered or simply never archived.

Regardless, I would prefer to have all characters of all languages, dead or alive, in Unicode so I can handle any new or old data that data set requires without modification.


Some of us like to read those ancient documents online in their original form, not in one of the multiple translations, some of which don't fully preserve the original meaning.

Our culture goes all the way back to the first days stories were told around fire in caves, in cold winter nights, before writing was a thing.

Everyone should strive to preserve mankind's culture.


Aren't searchable, plaintext copies of ancient documents worthwhile? What about learning materials?


I need that pancakes emoji.


I love pancakes! If they're going to add any more foods then it should be pancakes, damn it :)

In all seriousness, I'm very curious to see how far this will go. Do we really want these graphical glyphs in Unicode, or would they be better suited to a separate encoding (but perhaps with a way of providing interoperability?). Honestly I'm out of my wheelhouse trying to think of ways to do this better, so I don't really know.


> but perhaps provide a way of interoperability?

There's the rub. Unicode is a fine way to provide interoperability, has been since at least Unicode 1.1 (which added — amongst many others — U+263A WHITE SMILING FACE or U+25EE UP-POINTING TRIANGLE WITH RIGHT HALF BLACK) and probably 1.0 (but I can't be arsed to look up 1.0's symbolic codepoints)


Mmm avocado!


More emoji, seriously? Is this what the future of tech standards looks like? When is this validation-hungry-teen-catering-ego-fest going to stop...

I'm getting really sick of the direction mainstream tech (which is, I guess, driven by mainstream culture, or lack thereof) is going these days.


Even worse: Kickstarter campaigns for emoji, e.g. https://www.kickstarter.com/projects/657685639/where-is-the-...

You should read the section titled "Who Controls Our Emoji?" if you want to get a little bit angrier.


Pictograms have been with us for years. Hieroglyphs and Chinese/Japanese characters all have their roots in drawings of real-life items. They're easy to understand and can frequently convey more meaning in a single glyph than a dozen conventional words.


I don't think Unicode should be used to convey pictographs. I understand the need to cover writing systems such as hieroglyphs, but adding random pictographs (like emojis) and permutations (like skin colors) seems like too much complexity for what is expected of Unicode.



