> *This is a solved problem.* Not … really. Yes, we "know" the solution, but the...

paulddraper · on March 10, 2017

> You are not a high-level language if your standard library struggles with Unicode

So C++, Lisp, Java, Python, Ruby, PHP, and JS are not high-level languages.

HN teaches me something new every day.

jcranmer · on March 10, 2017

What would you say Java is missing? Sure, it does have the "oops, we implemented Unicode when they said we only needed 16 bits" problem, but unlike, say, JS, it actually handles astral plane characters well (e.g., the regex implementation actually says that . matches an astral plane code point rather than half of one).

It does have all the major Unicode annexes--normalization (java.text.Normalizer), grapheme clusters (java.text.BreakIterator), BIDI (java.text.Bidi), line breaking (java.text.BreakIterator), not to mention the Unicode script and character class tables (java.lang.Character). And, since Java 8, it does have a proper code point iterator over character sequences.
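To make a few of those points concrete, here is a minimal sketch using only standard JDK classes. The class name UnicodeBasics and its helper methods are made up for illustration; the sample strings are an astral plane character (U+1F600) and an "e" followed by a combining acute accent.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical demo class; only the JDK calls inside it are real APIs.
public class UnicodeBasics {
    // U+1F600 lies outside the BMP, so a Java String stores it as a surrogate pair.
    static final String ASTRAL = "\uD83D\uDE00";

    // "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form of e-acute).
    static final String DECOMPOSED = "e\u0301";

    // Number of UTF-16 code units in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of code points, via the Java 8 code point stream.
    static long codePoints(String s) {
        return s.codePoints().count();
    }

    // True when the regex "." consumes the whole string.
    static boolean dotMatchesWhole(String s) {
        return Pattern.matches(".", s);
    }

    // Canonical composition (NFC).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(ASTRAL));          // 2
        System.out.println(codePoints(ASTRAL));         // 1
        System.out.println(dotMatchesWhole(ASTRAL));    // true
        System.out.println(nfc(DECOMPOSED).length());   // 1
    }
}
```

The astral character takes two code units but counts as one code point, and the regex "." consumes both halves of the surrogate pair at once.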

paulddraper · on March 10, 2017

I stand corrected. Java 8 has everything you could expect.

nathancahill · on March 10, 2017

Python 3 does pretty well.

paulddraper · on March 10, 2017

It does better than most, though Python 3 lacks grapheme support in the standard library, requiring developers to use a library like uniseg. I.e. it "lacks an effective way to deal with text that doesn't involve dragging in third-party libraries", and is thus evidently not a "high-level language".

nurettin · on March 10, 2017

How does Ruby struggle with unicode?

lucaspiller · on March 10, 2017

This is from a couple of weeks ago. There are still a few things broken, but what languages do have full support out of the box?

http://blog.honeybadger.io/ruby-s-unicode-support/

slobotron · on March 10, 2017

Perl6 https://perl6advent.wordpress.com/2015/12/07/day-7-unicode-p...

paulddraper · on March 10, 2017

Swift

Immortalin · on March 10, 2017

Sean1708 · on March 10, 2017

Doesn't look like Go does: https://github.com/golang/go/issues/14820

paulddraper · on March 10, 2017

Well, for one, I can't even write a portable unicode string literal.

> "\xAA".split ''

That works on a platform whose encoding is UTF-32, but not on one where it is UTF-8.

danbruc · on March 9, 2017

That is what I meant: there is an existing algorithm to do this, and I mentioned it because the author tried to come up with one. That JavaScript fails to provide an implementation is, well, too bad, but this is of course a problem one may have to solve in any language.

And while other languages provide the necessary support at the language or standard library level, I would guess there are quite a few developers out there who are not even aware that what they are looking for is a way to enumerate grapheme clusters. But now some more know, and if they made a good language choice, it is now a solved problem for them.

klodolph · on March 10, 2017

It's not a problem that comes up that often, to be fair. In many of the cases where you think you'd need to split a string like that, you have some library doing the work for you. One of the main purposes is to figure out where valid cursor locations are in text editors... but in JavaScript, you just put a text field in your web page and let the browser do the heavy lifting. Same with text rendering... hand it off to a library which does the right thing.
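For the cursor-location case, a rough sketch of what such a library does, written in Java with java.text.BreakIterator. The class name CursorStops and the method boundaries are made up for illustration, and how the iterator treats more complex sequences (some emoji, for instance) may depend on the JDK version.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; BreakIterator itself is the real JDK API.
public class CursorStops {
    // Returns the UTF-16 offsets at which a cursor could sit,
    // i.e. the boundaries between user-perceived characters.
    static List<Integer> boundaries(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<Integer> stops = new ArrayList<>();
        for (int i = it.first(); i != BreakIterator.DONE; i = it.next()) {
            stops.add(i);
        }
        return stops;
    }

    public static void main(String[] args) {
        // "e" + U+0301 COMBINING ACUTE ACCENT, followed by "x".
        // There is no stop at offset 1, between the base letter and its accent.
        System.out.println(boundaries("e\u0301x")); // [0, 2, 3]
    }
}
```

Splitting the string at those offsets, rather than at every code unit or code point, keeps the accent attached to its base letter.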

libeclipse · on March 10, 2017

I didn't know perfect unicode support in the stdlib was a requirement for being a high-level language.