
> This is a solved problem.

Not … really. Yes, we "know" the solution, but the terrible APIs that make up so many languages' standard string types goad the programmer into choosing the wrong method or type.

JavaScript has, to an extent, the excuse of age. But the language still (to my knowledge) lacks an effective way to deal with text that doesn't involve dragging in third-party libraries. You are not a high-level language if your standard library struggles with Unicode. Even recent additions to the language, such as leftPad, ignore Unicode (which, in that particular example, renders the function mostly useless).




> You are not a high-level language if your standard library struggles with Unicode

So C++, Lisp, Java, Python, Ruby, PHP, and JS are not high-level languages.

HN teaches me something new every day.


What would you say Java is missing? Sure, it does have the "oops, we implemented Unicode when they said we only needed 16 bits" problem, but unlike, say, JS, it actually handles astral plane characters well (e.g., the regex implementation actually says that . matches an astral plane code point rather than half of one).

It does have all the major Unicode annexes--normalization (java.text.Normalizer), grapheme clusters (java.text.BreakIterator), BIDI (java.text.Bidi), line breaking (java.text.BreakIterator), not to mention the Unicode script and character class tables (java.lang.Character). And, since Java 8, it does have a proper code point iterator over character sequences.
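For anyone who wants to poke at the APIs named above, here is a minimal sketch, not a definitive recipe. The class name UnicodeSketch and its helper methods are made up for illustration; the library calls themselves (String.codePoints, String.matches, java.text.Normalizer, java.text.BreakIterator) are standard JDK. It assumes Java 8 or later, and it only exercises a combining mark and one astral code point, since how BreakIterator treats more elaborate sequences (emoji joined with ZWJ, for example) can vary between JDK versions.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class UnicodeSketch {

    // Counts code points instead of UTF-16 code units (Java 8+).
    public static int codePointCount(String s) {
        return (int) s.codePoints().count();
    }

    // True when the regex "." consumes the entire string, i.e. the
    // string is exactly one code point as far as the regex engine cares.
    public static boolean dotMatchesWhole(String s) {
        return s.matches(".");
    }

    // Canonical composition (NFC) via java.text.Normalizer.
    public static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Splits a string at the "character" boundaries reported by
    // BreakIterator, which keeps a base letter and its combining
    // marks together.
    public static List<String> graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(s.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String astral = "\uD83D\uDE00";   // one astral code point (U+1F600), two UTF-16 units
        String decomposed = "e\u0301x";   // 'e' + combining acute accent, then 'x'

        System.out.println(astral.length());               // UTF-16 code units
        System.out.println(codePointCount(astral));        // code points
        System.out.println(dotMatchesWhole(astral));       // regex sees one character
        System.out.println(nfc(decomposed).length());      // length after composition
        System.out.println(graphemes(decomposed).size());  // user-perceived characters
    }
}
```

The point of the sketch is that length(), code point count, and grapheme count are three different answers to "how long is this string", and the standard library gives you a way to ask each question explicitly.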


I stand corrected. Java 8 has everything you could expect.


Python 3 does pretty well.


It does better than most, though Python 3 lacks grapheme support in the standard library, requiring developers to use a library like uniseg. I.e. it "lacks an effective way to deal with text that doesn't involve dragging in third-party libraries", and is thus evidently not a "high-level language".


How does Ruby struggle with Unicode?


This is from a couple of weeks ago. There are still a few things broken, but what languages do have full support out of the box?

http://blog.honeybadger.io/ruby-s-unicode-support/



Swift


Go



Well, for one, I can't even write a portable Unicode string literal.

> "\xAA".split ''

That works on a platform where the encoding is UTF-32, but not on one where it is UTF-8.


That is what I meant: an algorithm for this already exists, and I pointed that out because the author tried to come up with one. That JavaScript fails to provide an implementation, well, too bad, but this is of course a problem one may have to solve in any language.

And while other languages provide the necessary support at the language or standard library level, I would guess there are quite a few developers out there who are not even aware that what they are looking for is a way to enumerate grapheme clusters. But now some more know, and if they made a good language choice, it is now a solved problem for them.


It's not a problem that comes up that often, to be fair. In many of the cases where you think you'd need to split a string like that, you have some library doing the work for you. One of the main purposes is to figure out where valid cursor locations are in text editors... but in JavaScript, you just put a text field in your web page and let the browser do the heavy lifting. Same with text rendering... hand it off to a library which does the right thing.


I didn't know perfect unicode support in the stdlib was a requirement for being a high-level language.



