Hacker News

It seems way nicer to have a language that treats strings as byte arrays and use libraries to handle encodings than to have a language that treats strings as UCS-2 and use libraries to handle UTF-8 strings that live inside of UCS-2 strings.



I don't know; it has been a pain to work with strings in any language that does the above, and seldom (if ever) a problem with Java, Go, Swift, or even modern Python 3, and so on...


I don't get what's so hard about it. Most of my programs deal with UTF-8 simply by doing memcpy(). Parsing code just loops over the bytes and compares to ASCII characters (0-9, A-Z, a-z, \n, \r, \t ...). That's how UTF-8 was designed to be used.
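The byte-looping approach described here works because of a deliberate property of UTF-8: bytes 0x00–0x7F only ever appear as ASCII characters, never inside a multi-byte sequence. A minimal sketch (not from the thread; `split_fields` is an illustrative name):

```python
# Tokenize a UTF-8 byte string by scanning for ASCII delimiters,
# without decoding the payload. Safe because UTF-8 continuation and
# leading bytes of multi-byte sequences are all >= 0x80, so a plain
# byte comparison against '\t' or '\n' can never hit the middle of
# a multi-byte character.

def split_fields(data: bytes) -> list[bytes]:
    """Split on ASCII tab and newline bytes."""
    fields, start = [], 0
    for i, b in enumerate(data):
        if b in (0x09, 0x0A):  # '\t', '\n' -- plain byte comparisons
            fields.append(data[start:i])
            start = i + 1
    fields.append(data[start:])
    return fields

line = "naïve\t日本語\tdone\n".encode("utf-8")
print(split_fields(line))  # multi-byte text passes through untouched
```

The non-ASCII fields come out byte-identical to the input, which is exactly the "just memcpy() it" model.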


I assume you never had to deal with Unicode normalization?

When you send your Unicode string to an external system (for example, a storage server with a database) and later retrieve it, you may find that it has been normalized differently, so it no longer matches byte-for-byte what your program holds, and all of a sudden strcmp no longer works.

Or all kinds of weirdness like that, because every system outside of your program will handle Unicode differently and you will need to adapt to them; having a string library do most of the heavy lifting avoids every user having to rewrite a library from scratch.

Not to mention that every developer who rewrites Unicode handling functions will probably end up with a function whose behavior subtly differs from the others', which aggravates the problem for everyone who later tries to communicate with your system.


> I assume you never had to deal with Unicode normalization?

I hadn't, and as long as I control the data I'm displaying, I won't have to.

> Or all kinds of weirdness like that, because every system outside of your program will handle Unicode differently

Blame those systems, not me.

What you suggest is surrendering to the state of affairs, which we collectively self-inflicted.

When I have to deal with normalization issues and have to interface with external systems, I can still go looking for a library if I don't feel like implementing it on my own (which is likely).

But unless I need to do normalization, I'm way worse off with a complicated library than with just doing memcpy() or using a simple decode_codepoint() routine.
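`decode_codepoint()` is the commenter's hypothetical routine; a sketch of what such a minimal decoder looks like (validation of overlong encodings and surrogates omitted for brevity):

```python
# Decode one UTF-8 sequence starting at index i.
# Returns (code_point, bytes_consumed).

def decode_codepoint(data: bytes, i: int) -> tuple[int, int]:
    b = data[i]
    if b < 0x80:                      # 1-byte ASCII
        return b, 1
    if b >> 5 == 0b110:               # 2-byte sequence: 110xxxxx
        n, cp = 2, b & 0x1F
    elif b >> 4 == 0b1110:            # 3-byte sequence: 1110xxxx
        n, cp = 3, b & 0x0F
    elif b >> 3 == 0b11110:           # 4-byte sequence: 11110xxx
        n, cp = 4, b & 0x07
    else:
        raise ValueError("invalid leading byte")
    tail = data[i + 1:i + n]
    if len(tail) != n - 1:
        raise ValueError("truncated sequence")
    for b in tail:
        if b >> 6 != 0b10:            # continuation bytes are 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)   # accumulate 6 payload bits each
    return cp, n
```

For example, `decode_codepoint("é".encode("utf-8"), 0)` yields `(0xE9, 2)`.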


> Blame those systems, not me.

Yes, I completely agree with you, and if you don't need it, any Unicode handling library is overkill and adds more headaches than simply handling UTF-8 strings as byte arrays.

I just wanted to insist on the fact that some people will have to deal with these kinds of issues. These issues are self-inflicted, but it gets worse every time someone tries to reinvent the wheel or relies on byte arrays when they shouldn't.

Having a standard library in the language makes the issue less severe: the core of the language still handles only byte arrays, and for the cases where that's not enough there is still only one library, so you don't add your own subtly different mishandling of the standard by implementing it yourself.

So memcpy is fine, but that's about it: for example, please don't use strcmp when you need to sort data alphabetically, and please don't try to reimplement the standard algorithms designed for that; otherwise you will be part of the problem.
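A sketch of why strcmp-style ordering fails here: byte-wise order is code-point order, not alphabetical order. Proper sorting should use the Unicode Collation Algorithm (e.g. via a library like PyICU); the NFKD+casefold key below is only a rough stand-in to show the difference.

```python
import unicodedata

words = ["zebra", "éclair", "Apple"]

# What strcmp gives: compare the raw UTF-8 bytes.
# 'A' (0x41) < 'z' (0x7A) < 'é' (0xC3 0xA9), so "éclair" sorts last.
bytewise = sorted(words, key=lambda s: s.encode("utf-8"))

# Rough approximation of collation: case-fold, then decompose so
# that 'é' compares like 'e' followed by a combining accent.
rough_collation = sorted(
    words, key=lambda s: unicodedata.normalize("NFKD", s.casefold())
)

print(bytewise)         # ['Apple', 'zebra', 'éclair']
print(rough_collation)  # ['Apple', 'éclair', 'zebra']
```

Even this approximation ignores locale-specific rules (in Swedish, for instance, accented letters really do sort after 'z'), which is exactly why the standard algorithms exist.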


Can you really fix anything by changing the native string type? You'll inevitably need to exchange bytes with different systems that demand different encodings and different normalization forms.


I'm not sure how to explain the difference in experience here. Python 3's built-in string encoding support has been a source of endless pain to me, and Lua's belief that strings are byte arrays has been much easier to deal with.
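A sketch of the friction being described (an illustrative example, not from the thread): bytes that are not valid UTF-8, common in filenames and network data, cannot become a Python 3 str without an explicit error policy, whereas a byte-array string type would just carry them through untouched.

```python
raw = b"report_\xff.txt"      # 0xFF never appears in valid UTF-8

# Strict decoding forces the program to confront the encoding here:
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Python's lossless escape hatch: smuggle the bad byte through as a
# surrogate so the original bytes can be recovered on re-encode.
text = raw.decode("utf-8", errors="surrogateescape")
assert text.encode("utf-8", errors="surrogateescape") == raw
```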



