
When adopting a new niche language like Zig, with hardly any following, frequent changes, and not even a 1.0 release, "which version of UTF-8" (as if that's an issue) is the least of your worries...

"Which third-party strings lib of several half-complete incompatible libs" will be a much realer concern...




For systems programmers the answer to "which third-party strings lib" is probably "None, write your own that fits with the rest of the system". A ready-made lib will be a lot of work to fit in - consider the choice of internal encoding, allocation, hashing, buffering, mutable operations, etc.

Assuming that you really want to use UTF-8 internally, which is probably a sensible choice, the reusable part of a string library is basically the UTF-8 encoder/decoder. A useful implementation of UTF-8 is about 100-200 lines; I could probably rewrite what I use in an hour or two without an internet connection. The rest of the work is integration stuff that doesn't make sense to put in a library IMO. The idea of a string library fits much better with garbage-collected and scripting languages (which includes C++ with its RAII mechanism, but consider that std::string and similar often cause bad performance).
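As a rough illustration (my own sketch, not from any particular library), the core of such a decoder in C looks something like this - real code would also reject overlong encodings, surrogates, and code points above U+10FFFF:

    /* Minimal UTF-8 decoder sketch (hypothetical, not from any
       particular library). Decodes one sequence from s (n bytes
       available), stores the code point in *cp, and returns the
       number of bytes consumed (1 on error, so callers can
       resynchronize). */
    #include <stddef.h>
    #include <stdint.h>

    #define UTF8_INVALID 0xFFFDu  /* replacement character */

    static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
    {
        if (n == 0) { *cp = UTF8_INVALID; return 0; }
        if (s[0] < 0x80) { *cp = s[0]; return 1; }  /* ASCII fast path */
        size_t len;
        uint32_t c;
        if      ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
        else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
        else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
        else { *cp = UTF8_INVALID; return 1; }      /* bad lead byte */
        if (n < len) { *cp = UTF8_INVALID; return 1; }
        for (size_t i = 1; i < len; i++) {
            if ((s[i] & 0xC0) != 0x80) { *cp = UTF8_INVALID; return 1; }
            c = (c << 6) | (s[i] & 0x3F);           /* append 6 payload bits */
        }
        *cp = c;
        return len;
    }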

Many programs, in particular non-graphical programs, don't need any UTF-8 code at all - UTF-8 handling is basically memcpy().


> Many programs, in particular non-graphical programs, don't need any UTF-8 code at all - UTF-8 handling is basically memcpy().

argv to main is utf8 on my system.


On Unix/Linux, it is binary data without any restrictions except that each argument is zero-terminated (the typical argument is probably UTF-8 if you have set a UTF-8 locale). You'll see exactly the bytes that the parent process passed as arguments to execlp() et al.
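To illustrate (a hypothetical example, not from anyone's actual code), the parent controls the exact argv bytes, so it can pass a deliberately invalid-UTF-8 argument and the child sees it verbatim:

    /* Passes a lone 0xFF byte (not valid UTF-8) as an argument to
       /bin/echo; the child receives exactly that byte. */
    #include <unistd.h>

    int main(void)
    {
        execlp("/bin/echo", "echo", "\xff", (char *)NULL);
        return 1;  /* only reached if exec fails */
    }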

On Windows, I believe it is Unicode converted to the current codepage.

In any case I don't need to care about it since I can simply treat arguments as ASCII-extended opaque strings as described.


> argv to main is utf8 on my system.

That sounds totally compatible with programs that don't know anything about utf8. Do programs need to normalize the utf8 you pass in before using it as an argument to open(2) or something?
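For illustration, a minimal sketch (my own, hypothetical) of that "don't know anything about utf8" approach - the argument goes to the kernel exactly as received, with no decoding or normalization:

    /* Hypothetical sketch of the opaque-bytes approach: the program
       never decodes or normalizes the argument - whatever bytes the
       parent passed in go to the kernel unchanged. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
        /* argv[1] may be UTF-8, Latin-1, or arbitrary bytes; open(2)
           doesn't care, and neither do we. */
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror(argv[1]); return 1; }
        close(fd);
        return 0;
    }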


How feasible would it be to defer string processing to the operating system so that the behavior of all software running on it is the same? Perhaps a new OS interface could be defined for this purpose using syscalls on Linux. At the very least, there should be one canonical set of algorithms per operating system, rather than everyone downstream reinventing the wheel. Please forgive me if this sounds absurd; I am not a low-level programmer.


I certainly don't want to pay for a syscall to do string encoding.

Also, a lot of the time the problem isn't that people are using fundamentally incompatible string libraries, but that there isn't one correct answer to the question they're asking and they chose different ways to convert the question into code. A reasonable question to ask is "How many extended grapheme clusters are in this string?" The answer is "It depends on what font you plan to use to render it." Not great!

Some programmers would still like to e.g. write a reverse() function that returns "🇫🇷" unmodified (because it is the French flag emoji) and returns "🇮🇭" when given "🇭🇮". If such people want to avoid having nonsensical behavior in their programs, the only solution available is to decide what domain the program should work on and actually deal with the complexity of the domain chosen.
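As a sketch of why the naive thing goes wrong (hypothetical code, my own, assuming valid UTF-8 input): reversing code point by code point turns the two regional indicators of "🇫🇷" (F then R) into R then F, which no longer renders as a flag - the grapheme cluster got torn apart:

    #include <stdio.h>
    #include <string.h>

    static size_t utf8_seq_len(unsigned char b)  /* bytes in this sequence */
    {
        if (b < 0x80) return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        return 4;
    }

    /* "Reverses" in by code point into out (same byte length). */
    static void reverse_by_codepoint(const char *in, char *out)
    {
        size_t n = strlen(in), w = n;
        for (size_t i = 0; i < n; ) {
            size_t len = utf8_seq_len((unsigned char)in[i]);
            w -= len;
            memcpy(out + w, in + i, len);  /* copy sequence to mirrored spot */
            i += len;
        }
        out[n] = '\0';
    }

    int main(void)
    {
        char out[64];
        /* U+1F1EB U+1F1F7: regional indicators F, R (the French flag) */
        reverse_by_codepoint("\xF0\x9F\x87\xAB\xF0\x9F\x87\xB7", out);
        printf("%s\n", out);  /* prints the R,F pair: not a flag anymore */
        return 0;
    }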


Because string processing is not slow enough already? Making string operations eat the performance cost of a context switch into and out of the kernel is not a good idea. A library is better. This isn't about strings, but I'm thinking of a language like Rust, where the "time" crate is not a language feature yet is the de facto standard that all the other libraries use. That's possible when a library of that quality exists early in the language's life.


Zig's C interop is pretty good though, and there must be some decent native Unicode library out there somewhere, right? ;)

I've worked on a full-duplex file synchronization system that had to work across multiple operating systems and file systems, with a variety of Unicode normalization schemes (and versions [1], which is why I introduced the question), and I'm personally satisfied that baking this into the language specification would be a mistake.

[1] For example, depending on the file system, there's simply no way to get the normalization right unless you reverse-engineer the actual table it's using, or probe the file system to do the normalization for you.


Would Bellard's libunicode[1] work?

[1]: https://github.com/bellard/quickjs/blob/master/libunicode.h



