A bug-fix 12 years in the making: Windows Unicode Support in OCaml 4.06.0

scardine · on Nov 4, 2017

Less than acceptable support for Unicode or it being a second class citizen is the most frequent reason why I ditch some interesting "exotic" languages.

hfjdkare · on Nov 4, 2017

Note that the key word here is "Windows", not "OCaml".

Less than acceptable support for Unicode or it being a second class citizen is the most frequent reason why I ditch some popular operating systems.

abiox · on Nov 4, 2017

> Note that the key word here is "Windows", not "OCaml".

Why is that?

fermuch · on Nov 4, 2017

Under "technical details" it goes into the why. What I understood is: OCaml tries to support pre-NT-era strings, since that's what the console uses, but NT-era files use a different encoding, and OCaml internally uses UTF-8 for everything. Since pre-NT Windows didn't handle file paths with Unicode, OCaml didn't either.

I'm not well versed enough in Windows to say whether this was a clever way to solve the problem of console output, or a really old hack that has only now received some care and attention.

slrz · on Nov 5, 2017

I don't think you can write a standard ANSI C program on Windows that opens a file specified on the command line when the file name contains characters not representable in whatever legacy charset Windows is using at the moment. At least that was the situation for many years. The article hints at some UTF-8-related changes in Windows 10.

For almost every other system, the obvious code (fopen(argv[1], ...), basically) does the right thing. On Windows you have to enter some crazy non-portable parallel universe where not even the signature of main() is the same.
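To make "the obvious code" concrete, here is a minimal sketch of it. The helper name `size_of_first_arg` is made up for illustration; a real program's `main(int argc, char **argv)` would just forward its arguments to it. On most systems the bytes in `argv[1]` go straight to the OS. On Windows, the narrow `char *` path is interpreted in the legacy code page, and the wide-character alternative lives behind the MSVC-specific `wmain` and `_wfopen`.

```c
#include <stdio.h>

/* Hypothetical helper: returns the size in bytes of the file named by
   argv[1], or -1 if no argument was given or the file cannot be opened. */
long size_of_first_arg(int argc, char **argv) {
    if (argc < 2) {
        return -1;
    }

    /* The "obvious" portable call: a narrow (char *) path passed as-is. */
    FILE *f = fopen(argv[1], "rb");
    if (f == NULL) {
        return -1;
    }

    /* Count bytes one at a time; slow, but uses only standard C. */
    long n = 0;
    while (fgetc(f) != EOF) {
        n++;
    }

    fclose(f);
    return n;
}
```

Nothing in this code mentions an encoding, which is exactly the point: the program has no reason to care, yet whether it can open a given file still depends on how the platform hands it `argv`.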

That's the reason why many programs don't support Unicode on Windows, despite there often being no reason for those programs to care about character encoding at all.

pjmlp · on Nov 5, 2017

POSIX has zero support for GUI code or proper Unicode, so of course those require platform-specific APIs, even if the target platform is fully POSIX compliant.

kobeya · on Nov 5, 2017

Because “Unicode” on Windows is UTF-16, which no one except Windows supports these days.
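For anyone unsure what the difference amounts to, here is a rough sketch of encoding the same code point both ways. The function names `utf16_encode` and `utf8_encode` are made up for illustration and are not from any particular library. UTF-16 uses one or two 16-bit units per code point, UTF-8 one to four bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if cp is a Unicode scalar value (not a surrogate, in range). */
static int is_scalar(uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Encode cp as UTF-16. Returns the number of 16-bit units written
   (1 or 2), or 0 if cp is not a valid scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (!is_scalar(cp)) return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}

/* Encode cp as UTF-8. Returns the number of bytes written (1 to 4),
   or 0 if cp is not a valid scalar value. */
size_t utf8_encode(uint32_t cp, uint8_t out[4]) {
    if (!is_scalar(cp)) return 0;
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}
```

A file name containing U+00E9 is therefore the single unit `0x00E9` in UTF-16 but the two bytes `0xC3 0xA9` in UTF-8, so a runtime that keeps UTF-8 strings internally has to convert at the boundary of every wide-character Windows API call.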

gulbanana · on Nov 5, 2017

It’s also the native encoding of both JavaScript and Java, two extremely common platforms.

kobeya · on Nov 5, 2017

Not very relevant to OCaml though.

pjmlp · on Nov 5, 2017

And .NET.

grumpyprole · on Nov 5, 2017

The corollary to that is that these "exotic languages" are often such an improvement over the mainstream offerings that such inconveniences have not prevented adoption in many cases.

ernst_klim · on Nov 5, 2017

What do you mean by that? Few languages support UTF-{8,16LE,16BE,32LE,32BE} plus normalization out of the box: Rust, C++, Python, and Go all require a library for that.

equalunique · on Nov 4, 2017

Very detailed article.