Languages that were developed before the advent of Unicode, like C and C++, generally leave the encoding of strings up to the system; this gets complicated, because there are a number of different legacy 8-bit encodings, along with legacy encodings that don't fit into 8 bits. So they have two character types: char, which is always at least 8 bits, and wchar_t, an implementation-defined wide character type that is 16 bits on some platforms and 32 bits on others.
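For concreteness, here's a minimal C++ sketch (the numbers printed depend on the platform and compiler) showing that char is fixed at CHAR_BIT bits, at least 8, while wchar_t varies:

    #include <climits>
    #include <cstdio>

    int main() {
        std::printf("char:    %zu bits\n", sizeof(char)    * CHAR_BIT);  // always CHAR_BIT (>= 8)
        std::printf("wchar_t: %zu bits\n", sizeof(wchar_t) * CHAR_BIT);  // 16 on Windows, 32 on most Unix
    }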
When Unicode was first designed and released as Unicode 1.0, it was envisioned as a 16-bit universal encoding, with the expectation that you would use it just like the existing 8-bit character encodings, only with 16-bit character types instead. This encoding is known as UCS-2.
However, experience and the unification of the Unicode standard with ISO 10646 showed that 16 bits would not be sufficient for all of the characters in all of the world's writing systems. To fit CJK characters into that 16-bit space, a large effort was made to unify characters from different languages that represented basically the same thing but might be written slightly differently in Traditional Chinese, Simplified Chinese, Japanese, Korean, or the historical Vietnamese writing system. Additionally, a lot of decisions had to be made about which historical or obscure characters to include, and so there were many characters which couldn't be encoded properly in early Unicode.
This led to Unicode 2.0 introducing a surrogate mechanism, now known as UTF-16, in which two reserved ranges of 16-bit values (the high and low surrogates) are used in pairs to represent code points beyond the first 65,536, expanding Unicode from a single 16-bit plane of characters to 17 16-bit planes.
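The arithmetic is simple enough to sketch in C++: a code point above U+FFFF is split into two 10-bit halves, and those 2^20 extra values are what give you the 16 additional planes on top of the original one.

    #include <cstdint>
    #include <cstdio>

    int main() {
        char32_t cp = U'\U0001F600';                         // U+1F600, outside the original 16-bit plane
        uint32_t v  = static_cast<uint32_t>(cp) - 0x10000;   // 20 bits remain
        uint16_t hi = static_cast<uint16_t>(0xD800 + (v >> 10));    // top 10 bits -> high surrogate
        uint16_t lo = static_cast<uint16_t>(0xDC00 + (v & 0x3FF));  // low 10 bits -> low surrogate
        std::printf("U+%X -> 0x%X 0x%X\n", (unsigned)cp, (unsigned)hi, (unsigned)lo);
        // prints: U+1F600 -> 0xD83D 0xDE00
    }

Decoding just reverses the arithmetic, which is why an unpaired surrogate is never valid on its own.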
Meanwhile, Ken Thompson and Rob Pike reworked an earlier 8-bit encoding of ISO 10646 called UTF-1 into UTF-8, which had a number of desirable properties: it was a superset of ASCII, the interpretation of the ASCII subset of bytes never depended on state, and it was self-synchronizing, so if you started reading in the middle of a multi-byte character, you could always find the start of the next character.
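A rough C++ sketch of the self-synchronizing property; the only assumption is that the input bytes really are UTF-8:

    #include <cstdio>

    // A continuation byte in UTF-8 always has the bit pattern 10xxxxxx.
    static bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

    int main() {
        // "aéz" as raw UTF-8 bytes: 'é' is the two-byte sequence C3 A9.
        const unsigned char s[] = { 'a', 0xC3, 0xA9, 'z', 0 };
        const size_t len = 4;

        size_t i = 2;                               // start mid-character, on the A9 byte
        while (i < len && is_continuation(s[i]))
            ++i;                                    // skip continuation bytes
        std::printf("next character boundary at byte %zu ('%c')\n", i, s[i]);
        // prints: next character boundary at byte 3 ('z')
        // Note also that the ASCII bytes 'a' and 'z' appear only as themselves;
        // bytes 0x00-0x7F never occur inside a multi-byte sequence.
    }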
These properties meant it was possible to use UTF-8 with all of the traditional APIs and file formats that were compatible with 8-bit extensions of ASCII, so you could migrate to Unicode support without significant changes to those APIs and file formats. Also, for text consisting mostly of the ASCII subset, which includes many ASCII-based markup languages, it is considerably more compact than UTF-16.
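A quick way to see the size difference, sketched in C++ with an arbitrary ASCII-only snippet:

    #include <cstdio>
    #include <string>

    int main() {
        // The same ASCII-only markup snippet in both encodings.
        std::string    utf8  = "<p>hello</p>";   // ASCII text is already valid UTF-8
        std::u16string utf16 = u"<p>hello</p>";  // the same text as 16-bit code units

        std::printf("UTF-8:  %zu bytes\n", utf8.size());                      // 12
        std::printf("UTF-16: %zu bytes\n", utf16.size() * sizeof(char16_t));  // 24
    }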
In the early days, however, many languages and systems that adopted Unicode did so via UCS-2, and then, with the advent of Unicode 2.0 and surrogate pairs, needed to adapt to UTF-16. This was done somewhat inconsistently, and it can cause a lot of pain and confusion.
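One common symptom, sketched here with C++'s std::u16string standing in for any UTF-16 string type: code that counts 16-bit code units, as UCS-2-era APIs did, sees two "characters" where there is only one code point.

    #include <cstdio>
    #include <string>

    int main() {
        std::u16string s = u"\U0001F600";                     // one code point, U+1F600 (outside the BMP)
        std::printf("16-bit code units: %zu\n", s.size());    // prints 2: a surrogate pair
    }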
I've probably already written more than is necessary, so I won't go into the history of each system, but here's a quick survey of a few popular platforms and languages and their native string encodings:
* C and C++: Support 8-bit strings of unspecified encoding (char *, std::string), wide characters of unspecified width (wchar_t *, std::wstring, which are 16-bit UTF-16 on Windows and 32-bit UCS-4 on most other platforms), and explicit-width types (char8_t *, std::u8string, char16_t *, etc.), which are explicitly UTF-8, UTF-16, and UTF-32/UCS-4 (see the sketch after this list)
* Windows uses 16-bit wchar_t *, nominally UTF-16 but not enforced, as its native string type, and also provides 8-bit APIs whose character encoding (the active code page) can vary at runtime
* Linux and other Unix systems generally use 8-bit char * in an unspecified, locale-dependent encoding (which can vary at runtime) as their native string type; most systems these days use UTF-8 as that 8-bit encoding, but it's not guaranteed
* Java, JavaScript, and C# all use UTF-16 as their native string encoding, though many of their APIs (length, indexing, charAt-style access) still count 16-bit code units, which only really made sense for UCS-2 rather than UTF-16
* Python 2 used unspecified 8-bit encodings for the str type and an internal fixed-width representation covering all of Unicode (UCS-2 or UCS-4, depending on the build) for the unicode type. Python 3 changed this so that the bytes type is 8-bit with unspecified encoding and the str type is Unicode, stored in an internal representation of the interpreter's choosing; a bytes alias for str was added to Python 2 to ease the transition, and Python 3.3 re-allowed the u'...' literal prefix for the same reason
* Go and Rust use UTF-8 as the encoding of their native string types (Rust enforces that a String is valid UTF-8; Go strings are conventionally UTF-8, though they can hold arbitrary bytes)
* Most text-based file formats, such as HTML, have standardized on UTF-8, as backwards compatibility with ASCII is a big benefit for such formats.
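As a rough illustration of the explicit-width types from the first bullet above (this sketch assumes a C++20 compiler, since char8_t and std::u8string are C++20 additions):

    #include <cstdio>
    #include <string>

    int main() {
        std::string    s   = "hi";    // 8-bit char, encoding not specified by the language
        std::wstring   ws  = L"hi";   // wchar_t: 16-bit UTF-16 on Windows, 32-bit UCS-4 on most Unix
        std::u8string  s8  = u8"hi";  // char8_t, always UTF-8 (C++20)
        std::u16string s16 = u"hi";   // char16_t, always UTF-16
        std::u32string s32 = U"hi";   // char32_t, always UTF-32 (UCS-4)
        std::printf("%zu %zu %zu %zu %zu\n",
                    s.size(), ws.size(), s8.size(), s16.size(), s32.size());
    }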
There are a few good references on why UTF-8 is generally a better choice for new languages and APIs than UTF-16, despite the legacy use of UTF-16 in some of them:
https://doc.rust-lang.org/std/string/struct.String.html#meth...