In part it is motivated by my experience in Delphi with the move from the so-called AnsiString (encoded in the current Windows code page) to UnicodeString (encoded in UTF-16) between Delphi 2007 and Delphi 2009. I implemented the initial RTL routines and helped with some of the compiler support.
Delphi has multiple string types to handle all the backward-compatibility issues: ancient Pascal strings limited to 255 characters, AnsiString (current code page), WideString (a COM BSTR), UnicodeString, and things like Utf8String (a magic UTF-8 code page), etc. Assignments between the different string types perform conversions and cause warnings about possible data loss.
The more you learn about how strings work, including the international aspects, legacy aspects, OS-specific aspects, conversions at the source code -> executable -> runtime -> I/O boundaries, etc., the more you appreciate that the situation really isn't trivially reducible to simple lists of characters.
The question is: is having that many string types inherent to string handling, or are they more a residue of historical artifacts?
Some data represented as lists may carry extra constraints. If the elements have meaning in their order relative to each other, reversing such a list doesn't make sense. If every element needs to be a power of 2, putting 3 into such a list doesn't make sense. Yet having constraints doesn't mean the data needs a type distinct from lists: you can represent it as a general list and enforce the constraints in libraries. Why do strings need to differ? You may not want to reverse a string character by character, and arbitrary indexing into a string may not make sense, yet general list operations like fold, filter, and take-while may still be useful, as the sketch below shows.
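Here is a minimal Haskell sketch of that point, assuming the Prelude's view where String is just [Char]: every ordinary list combinator applies to strings without any string-specific machinery.

    import Data.Char (isDigit, isSpace, toUpper)

    -- String = [Char] in Haskell, so every list combinator applies.
    main :: IO ()
    main = do
      let s = "42 apples and 7 oranges"
      print (takeWhile isDigit s)        -- "42": take the leading number
      print (filter (not . isSpace) s)   -- drop all whitespace
      print (foldr (\c n -> if isDigit c then n + 1 else n) 0 s)  -- count digits
      print (map toUpper s)              -- uppercase, per code point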
Now, to avoid further confusion, let's separate the CES (character encoding scheme) from the CCS (coded character set). UTF-8, UTF-16, and UCS-4 are all CES variations of the Unicode CCS. They can be converted freely without loss of information, and there is no reason an implementation can't handle the conversion implicitly, aside from performance.
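To make the CES/CCS split concrete, here is a sketch using the widely used Haskell text package (Data.Text.Encoding): the CCS stays Unicode throughout, only the CES (the byte serialization) varies, and round-tripping loses nothing.

    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE
    import qualified Data.ByteString as BS

    main :: IO ()
    main = do
      let t   = "tête-à-tête" :: T.Text   -- one CCS: Unicode code points
          u8  = TE.encodeUtf8 t           -- CES: UTF-8 bytes
          u16 = TE.encodeUtf16LE t        -- CES: UTF-16 (little-endian) bytes
      print (BS.length u8, BS.length u16) -- byte counts differ per CES
      -- Both decode back to the same code points: no information is lost.
      print (TE.decodeUtf8 u8 == TE.decodeUtf16LE u16)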
If you need extra constraints, such as ASCII-only or length-limited, you can create a specialized type wrapping the basic string and implement the constraints there. That can be done at the user level and doesn't require such special types to be built in. E.g. if a legacy library only accepts ASCII strings, it takes an argument of type AsciiOnly [Char]. The check can be enforced at the point of type coercion.
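A user-level sketch of that AsciiOnly idea in Haskell; AsciiOnly, mkAscii, and callLegacy are illustrative names, not from any real library. Hiding the constructor means the checking function is the only way in.

    import Data.Char (isAscii)

    -- A constrained string type built entirely in user code.
    newtype AsciiOnly = AsciiOnly [Char]
      deriving Show

    -- The check is enforced at the point of coercion.
    mkAscii :: [Char] -> Maybe AsciiOnly
    mkAscii s
      | all isAscii s = Just (AsciiOnly s)
      | otherwise     = Nothing

    -- A hypothetical legacy binding that only accepts ASCII.
    callLegacy :: AsciiOnly -> IO ()
    callLegacy (AsciiOnly s) = putStrLn ("legacy: " ++ s)

    main :: IO ()
    main = do
      print (mkAscii "plain old ASCII")  -- Just (AsciiOnly ...)
      print (mkAscii "naïve")            -- Nothing: 'ï' is not ASCII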
Composed characters are a headache, but a library to deal with them can be built on top of a list of Unicode code points (if we adopt Unicode code points as Char). What's important here is that you can build an algorithm, say normalization, on top of the list-of-char view. If you, as a language user, want to try out a new normalization scheme, you can do so with all the list-manipulating tools at hand. And you can wrap the resulting normalized string in a specialized type if you want to preserve its integrity.
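A toy decomposition pass shows the shape of the idea: the two-entry table below is a stand-in for Unicode's real decomposition data, and this is nowhere near full NFD, but "normalization" reduces to an ordinary list transformation.

    -- Toy canonical decomposition over a list of code points.
    decompose :: Char -> [Char]
    decompose '\x00E9' = ['e', '\x0301']  -- é -> e + COMBINING ACUTE ACCENT
    decompose '\x00F1' = ['n', '\x0303']  -- ñ -> n + COMBINING TILDE
    decompose c        = [c]

    -- "Normalization" is just concatMap over the list-of-char view.
    toyNFD :: [Char] -> [Char]
    toyNFD = concatMap decompose

    main :: IO ()
    main = print (toyNFD "ma\x00F1ana caf\x00E9")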
Incompatible CCSes are more of a problem; e.g. there are several different conversions between Unicode and the Japanese character sets, and they have caused loss of information. I wish I could write CCS-neutral code, but eventually I have to deal with the differences. This could be handled by having distinct UnicodeChar and JISChar types, maybe.
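A sketch of that distinct-types idea; UnicodeChar, JISChar, and toJIS are hypothetical names, and the two-row table stands in for a real Unicode-to-JIS mapping, which is many-to-one in places (the WAVE DASH / FULLWIDTH TILDE confusion is a well-known example). The point is that conversion becomes explicit and partial, so the types force the caller to confront possible loss.

    newtype UnicodeChar = UnicodeChar Char  deriving Show
    newtype JISChar     = JISChar Int       deriving Show  -- JIS code point

    -- Explicit, partial conversion between incompatible CCSes.
    toJIS :: UnicodeChar -> Maybe JISChar
    toJIS (UnicodeChar '\x301C') = Just (JISChar 0x2141)  -- WAVE DASH
    toJIS (UnicodeChar '\xFF5E') = Just (JISChar 0x2141)  -- FULLWIDTH TILDE: collides!
    toJIS _                      = Nothing

    main :: IO ()
    main = do
      print (toJIS (UnicodeChar '\x301C'))
      print (toJIS (UnicodeChar '\xFF5E'))  -- same JIS code: a lossy round trip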
Actually, there is a more fundamental question: what is a character? Depending on the application, you may want a different unit to serve as the "character". I think that's a good argument against the "list-of-characters" view. So another approach is an opaque string type that can present different levels of abstraction depending on what the application asks for (e.g. it may return graphemes, fully composed characters, Unicode code points, etc.). That is plausible, although I'd argue it can still be implemented as a library on top of a list of basic characters (e.g. Unicode code points).
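A sketch of that opaque-type idea; OpaqueString, codePoints, and graphemes are invented names, and the grapheme splitter below only glues non-spacing combining marks to their base, nothing like the full UAX #29 segmentation rules. Internally it is still just a list of code points; consumers pick the abstraction level through view functions.

    import Data.Char (generalCategory, GeneralCategory(NonSpacingMark))

    -- An opaque string: internally a list of code points, but consumers
    -- choose the abstraction level they want through view functions.
    newtype OpaqueString = OpaqueString [Char]

    fromString :: String -> OpaqueString
    fromString = OpaqueString

    -- View 1: raw Unicode code points.
    codePoints :: OpaqueString -> [Char]
    codePoints (OpaqueString cs) = cs

    -- View 2: grapheme-like clusters (toy rule: base + combining marks).
    graphemes :: OpaqueString -> [String]
    graphemes (OpaqueString [])     = []
    graphemes (OpaqueString (c:cs)) =
      let (marks, rest) = span isMark cs
      in (c : marks) : graphemes (OpaqueString rest)
      where isMark m = generalCategory m == NonSpacingMark

    main :: IO ()
    main = do
      let s = fromString "e\x0301te\x0301"  -- "été" in decomposed form
      print (length (codePoints s))         -- 5 code points
      print (length (graphemes s))          -- 3 clusters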