Not true at all! Extended grapheme clusters are defined by Unicode for a reason and include relevant combining marks following a letter[1]. The point, more generally, is that a programming language shouldn't preferentially choose one character definition over another. The decision of whether to iterate by bytes, points, or clusters is a significant one which the language shouldn't force upon users. For many common operations, bytes are a sufficient representation, but then one must be precise about encoding. A list of UTF-8 bytes is very easy to deal with, but the bytes of a UTF-16 string are highly problematic: inserting a single one-byte character at the start of such a string would destroy its entire content. There is no situation where "give me the characters of this string" is a sufficiently precise statement, so it should not be made available by programming languages. Likewise, the idea of indexing a string is not well defined at all. The only consistent interface for accessing strings requires users to specify both encoding and segmentation, and in the general case this can only be done performantly with a linear scan.
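To make the bytes/points/clusters distinction concrete, here's a small sketch in Swift (chosen only because its String type happens to expose all three views; the literal is just an illustration):

    // The "length" of the same string under three different representations.
    let s = "e\u{0301}"            // "é" spelled as 'e' + COMBINING ACUTE ACCENT
    print(s.utf8.count)            // 3  UTF-8 bytes
    print(s.unicodeScalars.count)  // 2  code points
    print(s.count)                 // 1  extended grapheme cluster

None of these counts is the "real" one; which you want depends entirely on the operation.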
I think it's worth considering that application development and GUIs really aren't K's thing. For those, yes, you want to be pretty careful about the concept of a "character", but (as I understand it) in K you're more interested in analyzing numerical data, and string handling is just for slapping together a display or parsing user commands or something. So a method that lets the programmer use array operations they already have to know, instead of learning three different non-array ways to work with strings, is practical. Remember, potential users are already very concerned about the difficulty of learning the language!
I meant the combining mark point as a thing you would want to cut off; a 50-char chopped-off "summary" should never include a character with ten thousand combining marks. Of course it'd be preferable to cut before the cluster rather than in the middle of it, but certainly not after it, which is what you'd get by taking the first 50 extended grapheme clusters, the 20000-byte glyph counting as one. Point being, you still just want to use a library that has properly thought out the question. And that applies to most (all?) sane fully-Unicode-aware operations.
There are plenty of places where ASCII-only is a known expectation and per-char operations are meaningful; that's what a list of bytes provides. Indeed, you'd probably want to use another abstraction if you have non-ASCII, and for that you could use something that does whatever form of iteration or operation you want just fine, even if the input/output is a list of byte-chars representing plain UTF-8.
Well in that case, the way you get a 50-char summary is by iterating grapheme clusters, counting up to 50 points, and discarding the cluster that would be cut. It's quite trivial if the language exposes an interface for iterating both clusters and points, and without such an interface the problem is much harder to notice. Hence why the language shouldn't prefer clusters to points or points to clusters. It should expose all relevant representations without prejudice.
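For what it's worth, here's a sketch of that procedure in Swift, which exposes both iterations; the 50-point budget and the function name are just placeholders:

    // Walk grapheme clusters, budget by code points, and drop the cluster
    // that would be cut in the middle rather than emitting a broken one.
    func pointLimitedSummary(_ s: String, maxPoints: Int = 50) -> String {
        var out = ""
        var points = 0
        for cluster in s {                        // extended grapheme clusters
            let n = cluster.unicodeScalars.count  // code points in this cluster
            if points + n > maxPoints { break }
            out.append(cluster)
            points += n
        }
        return out
    }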
Even if ASCII is appropriate in some situation, this should be stated within the program. Requiring people to be explicit about the data they produce and consume is important and useful. A user might decide that UTF-16 best serves their needs (or be working on the Windows platform), in which case code that works with strings as linear sequences will be able to operate on their strings without issue. Code which assumes a UTF-8 byte representation will require the entire string to be allocated and converted, then reallocated and converted back. Huge overhead and potential incompatibility for no reason.
> It's quite trivial if the language exposes an interface for iterating both clusters and points, and without such an interface the problem is much harder to notice
I assure you, 99% of people won't handle this correctly even if given a cluster-based interface (if they even bother using it). And it still doesn't handle the problem that, in some languages, cutting in the middle of a word breaks the display of the part that remains (or of languages without space-based word boundaries to cut on). So the preferred thing is still to use a library.
I don't think anyone in k would use UTF-16 via a character list with 2 chars per code unit; an integer list would work much nicer for that (and most k interpreters should be capable of storing such a list with 16-bit ints; there's still some preference for UTF-8 char lists, though, namely that they get pretty-printed as strings); and you'd probably have to convert on some I/O anyway. Never mind the world being basically all-in on UTF-8.
Even if you have a string type that's capable of being backed by either UTF-8 or UTF-16, you'll still need conversions between them at some point; you'd want the Windows API calls to have a "str.asNullTerminatedUTF16Bytes()" or whatnot (lest a UTF-8-encoded string make its way there), which you can trivially have an equivalent of for a byte list. And I highly doubt the overhead of that conversion would matter anywhere you need a UTF-16-only Windows API.
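As a sketch of how small that boundary conversion is when the internal representation is UTF-8 (Swift again purely for illustration; the method name just mirrors the hypothetical one above):

    extension String {
        // Produce a null-terminated UTF-16 buffer for a UTF-16-only API call.
        func asNullTerminatedUTF16() -> [UInt16] {
            return Array(self.utf16) + [0]
        }
    }

It's a single linear pass at the call site, which is unlikely to show up next to the cost of the API call itself.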
I doubt all of those fancy operations you'll be doing will have optimized impls for every format internally either, so there are internal conversions too. If anything, I'd imagine that having a unified internal representation would end up better, forcing the user to push conversions to the I/O boundaries and allowing focus on optimizing a single type, instead of going back and forth internally or wasting time on multiple impls.
[1] http://unicode.org/reports/tr29/