
For what purposes would you use this?



Any purpose. Just about anything that "iterates over a string", in my opinion, is going to be more correct when considered at the code point level than at the code unit level. Some concrete examples:

- Truncating a string, perhaps to append "…" to it. I definitely don't want to truncate in the middle of a surrogate pair, or two UTF-8 code units into a three-code-unit sequence. In fact, code points aren't good enough here either, as I don't want to clip an accent if someone has the string ("e" + ACUTE_ACCENT).

- Determining the width of a string, for something like columns or pixels. Code units are meaningless here. Especially with regard to terminal columns, this problem comes up often. Are you rendering the output of a SQL query as a pretty ASCII table? Then you need to append spaces to a short string until you reach the "|", and far too many times I've seen this:

    | 1 | Some text in here. | another column |
    | 2 | résumé           | another value  |
MySQL does this. Here, the rendering is broken because the width function thought an acute accent took up a column. I've seen both the need for something higher-level than code points (as in this example, since some code points take up no space at all, typically because they combine), and cases where multi-byte UTF-8 strings simply got byte-counted. I've had a linter (pylint) tell me a line was too long — over 80 columns — when it was in fact closer to ~70 columns, and well under 80. The problem? `if 80 < len(line)`, where `line` is a bytestring. (A small sketch below makes the mismatch concrete.)
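To make that mismatch concrete, here's a minimal Python 3 sketch; the string is just an example written in NFD form, not taken from MySQL or pylint:

    import unicodedata

    # "résumé" written with combining accents (NFD form): 8 code points,
    # 10 UTF-8 bytes, but only 6 visible terminal columns.
    line = "re\u0301sume\u0301"
    raw = line.encode("utf-8")

    print(len(raw))   # 10 -- what len() on a bytestring reports (the pylint-style bug)
    print(len(line))  # 8  -- code points: closer, but still not columns
    cols = sum(1 for ch in line if not unicodedata.combining(ch))
    print(cols)       # 6  -- rough column count: skip combining marks (ignores wide CJK, etc.)

    # Truncating by bytes can land in the middle of a multi-byte sequence:
    try:
        raw[:3].decode("utf-8")   # cuts the 2-byte U+0301 accent in half
    except UnicodeDecodeError as err:
        print("broken truncation:", err)

    # Truncating by code points avoids that, but can still strip an accent:
    print(line[:2])   # "re" -- the combining accent on the first "e" is lost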

My point is that I struggle to come up with an example that is more conceptually valid at the code unit level than at the code point level. Because of this, code points are the _bare minimum_ level of abstraction language designers should deliver to programmers. As many of my examples show, even that might be woefully insufficient, which makes it all the worse to have to work with code units.


Well yes, the fact that code points aren't enough, as your first example demonstrates, is part of my point.

Where I'm coming from is this: about a week ago I thought the same as you do. Since then I've been researching how to move a big MBCS codebase (Visual Studio speak for 'std::string = array of chars', mostly) to Unicode (i.e. Microsoft speak for 'use UTF-16 encoding for strings internally, and call all the UTF-16 APIs, rather than the char-based ones which expect 8-bit strings encoded in the current code page'). (I'm only explaining this because I don't know whether you have experience with how Visual C++ and Windows handle these things.)

My conclusion is that the people at utf8everywhere.org propose the least wrong approach, which is to use std::string for everything in C++, and assume that the encoding is utf-8.

What's the relationship to this discussion? Well, when you do the above, std::string::length() no longer returns the number of 'characters', just the size in bytes. The same goes for iteration: it iterates over bytes.

Why do I think this is not as big a problem as I thought it was a week ago: the circumstances where you need to iterate over 'characters' (note that 'character' doesn't really mean anything; what you need in your examples isn't iterating over code points, it's iterating over grapheme clusters) are few and far between.

Neither of your examples would be fixed by a string that lets you iterate over code points. What you need instead is a way to ask your rendering engine (albeit in this case, the console): 'when you render this sequence of bytes, how wide will the result be?' (in the units of your output device, be it a fixed-width device like a console, or a variable-width one when rendering in a GUI).
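For the console case, here's a rough Python sketch of that 'ask the output device' idea, using the third-party wcwidth package; the pad_cell helper and the cell texts are made up for illustration, and wcwidth only approximates what a terminal will actually render:

    # Pad a table cell by display columns rather than by len().
    # Requires the third-party wcwidth package (pip install wcwidth).
    from wcwidth import wcswidth

    def pad_cell(text, columns):
        width = wcswidth(text)
        if width < 0:            # wcswidth returns -1 for non-printable input
            width = len(text)    # crude fallback
        return text + " " * max(0, columns - width)

    print("| " + pad_cell("Some text in here.", 18) + " |")
    print("| " + pad_cell("résumé", 18) + " |")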

On the other hand, when working with strings, what you do often need is the size in bytes: for allocating buffers, for letting others know the 'size' of the string (which they need to know, at the least, in bytes), etc.

So:

- You always need to know the 'size' in bytes, and that has a fixed meaning.

- You very seldom need to know the 'size' in code points, or to iterate over them. For the cases where you do, you can use an external library (see the sketch below). Of course it would be convenient if that functionality were baked in, but where do you draw the line? Code-point-level access is very rare.

- You sometimes need to know the 'size' in grapheme clusters, but there is no fixed definition of 'size' in that context; it depends on your output device. No string class or type can account for that. At best you can say 'I want to know the number of units if this string were rendered in a fixed-width font', which is sensible, but not only very complex, it's also asking (imo) too much of a string representation that is to be used in today's programming languages and library ecosystems.
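As one example of leaning on an external library for the grapheme-cluster case, here's a small Python sketch using the third-party regex module (one option among several; the string is just an illustration):

    # Grapheme-cluster iteration via the third-party `regex` module, which
    # supports \X (extended grapheme cluster); the stdlib `re` module does not.
    import regex

    s = "re\u0301sume\u0301"                  # "résumé" with combining accents
    clusters = regex.findall(r"\X", s)

    print(len(s))          # 8 code points
    print(len(clusters))   # 6 grapheme clusters
    print(clusters[:2])    # ['r', 'é'] -- the accent stays attached to its base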

So while I feel your pain, I think what you're asking for is neither realistic nor, today, even very useful. At some point, 10 or 20 years down the road, when full 'Unicode' support in all software everywhere is the default, maybe yes.

(As an aside: when reading up about this over the last week, I looked at e.g. the Swift String API - https://developer.apple.com/library/prerelease/ios/documenta... - and felt a bit jealous. So this is probably a first step towards the bright future I mentioned above, but it still has some weirdnesses that nobody used to 'char = 1 byte, only 7-bit ASCII allowed' strings would expect.)


I agree with you, but I think it proves my point.

I completely agree that it would be best if I could just ask "how wide is this string?", and not worry about code points or grapheme clusters at all. That'd be amazing. But I can't. So I need to iterate over grapheme clusters, but I can't. So, to do that, I need to iterate over code points, but I can't. So, to get around that, I have to decode code units to code points manually, then build all the way back up to solve the aforementioned problems, every single time they come up. It's a PITA, because Unicode support is so piss poor in so many languages.
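For what it's worth, that 'decode code units to code points manually' step ends up looking roughly like this. A Python sketch purely for illustration, since Python 3's bytes.decode already does it (and with proper error handling); the point is that this is what you hand-roll in a language that only gives you code units:

    # Minimal UTF-8 decoder: bytes (code units) in, code points out.
    def decode_utf8(data):
        i, n = 0, len(data)
        while i < n:
            b0 = data[i]
            if b0 < 0x80:                      # 1-byte sequence (ASCII)
                cp, length = b0, 1
            elif 0xC0 <= b0 < 0xE0:            # 2-byte sequence
                cp, length = b0 & 0x1F, 2
            elif 0xE0 <= b0 < 0xF0:            # 3-byte sequence
                cp, length = b0 & 0x0F, 3
            elif 0xF0 <= b0 < 0xF8:            # 4-byte sequence
                cp, length = b0 & 0x07, 4
            else:
                raise ValueError("invalid leading byte at offset %d" % i)
            if i + length > n:
                raise ValueError("truncated sequence at offset %d" % i)
            for b in data[i + 1:i + length]:
                if b & 0xC0 != 0x80:
                    raise ValueError("invalid continuation byte")
                cp = (cp << 6) | (b & 0x3F)
            yield cp
            i += length

    print(list(decode_utf8("é".encode("utf-8"))))   # [233] == [ord("é")]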

Or I'm a user, and the experience is just poor because the coder couldn't be bothered to do it right, most likely because it's so difficult.

To some extent, I'm sure there are libraries (is there a library for terminal output width?), but often it's coder ignorance that results in them not getting used. There'd be more awareness if the API forced you to choose the appropriate type of sequence: you'd be forced to think (and maybe seek out a library to avoid thinking). Instead, the default is often wrong.

> On the other hand, when working with strings, what you do often need is the size in bytes

And for this, I'm thankful. Most of the time, it doesn't matter. But when it does, you're in for a world of hurt.



