I'd say all of the above objections don't address any of the actual issues (which stem from there being multiple, similar metrics one wants regarding the size in bytes, characters, normalized characters, etc. of a Unicode string), while introducing extra semantic bike-shedding!
Whether the metric is called "length", or "size", or whatever, is irrelevant, and length for strings and arrays and lists is so well entrenched and understood that the objections don't make sense.
The actual problem is not that "string length" isn't "really" a length in the way that a leg or a wall has length, but that for Unicode strings it's difficult to calculate, and confusing, unless one understands several Unicode implementation mechanics...
The name "length" itself, for example, or the measuring of said length, was never an issue with ASCII strings (the issue there was remembering to add/subtract the NUL at the end).
It doesn’t matter how well you understand Unicode, it is impossible to compute “the length of a string” because there is no single metric that means that.
It’s like measuring “the size of a box.” I want volume, you want maximum linear dimension, UPS wants length plus width plus height. The problem with “the size of a box” isn’t that it’s hard to measure, it’s that it doesn’t exist. Imagine your favorite language has a Box type with a “size” property. What does it return? How likely is it that the thing it measures is the thing you want?
Of course it was never a problem for ASCII, because ASCII is structured to make most of the measurements people care about be the same value. I want bytes, you want code points, he wants “characters,” doesn’t matter, they’re all the same number.
ASCII is also incapable of representing real world text with good fidelity. It’s inherently impossible to remedy that while maintaining a singular definition of “length.” If you value length measurement over fidelity, you can keep using ASCII.
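To make that concrete, here's a minimal sketch (Python 3, standard library only, with a purely illustrative example string) of how one short string already yields several different "lengths" depending on which metric you ask for:

    import unicodedata

    s = "e\u0301\U0001F1FA\U0001F1F8"   # 'e' + combining acute accent + the US-flag pair

    print(len(s))                                # 4  code points
    print(len(s.encode("utf-8")))                # 11 bytes as UTF-8
    print(len(s.encode("utf-16-le")))            # 12 bytes as UTF-16 (the flag needs surrogate pairs)
    print(len(unicodedata.normalize("NFC", s)))  # 3  code points once 'e' + accent compose
    # Grapheme clusters (an accented 'e' plus one flag) would be 2,
    # but counting those needs a third-party library.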
>It doesn’t matter how well you understand Unicode, it is impossible to compute “the length of a string” because there is no single metric that means that.
That's already covered in my comment ("[the problem] stems from there being multiple, similar metrics one wants").
Whether we call all of them "length" or give each a specialized name is not the real problem.
The real problem is you need to know what you want of each, and some of them (e.g. regarding normalization, decomposition, and so on) can be hard to grasp.
In 99% of cases people want to know either "how many bytes" or "how many discrete character glyphs of final output" (even if they have combining diacritics etc.).
It's really rare to care about the number of glyphs. That's not something you can answer for a string in isolation, anyway; it depends on the font being used to render it. The only code that would care about this would be something like a text rendering engine allocating a buffer to hold glyph info.
I suspect you mean the number of grapheme clusters, which is Unicode's attempt to define something that lines up with the intuitive notion of "a character." This is basically a unit that your cursor moves over when you press an arrow key.
However, it's pretty uncommon to want to know the number of grapheme clusters too. Lots of people think they want to know it, but I struggle to come up with a use case where it's actually appropriate. An intentionally arbitrary limit like Tweet length is the best I can think of.
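For completeness, here's a hedged sketch of counting grapheme clusters, leaning on the third-party `regex` module, whose \X pattern matches one extended grapheme cluster (the 280 figure is just the Tweet-style arbitrary limit mentioned above):

    import regex  # third-party; pip install regex

    def grapheme_count(s: str) -> int:
        # "What a cursor skips over": one extended grapheme cluster per \X match.
        return len(regex.findall(r"\X", s))

    print(grapheme_count("e\u0301"))               # 1, even though it's 2 code points
    print(grapheme_count("\U0001F1FA\U0001F1F8"))  # 1 flag: 2 code points, 8 UTF-8 bytes
    print(grapheme_count("hello world") <= 280)    # True -- the arbitrary-limit use case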
"How many bytes" is ambiguous. Do you mean UTF-8, UTF-16, UTF-32, or something else?
There are a lot of different ways to answer the question, "how long is this string?"
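As a quick illustration of that ambiguity (the string is just an example):

    s = "naïve 🙂"

    print(len(s))                      # 7 code points
    print(len(s.encode("utf-8")))      # 11 bytes
    print(len(s.encode("utf-16-le")))  # 16 bytes
    print(len(s.encode("utf-32-le")))  # 28 bytes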
You did mention similar metrics, but you then went on to say that the objections don't make sense and that the actual problem is that length for a Unicode string is difficult to calculate.
My point is that the difficulty of calculating a length is not the problem. It's annoying, but people have written the code to do it and there's rarely any reason to write it yourself. Just call into whatever library and have it do the work. The problem is that you have to know what kind of question to ask so you can make the call that will actually give you the answer that you need. And that is not the sort of thing that can be wrapped up in a nice little API.
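A hedged sketch of what that looks like in practice (the helper names are only illustrative, and grapheme clusters again use the third-party `regex` module): both truncations below are trivial library calls, but they answer different questions, and picking the wrong one is the actual bug.

    import regex  # third-party

    def truncate_to_bytes(s: str, max_bytes: int) -> str:
        # Fit a UTF-8 byte budget, e.g. a fixed-size storage column.
        return s.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")

    def truncate_to_graphemes(s: str, max_graphemes: int) -> str:
        # Fit a user-visible "character" budget, e.g. a display limit.
        return "".join(regex.findall(r"\X", s)[:max_graphemes])

    s = "e\u0301\U0001F1FA\U0001F1F8!"
    print(truncate_to_bytes(s, 3))       # keeps 'e' + accent; the flag doesn't fit
    print(truncate_to_graphemes(s, 2))   # keeps the accented 'e' and the flag; drops '!'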
Apparently I misremembered slightly: it's actually length + 2 x width + 2 x height.
That link doesn't seem to explain the why, but my understanding is that it's just a decent heuristic for the general difficulty of handling packages as they go through the system. Volume wouldn't be appropriate, because a really long, skinny box is harder to handle than a cube of the same volume.
How do you decide which dimension is the width versus the length? I assume height is often significant for packages containing things that shouldn't be turned upside down, but length versus width seems pretty arbitrary. Is width just assumed to be the shorter of the two dimensions?
Length is defined as the longest side. The other two sides are interchangeable so pick what you like. This measure doesn’t appear to account for packages that require a certain orientation.
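In code form, a small sketch of the measure as described here (the dimensions and the 165-inch threshold are only illustrative):

    def length_plus_girth(a: float, b: float, c: float) -> float:
        # "Length" is the longest side; the other two each count twice (the girth).
        length, width, height = sorted((a, b, c), reverse=True)
        return length + 2 * width + 2 * height

    print(length_plus_girth(40, 20, 10))         # 40 + 2*20 + 2*10 = 100
    print(length_plus_girth(40, 20, 10) <= 165)  # True -- an illustrative size cap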