There are multiple ways of counting "length" of a string. Number of UTF-8 bytes,...

toast0 · on March 9, 2017

> Perhaps someone can shed light on a compelling use case for knowing the number of grapheme clusters in a particular string, because I haven't been able to think of one.

If you have a limit on the length of a field, it helps to tell the user what it is in a way they understand. For non-technical users, bytes (and the embedded issue of encoding) and code points are both pretty esoteric, but number of symbols is less so. OTOH, SMS has strict data and encoding limits, and people managed with that; also, provisioning byte storage for grapheme-limited fields is hard: some graphemes use a ton of code points, and family emoji and zalgo text are clear examples.
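As a rough illustration of how far these measures can diverge, here is a minimal Python sketch using the family emoji mentioned above. The helper name `length_measures` is made up for this example. The standard library has no grapheme segmenter, so the grapheme count is left out; a renderer with emoji support typically shows this string as a single symbol.

```python
# Family emoji: four person emoji joined by three ZERO WIDTH JOINERs (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"


def length_measures(s: str) -> dict:
    """Return several common notions of 'length' for the same string."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        # utf-16-le avoids the byte-order mark; each code unit is 2 bytes.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        # Python's len() on str counts code points.
        "code_points": len(s),
    }


measures = length_measures(family)
print(measures)
# {'utf8_bytes': 25, 'utf16_code_units': 11, 'code_points': 7}
```

One user-visible symbol, 25 bytes of storage: that is the provisioning problem in a nutshell.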

paulddraper · on March 10, 2017

Why do you have a limit on the length of a field?

So it can fit in a database, i.e. with a certain number of bytes?

Sean1708 · on March 10, 2017

If that's why you have a limit then please go and change that immediately.

No, this post is talking about having a minimum length on the password for safety reasons (i.e. a limit on the minimum entropy). You're right that a minimum byte length will ensure this, but what happens when your user types in n-1 "things" but their password gets accepted anyway? That's only a minor thing, but (and I'm not entirely sure whether this is possible) what about when your user types in n "things" but the password is actually only n-1 bytes? Now the password won't be accepted and the user has no idea why.

I agree that these are relatively trivial things, but the point is that it's not as simple as "just use the byte length".

toast0 · on March 10, 2017

Some limits are technical (and in that case the hard limit is often bytes, but sometimes code units or code points, or broken if you told MySQL utf8 instead of bytes or utf8mb4), but in many cases, the limits are for aesthetic purposes: a post title or a username is often required to be fairly short to look nice; in an ASCII or Latin-1 world, those limits are usually expressed in terms of characters, but graphemes might be the right thing to limit in a Unicode world.
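For the MySQL case: the legacy `utf8` charset stores at most three bytes per character, so it cannot hold code points above U+FFFF (most emoji, for example). A minimal sketch of a pre-insert check follows; the function name `fits_in_3_byte_utf8` is hypothetical, not part of any MySQL client library.

```python
def fits_in_3_byte_utf8(s: str) -> bool:
    """True if every code point encodes to at most 3 UTF-8 bytes.

    Code points up to U+FFFF need at most 3 bytes in UTF-8;
    anything above needs 4 and would not fit a 3-byte-per-character column.
    """
    return all(ord(ch) <= 0xFFFF for ch in s)


print(fits_in_3_byte_utf8("na\u00efve caf\u00e9"))  # accented Latin letters
print(fits_in_3_byte_utf8("hi \U0001F600"))         # contains U+1F600
```

The first call should report that the string fits, the second that it does not.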

kmill · on March 10, 2017

"Your username must be 1-4cm when printed with 12pt Times New Roman."

I kind of like the idea of minimum length in cm as a password requirement.

martin-adams · on March 10, 2017

What about "Your username must be no longer than 3 seconds when spoken out loud"

or, "Your username must not take more than 0.001ml of ink when printed at 12pt"

desdiv · on March 10, 2017

Without a limit on password length, an attacker can DOS you by forcing you to run your KDF on gigabyte-sized strings.
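A minimal sketch of that guard, assuming PBKDF2 from Python's `hashlib` as the KDF. The cap `MAX_PASSWORD_BYTES`, the function name `derive_key`, and the iteration count are illustrative choices, not recommendations. Note that the cap is in bytes, since that is what the KDF actually consumes.

```python
import hashlib
import os

# Hypothetical cap; generous for real passphrases, tiny next to a gigabyte.
MAX_PASSWORD_BYTES = 1024


def derive_key(password: str, salt: bytes, iterations: int = 100_000) -> bytes:
    """Reject oversized input before spending any KDF work on it."""
    raw = password.encode("utf-8")
    if len(raw) > MAX_PASSWORD_BYTES:
        raise ValueError("password too long")
    return hashlib.pbkdf2_hmac("sha256", raw, salt, iterations)


salt = os.urandom(16)
key = derive_key("correct horse battery staple", salt)
print(len(key))  # PBKDF2 with SHA-256 defaults to a 32-byte key
```

In practice the request body size would also be capped before the password is even parsed, so the oversized string never reaches this function.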

paulddraper · on March 10, 2017

Giga byte sized strings?

Oh, no. That doesn't make sense. You need to limit by Giga grapheme strings.

geocar · on March 10, 2017

They're only denying service to themselves if you run the KDF locally.

jfoutz · on March 9, 2017

It's a lot like equality. Same pointer? Same value? p and q point to different nodes in a circular list. Does p equal q?

Semantics matter a lot.
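A small Python sketch of the circular-list question, with a hypothetical `Node` class: `p` and `q` are different nodes, hold equal values, and look identical no matter how far you walk from either one. Which of those counts as "equal" depends on what you need.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = self  # a single node starts as a one-element cycle


# Two distinct nodes with the same value, linked into a circular list.
a, b = Node(1), Node(1)
a.next, b.next = b, a
p, q = a, b

same_pointer = p is q            # identity: are they the same node?
same_value = p.value == q.value  # shallow value comparison


def unrolled(node, steps):
    """Values seen when walking `steps` nodes starting from `node`."""
    out = []
    for _ in range(steps):
        out.append(node.value)
        node = node.next
    return out


# Observational comparison: nothing visible from p differs from q.
same_unrolling = unrolled(p, 10) == unrolled(q, 10)

print(same_pointer, same_value, same_unrolling)
# False True True
```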

kmill · on March 9, 2017

To expand on this point, one resolution to the Ship of Theseus problem is that the point at which the ship stops being the "same" ship depends on how you are going to define "same." "Same" could mean different things depending on what you are trying to do, so this isn't just an it's-just-semantics cop-out. In particular, to borrow something Ravi Vakil once said, a definition is worthless unless it has a use (which in his case, as a mathematician, means it can be used to uncover and prove a theorem). This is what I have in mind: I do not think it is worthwhile to worry about "the true length of a Unicode string" unless there is something you could do if only you could compute it, and I've been trying to think of something but have come up short.

Speaking of equality: in a lecture about logic I once gave, I asked the students whether {1,2} and {1,2} were the same. In a very real sense, they are different because I drew them (or typed them) in different places and slightly differently -- I promise I typed the second {1,2} with different fingers. But, through the lens of same-means-same-elements, they are the same. That is a warmup for {1,2} vs {1,1,2}, and {1,2} vs {n : n is a natural number and 1 <= n <= 2}.

(There's also kind of a joke about how my set of natural numbers might be red and your set of natural numbers might be blue, but the theory of sets doesn't care about the difference.)
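The set comparison above carries over directly to Python, as a quick sketch: the four objects below are all distinct in memory, yet equal under same-means-same-elements. The bound of 100 in the comprehension is an arbitrary stand-in for "the natural numbers".

```python
first = {1, 2}
second = {1, 2}      # "typed with different fingers"
repeated = {1, 1, 2}
# Finite stand-in for {n : n is a natural number and 1 <= n <= 2}.
comprehension = {n for n in range(100) if 1 <= n <= 2}

distinct_objects = first is not second
all_equal = first == second == repeated == comprehension

print(distinct_objects, all_equal)
# True True
```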

kccqzy · on March 10, 2017

That forgotten use might be a special string sorting algorithm such as LSD radix sort. Or it could be a trie where the input strings have many common prefixes.