> a way to not allow byte slice functionality on a thing that is clearly not a byte slice
This already exists in the form of structs or opaque types. Both of these approaches would end up being implemented in "userspace" anyways, whether that's standard library or third-party.
However, (UTF-8) strings are byte slices. You can do simple manipulation with them as byte slices safely and validly. Split on spaces? Sure. Tokenize? Sure. Find substring? Sure. You can't do things that depend on, say, grapheme clusters, but you can safely do most things that depend on bytes. For most purposes, treating strings as byte slices is safe and correct.
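To make that concrete, a minimal Rust sketch (str in Rust is already a UTF-8 byte slice; find_bytes is just an illustrative helper):

```rust
/// Naive byte-level substring search. Works on valid UTF-8 because
/// UTF-8 is self-synchronizing: a byte-level match is a text-level match,
/// as long as both sides use the same byte representation.
fn find_bytes(haystack: &str, needle: &str) -> Option<usize> {
    let (h, n) = (haystack.as_bytes(), needle.as_bytes());
    if n.is_empty() {
        return Some(0);
    }
    h.windows(n.len()).position(|w| w == n)
}

fn main() {
    let text = "héllo wörld";
    // Splitting on an ASCII space never cuts a multi-byte character in half.
    let words: Vec<&str> = text.split(' ').collect();
    assert_eq!(words, vec!["héllo", "wörld"]);
    // Byte-level find agrees with the standard library's str::find.
    assert_eq!(find_bytes(text, "wörld"), text.find("wörld"));
}
```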
Doing find-substring via find-byte-subsequence won't behave correctly in many cases, because semantically equivalent strings can have multiple different byte-sequence representations. Treating strings as byte slices exposes a lot of footguns; it shouldn't be easy, just as, e.g., treating floating-point numbers as byte sequences shouldn't be easy.
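A quick sketch of the kind of mismatch I mean (the two literals are just the common precomposed and decomposed encodings of "é"):

```rust
fn main() {
    // Both strings display as "café", but their byte sequences differ:
    let precomposed = "caf\u{e9}";  // U+00E9
    let decomposed = "cafe\u{301}"; // 'e' followed by U+0301 combining acute accent
    assert_ne!(precomposed.as_bytes(), decomposed.as_bytes());
    // A byte-subsequence search therefore misses a match a human would expect:
    assert!(decomposed.find(precomposed).is_none());
}
```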
Technically the shortest UTF-8 representation is _the_ representation and _correctly_normalized_ Unicode is uniquely represented, but fair enough: unknown input may be slightly malformed. Complexities like this are why one shouldn't underestimate the nuances (and runtime costs!) of implementing proper Unicode. As for representing text as byte sequences, that is the most basic way to represent strings of text without placing any assumptions on them. It's the assumption of potentially incorrect invariants that's the issue. If you have the faculties to handle Unicode correctly (and very few languages do), then something more opaque may be a better fit than a byte slice.
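On the "shortest representation" point, a quick Rust sketch: validation rejects overlong encodings outright, which is part of what makes the canonical byte form well defined, and part of the cost of accepting unknown input:

```rust
fn main() {
    // '/' encoded the short (valid) way, and as an overlong two-byte sequence.
    let valid = [0x2F_u8];
    let overlong = [0xC0_u8, 0xAF];
    assert!(std::str::from_utf8(&valid).is_ok());
    // Overlong encodings are malformed UTF-8 and must be rejected;
    // this is exactly the kind of check unknown input has to pay for.
    assert!(std::str::from_utf8(&overlong).is_err());
}
```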
> Technically the shortest UTF-8 representation is _the_ representation and _correctly_normalized_ Unicode is uniquely represented
Not necessarily the shortest (NFC means not using composed characters from later revisions of the standard), and you only get a normalised representation if you've actually normalised it - if you've just accepted and maybe validated some UTF-8 from outside then it probably won't be in normalized form. IMO it's worth having separate types for unicode strings and normalized unicode strings, and maybe the latter should expose more of the codepoint sequence representation, but I don't know if any language implements that.
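Roughly what I have in mind, as a Rust sketch (the type names are made up, and normalization itself leans on the third-party unicode-normalization crate):

```rust
use unicode_normalization::UnicodeNormalization;

/// Any valid UTF-8, as accepted from the outside world.
struct UnicodeString(String);

/// Invariant: contents are in NFC. Normalization happens only here,
/// so downstream comparisons can be plain byte comparisons.
struct NormalizedString(String);

impl UnicodeString {
    fn normalize(self) -> NormalizedString {
        NormalizedString(self.0.as_str().nfc().collect())
    }
}

impl NormalizedString {
    /// Safer to expose: for two NFC strings, byte equality
    /// corresponds to canonical equivalence.
    fn as_bytes(&self) -> &[u8] {
        self.0.as_bytes()
    }
}

fn main() {
    let raw = UnicodeString(String::from("cafe\u{301}")); // 'e' + combining accent
    let norm = raw.normalize();
    assert_eq!(norm.as_bytes(), "caf\u{e9}".as_bytes()); // precomposed U+00E9
}
```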
> it shouldn't be easy just as e.g. treating floating-point numbers as byte sequences shouldn't be easy.
That's a nice analogy.
> Doing find-substring via find-byte-subsequence won't behave correctly in many cases, because semantically equivalent strings can have multiple different byte-sequence representations.
Unfortunately that's nearly impossible to do sanely in the general case, no matter how the string is represented.
I'm curious: what would be a good reason why treating floating-point numbers as byte sequences should be any harder than the minimum needed to make it obvious that's what you're doing (provided their binary format is well defined)?
There are footguns in making that representation easy to access. E.g., if you hash the byte sequence to use floats as hash-table keys, it will almost work, but you'll hit a very subtle bug because 0 and -0 hash differently even though they compare equal. And frankly, for most of the things you'd do with the byte sequence, there are more semantically correct ways to do them. There should be a way to access that representation, but it shouldn't be something you'd stumble into doing accidentally, IMO.
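The 0 vs -0 case as a quick Rust sketch (to_bits/to_ne_bytes expose the raw representation):

```rust
fn main() {
    let (a, b) = (0.0_f64, -0.0_f64);
    // The two values compare equal under IEEE 754...
    assert!(a == b);
    // ...but their bit/byte representations differ, so hashing the raw
    // representation puts "equal" keys into different buckets.
    assert_ne!(a.to_bits(), b.to_bits());
    assert_ne!(a.to_ne_bytes(), b.to_ne_bytes());
}
```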
You are talking about what stringy things can be done with byte slices and I'm talking about all the byteslicy things that shouldn't be done with strings.
Like subslicing. And accessing individual bytes in it.
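For example, a quick sketch of the subslicing footgun:

```rust
fn main() {
    let s = "héllo";
    // Byte index 2 lands in the middle of the two-byte 'é'.
    let chopped = &s.as_bytes()[..2];
    // The result is not valid UTF-8 on its own, so it isn't a string any more.
    assert!(std::str::from_utf8(chopped).is_err());
    // Indexing byte 1 gives 0xC3, the lead byte of 'é', not a character.
    assert_ne!(s.as_bytes()[1], b'e');
}
```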