Partially copying a 2-byte character?

saagarjha · on July 15, 2024

UTF-8 works fine if you truncate a codepoint because the encoding scheme lets you detect this. The problem is more subtle than that (hint: it involves a 1-byte codepoint).
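To make the "lets you detect this" claim concrete, here is a minimal sketch in C. It relies on the fact that UTF-8 lead bytes announce the sequence length and continuation bytes always look like `10xxxxxx`. The function name `utf8_incomplete_tail` is made up for illustration, and the sketch assumes the bytes before the cut were valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: returns how many bytes at the end of buf[0..len)
 * belong to a cut-off UTF-8 sequence, or 0 if the buffer does not end in
 * the middle of one. Assumes everything before the cut was valid UTF-8. */
size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    size_t i = len;
    size_t back = 0;

    /* Walk back over at most 3 continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return 0; /* no lead byte in range; cannot call this a truncation */

    unsigned char lead = buf[i - 1];
    size_t need;
    if (lead < 0x80)
        need = 1;                       /* ASCII */
    else if ((lead & 0xE0) == 0xC0)
        need = 2;                       /* 110xxxxx */
    else if ((lead & 0xF0) == 0xE0)
        need = 3;                       /* 1110xxxx */
    else if ((lead & 0xF8) == 0xF0)
        need = 4;                       /* 11110xxx */
    else
        return 0; /* not a lead byte: malformed rather than truncated */

    size_t have = back + 1;             /* lead byte plus continuations seen */
    return have < need ? have : 0;
}
```

A caller that gets a nonzero result can drop that many trailing bytes, or replace them with U+FFFD, before handing the string to anything else.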

lalaland1125 · on July 15, 2024

Truncating a UTF-8 codepoint is not fine, because most software is not tested with partially broken UTF-8, so international users will likely run into many bugs.

This matters especially because concatenation is a very common operation, so those sliced codepoints will end up everywhere, including in the middle of text.

saagarjha · on July 15, 2024

Morally I view “what do I do with my truncated string” to be a separate issue from “how do I truncate the string” as described in the article. Like, yes, you absolutely should not concatenate after doing this operation. But maybe you shouldn’t be showing the user a truncated string either even if it’s all ASCII. The question of “did you make an unparseable UTF-8 string” is answered with “no” and the more complicated but also more interesting question of “did you actually want this” remains unanswered.

Levitating · on July 15, 2024

This is fair, the article takes truncating a string to fit in a status bar as an example.

actionfromafar · on July 15, 2024

Also consider that Unicode is not only international characters, but also superscripts and other stuff ♥ᵃ

a: wasn't there a list somewhere of which characters Hacker News allows?

Retr0id · on July 15, 2024

If you're alluding to NUL, I don't really see the issue?

Yes, many languages allow strings (UTF-8 or otherwise) to contain null bytes, and C's str*() functions obviously do not, but null-termination vs not is an orthogonal issue to ASCII vs UTF-8.

i.e. Yes it's (depending on context) an issue that C str*() cannot handle strings with embedded null bytes, but that's not a UTF-8-specific issue.
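A small sketch of why this is orthogonal to the encoding: the same embedded NUL hides the tail of an ASCII buffer and of a UTF-8 buffer in exactly the same way. The buffer names and the helper `bytes_seen_by_str_functions` are invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* Two 6-byte payloads with a NUL in the middle: one pure ASCII, one that
 * starts with "é" (C3 A9 in UTF-8). The literal is split after \xA9 so the
 * hex escape does not swallow the following 'a'. */
static const char ascii_with_nul[] = "abc\0de";
static const char utf8_with_nul[]  = "\xC3\xA9" "a\0de";

/* Hypothetical helper: how many of the n payload bytes the str*() family
 * would treat as the string, i.e. everything before the first NUL. */
size_t bytes_seen_by_str_functions(const char *buf, size_t n)
{
    const char *nul = memchr(buf, '\0', n);
    return nul ? (size_t)(nul - buf) : n;
}
```

Both buffers carry 6 payload bytes (`sizeof buf - 1`), and in both cases the helper and `strlen` stop at the embedded NUL, regardless of whether the bytes before it are ASCII or multi-byte UTF-8.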

Pesthuf · on July 15, 2024

A function that can turn a correctly formatted UTF-8 string into a malformed UTF-8 string is, in my opinion, broken.
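For comparison, a truncation that never turns valid UTF-8 into malformed UTF-8 can be sketched in a few lines of C. This is an illustration under assumptions, not anyone's actual implementation: `utf8_truncate_len` is a hypothetical name, the input is assumed to be valid UTF-8, and the function only keeps code points whole (it does nothing for grapheme clusters).

```c
#include <stddef.h>

/* Hypothetical helper: largest length <= max_bytes at which s[0..len) can
 * be cut without splitting a code point. Assumes s is valid UTF-8. */
size_t utf8_truncate_len(const unsigned char *s, size_t len, size_t max_bytes)
{
    if (len <= max_bytes)
        return len; /* already fits, nothing to cut */

    size_t cut = max_bytes;
    /* s[cut] is the first byte that would be dropped. If it is a
     * continuation byte (10xxxxxx), the cut lands inside a sequence,
     * so back up until the dropped part starts at a lead byte. */
    while (cut > 0 && (s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}
```

The caller then copies `cut` bytes instead of `max_bytes`. The result may be shorter than the budget by up to three bytes, which is the price of staying well-formed.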

saagarjha · on July 15, 2024

One problem here is that the string may not have been a correctly formatted UTF-8 string to begin with. No, not that it can contain any bytes; I mean that more might be expected of it than just decoding correctly. Maybe it is supposed to have the grapheme clusters preserved. Maybe the truncation should peel off the last file component because the string holds a path. The operation of “doing a dumb truncation” can be broken in plenty of ways depending on how you look at it, and I don’t disagree with you, but I do want to make clear that the issue isn’t that memcpy is breaking it but that if you need x, y, z maybe you’re reaching for the wrong tool. And conversely there is nothing inherently wrong with using it if you are going to use the result in a way that is resilient to that kind of truncation.

account42 · on July 15, 2024

What about a function that can turn a correctly spelled English sentence into a malformed English sentence? If you truncate to a fixed length this comes with the territory.

MathMonkeyMan · on July 15, 2024

The null code point? That would be pedantic even by my standards.

saagarjha · on July 15, 2024

Look, I had to include it or someone was going to write a whole pedantic comment about how C can’t actually represent UTF-8 correctly.

kstenerud · on July 15, 2024

You could have just said it rather than going through this smug "I know something you don't know!" song and dance.

Also, by this rationale, NO string is ever safe in C, because pretty much every encoding technically supports codepoint 0 (even though you take your life into your hands should you ever use it). This is not a useful discussion.

saagarjha · on July 17, 2024

I mean strings that don't use that codepoint are fine, and that's most strings as I mentioned above.

Sebb767 · on July 15, 2024

Actually, by just alluding to the bug without saying it explicitly, you managed to both be pedantic and not avoid the discussion.

This is not meant as a personal attack; I just want to point out how it looks on a casual reading :)

saagarjha · on July 17, 2024

Well, I'm not perfect ;)

pdpi · on July 15, 2024

By that metric, C can't represent ASCII correctly either, because there's no particular reason you couldn't have a NUL character somewhere inside a string.

TeMPOraL · on July 15, 2024

Indeed it can't. Many developers were bitten by this, and still are; plenty of critical bugs and security vulnerabilities rely on this quirk too.

account42 · on July 15, 2024

Technically, C can. It's just C strings that are limited.

pdpi · on July 15, 2024

Sure, in the exact same way that C can handle Unicode just fine too. The problem is, as always, C strings.
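The distinction between "C" and "C strings" can be sketched with a counted string: a pointer plus an explicit length, with no terminator involved. The type `byte_str` and the function `byte_str_equal` are hypothetical names for this example, not part of any standard library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counted-string type: the length travels with the pointer,
 * so a NUL byte is just another byte of payload. */
struct byte_str {
    const char *data;
    size_t len;
};

/* Equal only if both the lengths and all len bytes match; embedded NULs
 * are compared like any other byte. */
int byte_str_equal(struct byte_str a, struct byte_str b)
{
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

With `{"ab\0cd", 5}` and `{"ab\0xy", 5}`, `byte_str_equal` reports a difference, while `strcmp` on the same two pointers stops at the first NUL and calls them equal.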

moralestapia · on July 15, 2024

> hint: it involves a 1-byte codepoint

Which is?