Partially copying a 2-byte character?


UTF-8 works fine if you truncate a codepoint because the encoding scheme lets you detect this. The problem is more subtle than that (hint: it involves a 1-byte codepoint).
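For anyone wondering what "lets you detect this" looks like in practice, here is a minimal sketch in C. It assumes the input is valid UTF-8 to begin with, and the function name `utf8_truncate_len` is made up for illustration. It relies on the fact that every continuation byte in UTF-8 has the bit pattern 10xxxxxx, so a cut that would land in the middle of a code point can be spotted and moved back.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the largest length <= max_len at which
 * the first bytes of s end on a code point boundary.
 * Assumes s points to len bytes of valid UTF-8. */
size_t utf8_truncate_len(const char *s, size_t len, size_t max_len)
{
    assert(s != NULL);

    /* Nothing to cut: the whole string already fits. */
    if (len <= max_len)
        return len;

    size_t n = max_len;

    /* If the byte just past the cut is a continuation byte (10xxxxxx),
     * cutting at n would split a code point, so back up until the byte
     * at the cut position starts a new code point. */
    while (n > 0 && ((unsigned char)s[n] & 0xC0) == 0x80)
        n--;

    return n;
}
```

Copying only `utf8_truncate_len(s, len, max_len)` bytes keeps the result decodable. It says nothing about grapheme clusters or whether the shortened text still means what you wanted.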


Truncating a UTF-8 codepoint is not fine because most software is not tested with partially broken UTF-8 so international users will likely run into many bugs.

Especially because concatenation is a very common operation so those sliced codepoints will be everywhere, including in the middle of text.


Morally I view “what do I do with my truncated string” to be a separate issue from “how do I truncate the string” as described in the article. Like, yes, you absolutely should not concatenate after doing this operation. But maybe you shouldn’t be showing the user a truncated string either even if it’s all ASCII. The question of “did you make an unparseable UTF-8 string” is answered with “no” and the more complicated but also more interesting question of “did you actually want this” remains unanswered.


This is fair, the article takes truncating a string to fit in a status bar as an example.


Also consider Unicode is not only international characters, but superscripts and other stuff ♥ᵃ

a: there was a list somewhere of which characters Hacker News allows?


If you're alluding to NUL, I don't really see the issue?

Yes, many languages allow strings (UTF-8 or otherwise) to contain null bytes, and C's str*() functions obviously do not, but null-termination vs not is an orthogonal issue to ASCII vs UTF-8.

i.e. Yes it's (depending on context) an issue that C str*() cannot handle strings with embedded null bytes, but that's not a UTF-8-specific issue.
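To make the embedded-NUL point concrete, here is a small sketch in C. The helper name `c_string_view_len` is hypothetical; it only uses the standard `memchr`. It reports how many bytes of a buffer the str*() family would actually see, which is everything up to the first zero byte, whatever the encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: given a buffer of len bytes, returns how many of
 * those bytes a NUL-terminated view (strlen, strcpy, ...) would cover.
 * If the buffer holds no zero byte, the answer is len itself. */
size_t c_string_view_len(const char *buf, size_t len)
{
    assert(buf != NULL);

    /* memchr returns a pointer to the first zero byte, or NULL if none. */
    const char *nul = memchr(buf, '\0', len);
    return nul ? (size_t)(nul - buf) : len;
}
```

The check never looks at whether the bytes are ASCII or UTF-8, which is the orthogonality being described here.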


A function that can turn a correctly formatted UTF-8 string into a malformed UTF-8 string is, in my opinion, broken.


One problem here is that the string may not have been a correctly formatted UTF-8 string to begin with. No, not that it can contain any bytes; I mean that more might be expected of it than just decoding correctly. Maybe it is supposed to have the grapheme clusters preserved. Maybe the truncation should peel off the last file component because the string holds a path. The operation of “doing a dumb truncation” can be broken from plenty of angles, and I don’t disagree with you, but I do want to make clear that the issue isn’t that memcpy is breaking it but that if you need x, y, z maybe you’re reaching for the wrong tool. And conversely there is nothing inherently wrong with using it if you are going to use it in a way that is resilient to that kind of truncation.


What about a function that can turn a correctly spelled English sentence into a malformed English sentence? If you truncate to a fixed length this comes with the territory.


The null code point? That would be pedantic even by my standards.


Look I had to include it or someone is going to do a whole pedantic comment about how C can’t actually represent UTF-8 correctly


You could have just said it rather than going through this smug "I know something you don't know!" song and dance.

Also, by this rationale, NO string is ever safe in C, because pretty much every encoding technically supports codepoint 0 (even though you take your life into your hands should you ever use it). This is not a useful discussion.


I mean strings that don't use that codepoint are fine, and that's most strings as I mentioned above.


Actually, by just alluding to the bug without saying it explicitly, you managed to both be pedantic and not avoid the discussion.

This is not meant as a personal attack; I just want to point out how it looks on a casual reading :)


Well, I'm not perfect ;)


By that metric, C can't represent ASCII correctly either, because there's no particular reason you couldn't have a NUL character somewhere inside a string.


Indeed it can't. Many developers were bitten by this, and still are; plenty of critical bugs and security vulnerabilities rely on this quirk too.


Technically, C can. It's just C strings that are limited.


Sure, in the exact same way that C can handle Unicode just fine too. The problem is, as always, C strings.
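As a rough illustration of "C can, C strings can't", here is a sketch of a pointer-plus-length string in C. The `struct byte_str` type and `byte_str_equal` are invented for this example and are not from any particular library. Carrying the length explicitly means a zero byte is just another byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-carrying string: no terminator is needed,
 * so the data may contain zero bytes anywhere. */
struct byte_str {
    const char *data;
    size_t len;
};

/* Compares two strings by length and content, including any bytes
 * that follow an embedded NUL. Returns 1 if equal, 0 otherwise. */
int byte_str_equal(struct byte_str a, struct byte_str b)
{
    assert(a.data != NULL && b.data != NULL);
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

With `{ "ab\0cd", 5 }` and `{ "ab\0ce", 5 }`, `strcmp` on the raw pointers stops at the NUL and calls them equal, while `byte_str_equal` sees the differing last byte.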


>hint: it involves a 1-byte codepoint

Which is?



