It's a TOCTOU bug [1], a well-known category of bugs. [1] https://en.wikipedia.o...

masklinn · on July 12, 2023

Also one more argument for “parse, don’t validate”. The code validates a mutable input and assumes that the validation holds thereafter. An incorrect assumption, as it turns out.

vacuity · on July 12, 2023

While that is good advice, it doesn't help on its own with mutable input. Unless a defensive copy is made (at least, under the current system), concurrent modification can still occur. The new type would still wrap the same underlying data, which the caller can keep modifying.
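To make the defensive-copy point concrete, here is a minimal sketch. `Latin1Text` is a hypothetical type invented for illustration, not anything from the JDK. It snapshots the input before validating, so the check and every later read see the same private data:

```java
import java.util.Arrays;

// Hypothetical "parsed" type: holds a private snapshot, never the caller's array.
final class Latin1Text {
    private final char[] chars;

    private Latin1Text(char[] chars) {
        this.chars = chars;
    }

    // Copy first, then validate the copy. A concurrent writer may still
    // scribble on `input` while it is being copied, but whatever ends up in
    // the snapshot is exactly what gets validated and used afterwards.
    static Latin1Text parse(char[] input) {
        char[] snapshot = Arrays.copyOf(input, input.length);
        for (char c : snapshot) {
            if (c > 0xFF) {
                throw new IllegalArgumentException("not Latin-1: U+" + Integer.toHexString(c));
            }
        }
        return new Latin1Text(snapshot);
    }

    char charAt(int index) {
        return chars[index];
    }

    int length() {
        return chars.length;
    }
}
```

Mutating the original array after `parse` returns has no effect on the `Latin1Text`, which is the property a wrapper around the caller's array cannot give you.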

masklinn · on July 12, 2023

There are necessarily copies being made, since it's converting a `char[]` to a `byte[]`; the problem is that they're not done correctly.

Currently the code tries to encode the chars, and if it fails it completely bails out and restarts with a code unit copy. The bailing and restarting is what offers the opportunity for TOCTOU.
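A simplified sketch of that bail-and-restart shape, for illustration only. This is not the JDK source: `RestartSketch` and all its methods are made-up names, the byte order is arbitrary, and the `raceWindow` hook is a stand-in for another thread writing to the array between the two passes.

```java
final class RestartSketch {
    // First pass: narrow each char to one byte, giving up (null) on the
    // first char that does not fit in Latin-1.
    static byte[] tryLatin1(char[] src) {
        byte[] out = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            if (src[i] > 0xFF) {
                return null;
            }
            out[i] = (byte) src[i];
        }
        return out;
    }

    // Second pass: plain two-bytes-per-char copy that re-reads src from
    // index 0 and does no checking of its own (big-endian for illustration).
    static byte[] toUtf16(char[] src) {
        byte[] out = new byte[src.length * 2];
        for (int i = 0; i < src.length; i++) {
            out[2 * i] = (byte) (src[i] >> 8);
            out[2 * i + 1] = (byte) src[i];
        }
        return out;
    }

    // True if every code unit in a UTF-16 buffer would have fit in Latin-1.
    static boolean utf16HoldsOnlyLatin1(byte[] utf16) {
        for (int i = 0; i < utf16.length; i += 2) {
            if (utf16[i] != 0) {
                return false;
            }
        }
        return true;
    }

    static byte[] encode(char[] src, Runnable raceWindow) {
        byte[] latin1 = tryLatin1(src);   // time of check
        if (latin1 != null) {
            return latin1;
        }
        raceWindow.run();                 // a concurrent writer could act here
        return toUtf16(src);              // time of use: trusts a stale check
    }
}
```

With `char[] src = {'a', '\u20AC'}` and a hook that does `src[1] = 'b'`, `encode` returns a 4-byte UTF-16 buffer for which `utf16HoldsOnlyLatin1` is true: the "UTF-16 means at least one non-Latin-1 char" assumption no longer holds.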

But if, instead of bailing, it converted the data collected so far to code units, then appended the code unit on which it failed, then switched to a UTF-16 copy loop, the result would necessarily be correct (at least insofar as a UTF-16 string would not contain just Latin-1).

And in fact this would likely be more efficient than the current version, because we already know that everything we've converted so far is valid Latin-1, which means we can literally just copy it to every other byte. There is no need to redo that validation and conversion work. The current code does redo it, because `StringUTF16.toBytes` restarts the entire thing from zero.
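A rough sketch of that single-pass idea, under my own assumptions: `SinglePassEncode`, `Encoded`, and the coder constants are hypothetical names, big-endian byte order is chosen arbitrarily, and overflow of `len * 2` for huge arrays is ignored. Each source char is read exactly once, and the already-converted prefix is widened from the private buffer rather than re-read from the caller's array.

```java
final class SinglePassEncode {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    // Hypothetical result holder: which encoding was chosen, plus the bytes.
    static final class Encoded {
        final byte coder;
        final byte[] bytes;

        Encoded(byte coder, byte[] bytes) {
            this.coder = coder;
            this.bytes = bytes;
        }
    }

    static Encoded encode(char[] src) {
        int len = src.length;
        byte[] latin1 = new byte[len];
        for (int i = 0; i < len; i++) {
            char c = src[i];                       // the only read of src[i]
            if (c > 0xFF) {
                byte[] utf16 = new byte[len * 2];
                // Widen the prefix from our own buffer: known-good Latin-1,
                // so no re-validation and no second look at src[0..i).
                for (int j = 0; j < i; j++) {
                    putChar(utf16, j, (char) (latin1[j] & 0xFF));
                }
                putChar(utf16, i, c);              // the char we failed on
                for (int j = i + 1; j < len; j++) {
                    putChar(utf16, j, src[j]);     // unchecked code unit copy
                }
                return new Encoded(UTF16, utf16);
            }
            latin1[i] = (byte) c;
        }
        return new Encoded(LATIN1, latin1);
    }

    // Store one code unit at char index `index`, big-endian for illustration.
    private static void putChar(byte[] dst, int index, char c) {
        dst[2 * index] = (byte) (c >> 8);
        dst[2 * index + 1] = (byte) c;
    }
}
```

Because the wide char is written from the local `c` that triggered the switch, a UTF-16 result always contains at least one non-Latin-1 code unit, whatever another thread does to `src` in the meantime.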

vacuity · on July 12, 2023

Ah, good points. Thank you for the insight!