Another way of fixing it would be, whichever character caused 'StringUTF16.compress' to fail (we'd need to return it), make sure that character is kept in the final string (could just assign it at the end, remember it's value and location earlier).
By merging in the code of StringUTF16.compress and StringUTF16.toBytes into one function (and they are both very small), this wouldn't even have to slow things down, once you noticed location 'i' has char 'c' which isn't LATIN1, you make the new buffer, copy 0..i-1 over, then your known non-LATIN1 'i', then i+1..len.
This might be very slightly slower, but would fix (I think) the problem of breaking interning, which feels quite painful.
Clever! I like this! I feel a bit jealous that I didn't think of it :)
It's not quite as easy as it seems, because these methods are "intrinsics". That is, the pure Java code you see is not always the one that gets used; instead, the JIT compiler can use a faster implementation that uses vectorized assembly or whatever. (That's why you see "@IntrinsicCandidate" on compress() and toBytes() in StringUTF16.java)
But I think your idea would also be possible to implement in vectorized assembly, so it still works!
The rule for most things (such as ArrayList) in the JDK is: "if you use race conditions to break this thing, that's your problem, not ours". But String is different: it's one of the few things meant to be "rock-solid, can't break it even if you try", so I think this bug in String would qualify as a potential security issue in their mind: there are many places in Java that trust Strings not to act weird, and some of them are even in native code deep in the guts of the VM.
On the other hand, String is used all over the place so having to introduce a performance regression to fix this bug would be quite painful. All of the other proposed solutions in this thread introduce an extra copy, and an extra pass over the string. Your fix is basically no extra cost, and as a bonus, can be tweaked so each char in the array is read only once.
Which means that the bug can now be fixed "guilt-free", if anyone from the JDK team is reading this thread. Though they have some pretty clever people there too, so they might have thought of it eventually for all I know.
By merging in the code of StringUTF16.compress and StringUTF16.toBytes into one function (and they are both very small), this wouldn't even have to slow things down, once you noticed location 'i' has char 'c' which isn't LATIN1, you make the new buffer, copy 0..i-1 over, then your known non-LATIN1 'i', then i+1..len.
This might be very slightly slower, but would fix (I think) the problem of breaking interning, which feels quite painful.