Hacker News new | past | comments | ask | show | jobs | submit login
Breaking java.lang.String (wouter.coekaerts.be)
220 points by coekie on July 11, 2023 | hide | past | favorite | 198 comments



Is this actually a bug? The default assumption in Java is that types are not thread-safe unless otherwise specified. Attempting to use types in a way that exceeds their documented thread safety has always been allowed to leave your program in an inconsistent state.


Author of the blog here. Yes, this is a bug. While the javadoc doesn't state it explicitly, immutable classes in the core library are expected to be thread safe. Java tries to be and mostly succeeds at being a safe language, where (by default) the guarantees of its internals cannot be broken no matter what user code does. The JVM preserves its own integrity.

There are some other deliberate holes in that safety, such as using reflection to access private members, and instrumentation, where it is clear you are stepping outside the safety zone, but even that is getting locked down now with the "Integrity and Strong Encapsulation" effort https://openjdk.org/jeps/8305968 .

In general, most code we write would not (and should not) try to protect against such abuse, but classes in the core platform play by different rules.

Btw, this has been accepted into the bug database now: https://bugs.java.com/bugdatabase/view_bug?bug_id=JDK-831190... . I expect this to be fixed in a future release.


Dunno, this seems like you're violating the java memory model and then then fully expected weird shit happens. If you're mutating a shared state between threads without proper synchronization, and there is no way for String to do this on its own, the CPU is permitted to do stuff like out-of-order execution as there is no happens-before relationship between the write and the read. The JVM is further permitted to optimize the code with the assumption that the data stays the same in the absence of volatile/synchronized/other synchronziation primitives, as long as the observable outcome is the same "as if serial" execution.

The understanding 'It first tries to encode the characters as latin-1 using StringUTF16.compress. If that fails, it returns null and the constructor falls back to using UTF-16. ' is incorrect in unsynchronized code. The reality is that it's permitted to do all these things at the same time in any order (or speculatively do both at the same time) as long as the observable consequences are the same. This relies on the assumption that the data stays the same. If you violate that assumption, you get bizarre and often unpredictable behavior.

I don't think this is a bug in String. It may be a security vulnerability in the JVM, in a log4j-esque "it's working as we intended but holy crap did we ever not intend on this interaction" manner.


I didn't think it was a bug either till I got to the "Spooky action at a distance" section. The fact that the following code:

"hello world".startsWith("hello")

can return false in any circumstance, is a bug in the language. The fact that some other code in an entirely unrelated part of the codebase can intern a broken string and thus break string-comparisons for the entire codebase, is mind boggling.

Fortunately, I don't think the fix needs to be too expensive. After a defensive copy of the byte-array is made, just before returning the newly constructed string object, verify once again that the chosen coder matches the string contents. If it doesn't, that would imply a race condition, and the broken string instance should not be returned


> After a defensive copy of the byte-array is made, just before returning the newly constructed string object, verify once again that the chosen coder matches the string contents.

Allocating new String objects might very well be the most frequent memory allocation operation in the JVM. IMO it would be a mistake to do anything to slow this down to protect against this weird instantiation behavior, unless it implies a security vulnerability which can't be fixed by guards at the interning step.


The language maintainers have already decided to slow down string-construction by calling StringUTF16.compress(value) which checks if the input can be converted into LATIN1 and if so, creating a LATIN1 byte-array representation from the ground up.

Compared to this, I wager that the incremental cost is small to verify that the coder matches the content.

Also, depending on the implementation of string-interning and StringUTF16.compress, it's possible that the verification step is only needed for non-LATIN1 strings. Which according to the JEP is only a small minority of strings seen.

If there is a cheaper solution, I'm certainly all for it. I'm just spitballing ideas here. But I don't think it is acceptable to allow "hello world".startsWith("hello") to return false.


I don’t think you need any defensive copy, just to have compress switch to utf16 mode when it encounters a non-latin1 char and go on from there.

It would likely be faster since currently the implementation restarts the entire conversion from 0 in utf16 mode.


Making the 99% case faster at the cost of making the 1% case slower, is a speedup.


A defensive copy probably doesn't help here. The JVM is permitted to optimize it away, since it has no externally observable effects and doesn't have a constructor.


Wait, really? It's very externally observable in the presence of multithreading. Surely there must be a way to say "no, really, please make a copy".


afaik the person you replied to is wrong, using System.arrayCopy would definitely make a copy and not be optimized away.

Maybe poster meant only copying the reference with `char[] copy = original` which indeed does nothing, but that's not a defensive copy.


This copy has no external observable effects (it's only temporarily used in the constructor and never stored anywhere), and since the JVM is (under the JMM) permitted work under the assumption the original array doesn't change unless the thread itself changes it or there is some happens-before relationship with something that does, and is permitted to re-order same-thread execution as-if-serial, it's fully permitted to assume the original array is identical to the copied array, and it may elide the copy, since the original array isn't stored or mutated in any way.

Whether the JVM actually does this for arraycopy is an implementation detail. The JVM is permitted to perform this type of escape analysis and elision, and definitely does for some allocations.

In general arrays and concurrency doesn't mix very well in Java, and is something I'd avoid mixing on principle. You avoid most of these types of problems by having threads talk by passing around immutable objects via concurrency primitives, volatile references, synchronized-blocks, etc.


You are incorrect - it is allowed to behave as if the original didn't or did change, that is true and in the memory model. But once the copy is done, then the state is settled, and is no longed allowed to change at any point. There are no more conflicting accesses.

It is "fully permitted to assume the original array is identical to the copied array" only to the extent that affects reading from the original array, not the copy (ie, it can pretend that the original didn't change, even it did).

Anything else would break causality requirements (JLS§17.4.8.).


> In general arrays and concurrency doesn't mix very well in Java, and is something I'd avoid mixing on principle.

Sound sensible. I think what you've said is any semi-malicious code can give you a primitive array, then modify it from under you in another thread. That's not completely surprising but the clincher is, you can't even take a defensive copy.


Well I mean there are things you can do prevent the JVM from this particular optimization. But you need to use the right tools to tell Java to expect spooky action at a distance, and arraycopy just isn't sufficient.

This would probably do

  byte[] b;

  // A
  synchronized (a) { 
    b = System.arraycopy(a);
  }
  // B
  
Here A happens-before B, and you can assume a read fence at the start of synchronized block and a store fence at end of it, so consumers of b should get a consistent view of the array, and the synchronized block tells Java that you expect a to be modified by another thread so it can't assume it stays the same throughout the execution; and elision of the arraycopy isn't possible; so while the array may be in a weird state, it will at least stay consistent throughout the execution.


> The reality is that it's permitted to do all these things

Yes, given the code in the String class, the JVM is allowed to execute it in this way. I'm not claiming that is wrong, I'm claiming the code in the String class should not have been written in a way that allowed the JVM to do that.

> If you violate that assumption, you get bizarre and often unpredictable behavior.

Java is (supposed to be) a safe language, that maintains its integrity. It does not have the effect that doing something that was not documented as being safe leads to _undefined behaviour_ and therefor all bets are off (like e.g. C).

I'm pretty sure the JDK maintainers intended for String to be a thread safe class with guarantees that can not be broken with race conditions. It's true that the thread safety of String is not explicitly documented, but I believe that's just an oversight in the docs, that String being thread safe just goes without saying. I realize that's hand-wavy, and I don't have any hard proof for that right now, so I understand not everyone might be convinced of that. I'll wait for the bugfix to make that more clear. I don't expect that fix to have a significant performance impact. Fixing it doesn't necessarily require more barriers or more defensive copies. See e.g. this draft patch (incomplete, but good enough to give a rough idea of how it could be done): https://gist.github.com/amaembo/83b4aeab1cd190f2e1dbd75eacb3... .

But the combination with interning, the "hello world".startsWith("hello") example, is clear evidence that it is a bug. Some code running in the same JVM having triggered some race condition earlier should not break unrelated code, it should not put the JVM in a broken state where such code behaves in nonsensical ways.


Something like a ConcurrentModificationException would be nice to have in the "weird shit happens" case, because you may not even be aware of it, and then it might cause headache much further down the line. If you could point upstream that would save a lot of headache and grey hair...


There's no way to reliably detect this type of concurrent modification that doesn't involve introducing expensive barrier operations that evict the CPU cache on multiple points along the code flow and choke out-of-order execution and limit what the JIT can do. If you do not have such a barrier, the CPU is permitted to assume the data hasn't been changed, and may cache it, making detection of concurrent modification impossible. The JIT will also optimize your code based on this assumption. Even if you go out of your way to check that the data is the same you may see stale data when you check!

It may run your code out of order, and even JIT-recompile it out of order, and your code that's like

  int oldValue = val;
  doSomething(val); 
  if (val != oldValue) throw new ConcurrentModificationException();
may be re-recompiled like this, since it looks like doSomething can't modify val or oldValue

  int oldValue = val;
  if (val != oldValue) throw new ConcurrentModificationException();

  doSomething(val);
and then since this if check is trivially always unnecessary, and nothing of consequence ever reads oldValue it may delete them entirely

  doSomething(val);
Introducing a barrier would likely make most Java programs 10x slower 'cause Java does a lot of throwaway string allocation on the assumption it's cheap and short-lived GC is virtually free.

The Java Memory Model is a great accomplishment and if you heed it you will write performant multi-threaded applications with an ease that is rare in other languages. If on the other hand you do not heed the JMM, you will at best get a CME alerting you of this case, but in most cases just get holes in your feet.


> it looks like doSomething can't modify val or oldValue

...sure, if you are using primitives. But here we have an array (reference), the contents of which can be modified by `doSomething(val)` in your example. It is only the reference value that cannot be modified.


I was trying to reduce the amount of concepts I had to invoke to make the example make sense.

java.lang.String is a final class so in this case so doSomething can't be overridden, and strong assumptions can be made about doSomething's behavior and in practice it's likely inlined given that this is an extremely hot method.

Since this is a constructor, the reference can't be aliased by e.g. fields in the class, so static analysis of the code is surprisingly easy. The compiler just needs to answer whether this particular reference is mutated by the invoked function(s).


OK, this makes sense, thanks.


char[] is not an immutable class. Thread safety issue happens on that type, not the fault of String type.

It is weird Java considered it a bug, I expect it to be resolved as "wont do"


Thread safety of char[] is the cause of the bug, but the effect is that String instances can violate their invariants, which they shouldn't even in case of thread-unsafe code.


In my opinion the abstraction is leaky but everything works according to the spec.

I might tell you about another "bug" and "invariant violation" which is possible but is not a bug. Try to use Sets or Maps with keys having broken hashcode.


I can't agree with this analogy. This is not a defective user implementation of hashcode; it is a defective platform implementation of equals (and starts with, etc.).


This is a broken user code written without any JMM understanding.


A proper implementation of hashcode can return nonsense if you mutate the object during hashing.


Sure, but here we're not mutating the object during the String.startsWith() call. Hence I'd say it's a TOCTOU bug in String.valueOf().


Again, this is not a bug. There are many, way too many places in standard library which are thread unsafe. Some of these are actual bugs, but not this one. If Java originally moved in Odersky's direction that won't be the case but now it's way too late to point fingers at these "quirks". In order to properly fix these issues the whole standard library has to be redesigned around immutability and, probably, immutable arrays must be supported at VM level. This is not going to happen.

But it's specified that standard Java classes are thread unsafe by default. So this is a specified behaviour.


This is pretty much the same as breaking a Map by mutating a key during insertion.


Precisely.

Is it good that this can happen? No, Java must be more like Scala and push immutability as sane default.

Is it a bug? No.

Is it possible to "fix" the language and the VM at this stage? Yes, but that would be a new language and new VM. So, that's not going to happen.


Immutability and thread-safety are two different things. There are no thread-safety guarantees made by the JVM or the runtime for the String type, and I suspect your bug report will get the same answer.

You are right that there is an expecation or assumption of such (from the users), which makes it even more interesting when it breaks.

Locking a thread for every allocation of a string is very likely cost-prohibitive and not a great trade-off for the common case. It would make more sense to add something like StringFactory.CreateString(char[]) and instruct developers in using it if thread safety is nessecary.


How would you fix this? Possibly by copying the input array to a local array as the first step? But then you might copy a lot of data, which is in 99% of the cases totally unnecessary. Have a separate threadsafe and a non-threadsafe constructor for char[]?

EDIT: another idea is to do a hash of the input array, and compare between start and end. But that also requires an O(n) walk through the input...


> But that also requires an O(n) walk through the input...

Compressing requires an O(n) walk through the input either way.


Yes. In fact, both String.valueOf(char[]) and constructor String(char[]) are already documented as copying the array contents.


/s redesign a language to prevent parallel mutable use of the array


This won't happen (unfortunately) for obvious reasons.


I'm not sure how much of the underlying implementation details of arrays are leaking into JNI, but one way to approach it would be with first class copy-on-write for arrays or opt-in immutability (freezing).


> How would you fix this? Possibly by copying the input array to a local array as the first step?

Upon encountering a non-Latin1 char, convert the existing latin1 array to an utf16 one, append the char (converted), then resume iteration in utf16 copy mode.

This way you know the char which made you bail out on Latin1 is part of the output.


Copy the array, then use alias analysis to remove the copy most of the time, by having two versions of the String constructor, with a version chosen based on aliasing.


Accepting an issue doesn't mean that there is a bug.

> While the javadoc doesn't state it explicitly, immutable classes in the core library are expected to be thread safe.

If it's not documented it's just an assumption.


Yes, but at some point one has come out of the theory domain and face the real world. So in practical terms it’s closer to a bug and even the “official” guys from Java accepted it as a bug.


Could you provide a link to the documentation to proof it's a bug?

The author could call it 'probably bug' or 'misleading String constructor documentation' but instead they state it's a bug without any proofs.


Here: https://bugs.java.com/bugdatabase/view_bug?bug_id=JDK-831190...

It was accepted as a bug. I'm not saying I would condemn anyone to the death penalty based on that but being accepted as a bug there has to have a significant weight attached to it.


> It was accepted as a bug.

I know. It was in this thread above. So there is no link to the specification to proof that it's a bug.


In my opinion this is not a bug and there is no general fix for that without significant overhaul of the whole runtime and standard library.

In my opinion the author might need to read the vm and memory model spec and understand that such things are expected in Java.

There are real bugs in standard library singletons (e.g. concurrent Filesystem calls might fail during singleton initialization, while plugins are loaded). These bugs are being ignored for years.

The author must not expect any "fix" for this in foreseeable future, but they might expect being laughed at.


Eh, I don't think the author is to be ridiculed. We're all wrong from time to time, I'm wrong in most of what I say, I'm sure you've been wrong on occasions, this is fine.

The author's views on how concurrency works is very common and taught in schools and textbooks, but also quite inadequate.

Instead let's take this opportunity to talk and teach about the JMM and deepen the collective understanding of the many unintuitive behaviors of multi-threaded applications.


Maybe there could be a JMM 2.0 where at least some of the unexpected behaviour is removed. Maybe even at the cost of performance. If something is misunderstood by >99% of the average programmer population then clearly it is not the best solution to the problem.


If that many programmers don't understand the JMM, it's because that many programmers haven't looked at it.

It is the key to reasoning about Java concurrency. Any other model that is not the JMM is flat out wrong. It is virtually impossible to write correct concurrent Java without understanding the memory model.


A defensive copy is the cheap way to fix this.

The bug is a potential security vulnerability since it can affect the value of interned strings, which can break other libraries.


Even the example in article alone. Imagine someone intern a broken "script" string, and people trying to sanitize HTML script tags no longer find it.


That still has to be done ad-hoc and might kill performance for many usecases.

This is not a bug, but a specified behavior.


Changing the evaluation of the expression x.equals(y) to false when x and y are strings with the same value is not a specified behavior.


It is. The "buggy" code isn't marked as thread safe.


That's true, but in the case of Strings in particular they are generally considered to be thread-safe by virtue of being immutable (and the Javadocs themselves say this in many places). Concurrently modifying the input character array may seem like willful abuse in this case, but I suppose there might be some carelessly written code out there which does it and the post shows how it would create some weird and very hard-to-debug behavior. And I can see the argument going both ways as far as the Javadoc: on the one hand, it only promises to return a String representing what is currently in the input character array, which seems to warn against concurrent/subsequent modification at least obliquely. But on the other hand, it does seem to commit to an up-front defensive copy of the input character array which doesn't quite take place, and without which things go a bit haywire if the encoding of the string changes.

I don't think the correct answer is to sacrifice any performance to guard against this weird edge case, though.


As the article points out, the only thing the Javadoc guarantees is the subsequent modifications of the character array have no effect. It says nothing about concurrent modifications.

The type whose thread safety is in question here is not actually String, but char[]. I'm not going to say it is always wrong to share char[] between threads (as a primitive array of something other than double and long, the Java Spec does make some guarentees about char[]), however it is almost always wrong to be sharing char[] between threads.

If you want a language that protects you from this type of mistake, you simply need something more advanced than Java.

This isn't even the biggest hole in Java's safety. Java has a type system, which is at least a nominal claim to type safety. Yet, you can do things like:

    Integer[] foo = new Integer[1];
    Object[] bar = foo;
    bar[0] = new Object()
And the compiler will let you (although the runtime won't).

Its even worse with generics:

        List<Integer> foo = new ArrayList<>();
        List bar = foo;
        bar.add(new Object());
        System.out.println(foo.get(0));
        Integer x = foo.get(0);
Will compile, and won't even through an exception until the very last line, past the point where you have retrieved and used a non Integer from a List<Integer>

In fairness to java on the last one, it is at least a warning, and a deliberate compromise to support backwards compatability when they introduced generics.


While you are technically right in all your points, I think you exaggerate the problems regarding Java’s type system. Arrays are indeed a pain point from different times (the mistake of their covariance even “infected” C#), but generics are not problematic at all. In fact, code that compiles without warning is guaranteed to never get a ClassCastException.

Every language’s type system gives you escape hatches, so that last example is similar to how one might implement an optimization library in Rust where you have uninitiated elements. Hell, there you don’t even get runtime errors if you get it wrong, it will just segfault. So I really disagree with this notion of “it even worse with generics”.


> how one might implement an optimization library in Rust where you have uninitiated elements

The correct way to do this is use the MaybeUninit<T> type. Then you're responsible for correctly initializing a T before you call MaybeUninit<T>::assume_init() to get a T.


And that MaybeUninit type uses an unsafe call under the hood as an escape hatch from the "draconian" compiler's rules. Every practical language has these.

(nonetheless, thanks for mentioning the MaybeUninit, that was what I was thinking of, but didn't remember the name -- I haven't programmed in Rust for a long time)


MaybeUninit<T>::assume_init() is the unsafe call. The promise that this is actually a T and not just a T-sized blob of uninitialized memory is made by that unsafe call, and since it's unsafe it's your responsibility as the programmer to obey the constraint, that this is, in fact, an initialized T.

Depending on how long "a long time" is, you might well not actually have been thinking of MaybeUninit. The prior solution (until mid-2019) was as you describe, and it was in fact unsound.


The point I was trying to convey (badly) with this example is that in both the assign to raw generic and assume_init case, the developer overrides the compiler's knowledge for his/her, demonstrating that most languages have similar escape hatches, because they are sometimes simply needed.


Yes, what you've described is usually unsound†, which is why it's deprecated (but it's offered in the standard library and so Rust policy is not to remove it since there are at least in theory sound ways it could have been used, shame to break them).

MaybeUninit<T> doesn't override the compiler's knowledge. It's using one very, very clever trick. Take a look at how MaybeUninit is defined. It's a union! The compiler can see this might be a T, but it can't ever see whether it's actually a T right now. As a result all the mis-optimisations which occur with the unsound approaches can't happen. MaybeUninit<T>::assume_init() is just a union read, which, sure enough, is an unsafe operation in Rust (writing to a union is safe, but reading from one is not) and that's why the function is unsafe.

† This is technically sound for Zero Size Types, because then we're basically saying e.g. "You see nothing? Well, I promise it's actually no Spoons, like, an array of 0 Spoons". So that can't cause the compiler to emit code that we didn't expect - the Rust compiler isn't going to emit code to read or write no Spoons, whether or not they "exist" according to type analysis.


I don't think this is a fair attack on Java's type system. In both examples you do what is effectively casting to void pointer - escape type system guarantees.


"The type whose thread safety is in question here is not actually String, but char[]"

Not quite, it is String constructor that has the race condition, char array is incidental there.


The thread-safety guarantees can only be obeyed if the arguments are thread-safe as well.

For example, take a java.util.concurrent.ConcurrentHashMap, which is thread-safe. If your value object is mutated concurrently, and is not thread-safe, you can't expect anything sensible to happen if you call containsValue().

Likewise, String's char[] constructor can only be thread safe if there are no concurrent changes to the argument.


> The thread-safety guarantees can only be obeyed if the arguments are thread-safe as well.

While that’s generally true it’s not the case here, char[] does not invalidate anything upon mutation it just changes.

The issue is that the ctor checks for a property (all chars are under 256) then assumes that holds indefinitely.


> The issue is that the ctor checks for a property (all chars are under 256) then assumes that holds indefinitely.

You could say that this is a classical TOCTOU bug as the ctor assumes that a mutable argument does not mutate.

The core question[s] here are whether this assumption is and should be in the contract.

Protection against these kinds of bugs is only possible with deep copies of mutable arguments or COW semantics. So you either implement callee defensively and slow down general case (remember, this is stdlib), or you leave concurrency control to the caller. Probably every sane general purpose language does the latter.


> You could say that this is a classical TOCTOU bug as the ctor assumes that a mutable argument does not mutate.

Indeed.

> Protection against these kinds of bugs is only possible with deep copies of mutable arguments or COW semantics.

In the general case yes, but in this case no, you can solve the issue and improve performances by implementing the thing correctly, in a single pass over the input. This way you don’t get a split-brain situation.


I wouldn’t point the finger at the ctor; making a defensive deep copy of mutable parameters is not in the contract.


Immutable types are only considered thread-safe after their construction, though.


I’d argue that while a string is thread safe, this is not actually a string yet.


Probably guard intern() instead would be enough? It would be enough to stop the broken string from pollute the whole process permanently.


Not really. I'm sure there are some edge cases where the ability for someone to do silly things like this is harmful but otherwise this goes in category of things where the general advice is "just don't do that and you're fine".

Basically, if you look at why the standard library would not go out of its way to prevent you from doing stuff like this it boils down to the same reason as why they put the string compression optimization in to begin with: doing so would make things slower rather than faster. And that's just not desirable in something as performance critical as string initialization, which is a thing that a lot of Java programs would do a lot.

Having string compression helps save some memory but it's a weird low level hack. It's fine as long as you don't actively try to break things in your own program. You'd have to jump through some hoops to do it accidentally. That is not likely to happen just like that.

So, not a bug but a known limitation that you should be aware of if you go down the path of doing lots of concurrent modifications to arrays that may or may not be used to create Strings. Java has lots of nice primitives and frameworks to make writing such code less error prone but it's up to you to use those properly and it's entirely possible to write all sorts of code that has lots of concurrency issues.


How do you plan to have any assurance that none of your dependencies is doing it? You could well consume a string you didn't build and be affected.


It's a TOCTOU bug [1], a well known category of bugs.

[1] https://en.wikipedia.org/wiki/Time-of-check_to_time-of-use


Also one more argument for “parse don’t validate”. The code validates a mutable input, and assumes that validation holds thereafter. An incorrect assumption as it turns out.


While that is good advice, it can't apply to mutable input. Unless a defensive copy is made (at least, under the current system), concurrent modification can still occur. The new type's underlying data is still being accessed.


There are necessarily copies being made since it's converting a `char[]` to a `byte[]`, the problem is that they're not done correctly.

Currently the code tries to encode the chars, and if it fails it completely bails out and restarts with a code unit copy. The bailing and restarting is what offers the opportunity for TOCTOU.

But if instead of bailing it converted the data collected so far to code units, then appended the code unit on which it failed, then switched to a UTF16 copy loop, the result would necessarily be correct (at least insofar as a UTF16 string would not contain just latin1)

And in fact this would likely be more efficient than the current version, because we already know that everything we've already converted is valid latin-1, which means we can literally just copy that to every other byte. There is no need to re-do that validation and conversion work. Which is currently the case, because StringUTF16.toBytes redoes the entire thing from zero.


Ah, good points. Thank you for the insight!


For locally generated strings, it's not a concern. For String.intern, this is actually a very serious bug that should be treated as a security vulnerability.


At best it is only a security vulnerability if you are running different trust domains within your process. That is something Java supports, but most Java code does not attempt to take advantage of it. Most Java programs have 100% of there code being allowed to use reflection to simply hack the program however they please, including corrupting the intern pool.

But still, for the few people who actually trust Java's isolation features and use it, possible bypasses are a concern. Having said that, I don't see how you can turn this into an attack on String.intern.

The guarentees about interning are as follows:

> When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the equals(Object) method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned.

> It follows that for any two strings s and t, s.intern() == t.intern() is true if and only if s.equals(t) is true.

If you pollute the pool with a "broken" String, all that would happen is that that String would not be returned in place of a logically equivalent correct String, and vice-versa.


If a security sensitive operation was (for whatever reason) calling startsWith (as shown in the blog), then it might return an incorrect result. I'm merely speculating that an exploit is possible, and I'm not aware of any code that would be affected.

Regardless, the fix is quite simple. The public String.intern method just needs to validate the encoding first. This will have little impact on performance because the String.intern method is rarely called.


Their point, which I agree with, is that shared memory concurrency requires correct code, and cooperative synchronization between threads, in order to avoid data races; and if your security model is to run untrusted or unverified code with shared memory concurrency, then you cannot ensure correct code or synchronization, and you cannot guarantee against data races.

It's perfectly fine to code String.intern() defensively against such dodgy String-s, but in any sizable program with shared memory concurrency and untrusted or unverified code, there will be millions of other potential data races. The only sensible choice would be to not use such a mix.


My thinking is the same. I doubt this is an oversight. Making the String constructor thread-safe would likely slow things significantly. Great point about things being assumed not thread-safe. The JDK is pretty thorough with documenting thread-safety.


> Making the String constructor thread-safe would likely slow things significantly.

By my reckoning it would speed things up, at least going by the Java code.


By what means? The only ways I would expect are either unconditionally duplicating the array or mutex, and I don’t think the mutex wouldn’t be simple. Adding a sync block on the input could be done but that’s assuming nothing else is locked on it (but probably points to there being a race condition elsewhere if there is..).

Unconditionally duplicating the array would use more memory and wouldn’t be faster.


Inlining `StringUTF16.compress` and `StringUTF16.bytes what the code currently does is essentially this (using python as pseudocode):

    latin1 = bytearray(len(input))
    for i in range(len(input)):
        c = input[i]
        if c > 255:
            latin1 = None
            break
        latin1[i] = c

    if latin1 is not None:
        return LATIN1, latin1

    utf16 = bytearray(len(input)*2)
    for i in range(0, len(input)):
        c = input[i]
        utf16[2*i] = c >> HI_BYTE_SHIFT 
        utf16[2*i+1] = c >> LO_BYTE_SHIFT

    return UTF16, utf16
The issue occurs because `input` can be mutated between the moment it finds a char that's above 255 in the first loop and the moment it visits that same char in the second loop.

The solution is to not do that, but instead something like:

    latin1 = bytearray(len(input))
    for i in range(len(input)):
        c = input[i]
        if c <= 255:
            latin1[i] = c
            continue
        
        utf16 = bytearray(len(input)*2)
        for j in range(i):
            utf16[2*j] = latin1[j]
        latin1 = None
        utf16[2*i] = c
        utf16[2*i+1] = c >> 8
        for j in range(i+1, len(input)):
            c = input[j]
            utf16[2*i] = c
            utf16[2*i+1] = c >> 8
        return UTF16, utf16

    return LATIN1, latin1
This means if you find a char that's above 255 you will always append that char to the UTF16 array, there's no possibility that someone will change it under you because you append the exact same char you tested. So you can not get into the situation the essay describes, a utf16 string will always contain at least one non-latin1-code unit.

Non-vectorised performances should an improvement as latin1 to utf16 is a trivial operation (just copy every byte of the input to every other byte of the output)

Though if you vectorised the char to utf16 conversion you now vectorise two loops on bailout (latin1 -> utf16 up to i, then char -> utf16) which is probably less efficient. I don't know if the JDK has vectorised optimisations, the source has "HotSpotIntrinsicCandidate" annotations but I don't know to what extent the intrinsics go.


Coming back to this the answer is probably not strictly about concurrency and so there is a possibility something would need addressed about this.


I remember a bug in Visual/Digitalk Smalltalk: conversion from integer to string (method printString) was not thread safe, because it used global buffer for better performance. Very rarely - when we used a progress indicator in another thread (called Process in Visual Smalltalk) - this global buffer was overriden by another integer to string conversion... It took a year to find out the reason why our code sometimes got wrong results... I submitted the bug to Digitalk and they decided to not fix it (performance reasons). Only in later versions they abandoned the global buffer...


I don’t know the terminology for it. But the behavior that you just described isn’t just “not thread safe”. It’s violating basic assumptions. I’d call that hostile or a trap.


A function that can’t be called while it’s already running is “non-reentrant.” strtok used to work this way (before thread-locals), and some languages predating C that used globals instead of stack frames.


> Attempting to use types in a way that exceeds their documented thread safety has always been allowed to leave your program in an inconsistent state.

Wasn't there an attempt for Java to define the range of outcomes even of programs with data races?


Yes. Java promises (unlike C, C++, Go, etc.) not to have Undefined Behaviour as a result of data races, however you do lose Sequential Consistency, which means humans aren't able to successfully reason about non-trivial programs with data races.

More specifically Java says something like, if you race a value of some sort, even a complex value like a hash table, it doesn't get fatally damaged, but its new state is some state it had, or might have during other executions.


Purely out of curiosity, is that much better? At least the data won't be complete garbage, but there'll still be a logic bug lurking in the details.


Sure. In a C++ program all races can cause anything physically possible to happen. So e.g. maybe the program is processing Calendar data, but it's running as the GeneralHR service user on this machine which technically has access to the payroll database. So now Jimmy Smith's insane 4863 meetings which invite himself and a meeting room that hasn't existed for six years causes a data race and the description of a meeting ends up written over part of the stack, which explains why the meeting descriptions were gibberish, because now the machine connects to the payroll DB and sets Jimmy's monthly pay to eighty five thousand six hundred and fourteen bucks... Then it crashes. Oops, bug in the calendar app, well, it worked the next time, so, it's probably fine. Right?

It's not Good that the Java code has a bug you don't understand, but it's very unlikely to have completely wild outcomes like giving Jimmy a massive pay rise.

When they began this work, as I understand it the Java engineers imagined humans can successfully reason about the resulting bugs, but turns out that's not true. So this work might have been deemed unnecessary if they'd known that but obviously it's hard to know without trying.


// compareTo does consider them equal (even though its javadoc

// specifies that "compareTo returns 0 exactly when the

// equals(Object) method would return true")

Seems like at least a documentation error.


Calling this a "bug in java.lang.String" is silly. The same "bug" exists for all functions that take mutable objects. If you take a map and lookup two different keys, yep, that's a "bug".

The bug is the other piece of code that introduces the data race in the first place. You can argue the case for languages like Rust with it's borrow system, or others that use linear types or something along those lines, to eliminate the possibility of this happening, but it's quite misleading to say that the innocent user of a mutable object is the source of a bug. You may as well say there's a bug in `printf("Hello, World!\n");` in C because you could have another thread writing random values to random memory, running `while(1) { *((unsigned char*)(void*)rand()) = rand(); }`


This bug can change the meaning of code in third party libraries by altering the set of interned strings.

If any input to a program can be converted into this race, then it becomes a security vulnerability. It would already be a security vulnerability in the old Java sandbox model.


If you have a data race, your program can have all sorts of inconsistent state in all sorts of objects -- even standard ones. In Java, I don't think you have anything like "undefined behavior", sometimes jokingly but meaningfully called catch-fire or launch-the-nukes semantics, but that doesn't prevent a data race from breaking program invariants in ways that don't immediately or ever generate exceptions.

There was a little optimization post a few days ago: https://news.ycombinator.com/item?id=36618344. Basic summary of the program, given a NUL terminated buffer like "spppssss", return an int where each 's' adds 1, and each 'p' takes 1 away. Several of the alternate implementations in the HN comments involved using strlen first, which has the nice of effect of being highly vectorized, and knowing the size, the compiler can better vectorize the actual logic loop, and a human can vectorize better still. If we imagine another piece of code that concurrently modifies the input buffer to write a NUL to the 0 index, it would change the length to 0. To be generous and to simplify the discussion, let's say the concurrent modification happens to become visible to our thread after strlen but before the following loop. Is there now a bug in this implementation?

If the point is that java.lang.String must ensure its pre-conditions and post-conditions are always maintained, because it's so fundamental to other security functions of the JVM, then there is no solution that allows for untrusted/buggy shared memory concurrency. Any suggested modifications to the constructor like "add a lock" or "copy the buffer first" will not work, and are missing the point entirely.

If the issue is the string intern-ing of such a dodgy String, then the best you can do is code intern() more defensively, to throw out detectably dodgy input. I could then understand saying intern() has a bug, but I don't think that was the point of the post, since there is no discussion of the intern() code.


That's a lot of talking around the issue and I think you're missing it.

We're talking about constants being corrupted in code which has has no control or data flow relationship with the code doing the corrupting. Code which is totally bug free can exhibit impossible behavior.

That might not sound disturbing to you if you're taking about strlen(), because languages with unrestricted pointers are full of undefined behavior. But Java isn't like that. Java doesn't come with nasal demons.

I didn't say intern() has a bug.

The bug is Java having undefined behavior of the kind which breaks a language invariant as strong as string equality.

You're arguing that there isn't a bug because non-determinism comes from concurrency. Well, you can't have concurrency without non-determinism. That doesn't mean the language needs to stop working.


This is perhaps an unnecessary clarification, but there's plenty of non-determinism that's perfectly fine. Data races are non-deterministic and bad, but "non-determinism" itself is not a scare word.

I spent a good decade of my career writing C#, which is similar enough to Java for these discussions, and the strlen issue I talked about is not limited to C -- it is also an issue for a managed language like C# or Java. I think you would do well to discuss the problem, instead of dismissing it because myself and the original post I referenced are tainted by the nasal demons of undefined behavior in C. Don't use that as an escape hatch from dealing with the general issue presented in the problem.

I know of no language in existence where even in the face of data races, there is a guarantee that all data invariants are maintained otherwise an exception is thrown. Note that Rust is not such a language, as its goal is to make data races compilation errors, but unsafe code can introduce data races that do not generate compilation errors.

I would also suggest you offer even a sketch of a fix for the String(char[]) constructor if you think that is the source of the bug. I don't think one is possible. I'm not aware of any method to unilaterally atomically copy an unbounded char[], for example. Even if there is, I suggest you go look for other functions that take other kinds of mutable objects as parameters -- I doubt that for each and every case there is a method to unilaterally atomically deep copy those objects. Such copying is just one suggestion, feel free to resolve it any way you think works.

If you don't want to sketch a solution, another experiment you can try is to use a non-thread-safe mapping type like TreeMap, and make structural modifications (add/remove elements, not just modify existing values) from multiple concurrent threads without any synchronization. Can you guarantee that in all cases where invariants are broken, that an exception will be thrown?

I think it would be prudent to do something in intern(), and one could arguably call this a bug in intern(). The string intern table is invisibly and pervasively read shared state, and relatively uncommonly mutated. It would make sense to take extra care when mutating the shared state to attempt to ensure no invalid strings are added.


It’s a bug for constructor functions, in my book. I certainly always code defensively to prevent such misbehavior. A successfully constructed object should obey its documented interface contract, period.


> A successfully constructed object should obey its documented interface contract, period.

Could you provide the contract you are talking about?


Yes, the contract of String::equals: The result is true if and only if the argument is not null and is a String object that represents the same sequence of characters as this object.

The article constructs two String objects representing the same character sequence, for which however equals() returns false, in violation of the above-quoted contract.


Functions have an assumed pre-requirement not explicitly spelled out in every single javadoc, that the program does not have a data race. May as well ask for every javadoc to explicitly include a pre-requirement that no cosmic rays flip bits of memory.

But to be serious again, I'm sure if you look at Array or whatever type is involved in char[], you'll find that it is explicitly marked as not thread-safe.


> Functions have an assumed pre-requirement not explicitly spelled out in every single javadoc, that the program does not have a data race.

Maybe in your programs, not in mine. I want my classes’ instances to fulfill their contract once it has been constructed. If it can’t do so, construction should fail with an exception.

Note also that for example the OpenJDK’s String::hashCode implementation is formally not data-race free (racy single-check idiom), so your assumption as stated doesn’t hold.


If you mean that you construct your program in such a way as to not have data races, then I do the same. Even with mutable objects like char[], it's relatively straightforward to just not share such objects, and in the limited circumstances where you must share, to synchronize.

If you mean you somehow unilaterally do away with data races on shared mutable data, even if the concurrent thread does not synchronize in any way, then I am genuinely asking what you're referring to. To be specific, I'm not aware of any implementation of `String(char[])` that operates correctly if another thread is mutating the char[] without synchronization. The JMM guarantees are extremely limited even for distinct indices, hence the need for AtomicIntegerArray, etc.

If you mean you cannot eliminate the data race, but you can do a post-condition check of the String invariants, and you want that checking code to run even in non-debug builds, then fine. I think there is a place for "production asserts" like that, but I think they need careful application, and they don't belong at every level of the stack. I'd check the validity of the String in intern(), but not in every single String ctor as a post-condition. And certainly in most languages with weaker memory models than Java, but my guess is even in Java, such checks provide no real guarantee that data races can't ruin invariants. There's no magic tape to repair that crack in the foundations.


So they should add that mutating arguments during object construction may lead to not-so-successfully constructed (invalid) object. __Mission failed successfully__


No. The job of a constructor is to establish the class’s invariants for the new instance, or else fail with an exception. This bug in the String class fails to do so.


> The job of a constructor is to establish the class’s invariants for the new instance, or else fail with an exception.

Could you refer such statement in Java specification?


Invariants and all methods -- including constructors -- enforcing them is OOP 101. I would be surprised if they felt the need to make such a callout in the language specification.


It is probably a gotcha. But yeah if you hit this problem you have bigger worries, the code immediately before the constructor is not thread safe.


This is exactly why java needs frozen arrays [1].

The safe thing to do is freeze the array before doing anything with it. Then, you can rely on COW to copy to the array if someone is modifying it concurrently with you reading it. In the general case, you'd have fast string creation and in the tricky case you simply pay the clone cost as a penalty for being dumb.

[1] https://openjdk.org/jeps/8261007#:~:text=How%20do%20I%20use%...


I would love to have this in Java, hopefully this JEP makes it!


As would I. There are a ton of places where the JVM is defensively copying arrays. It often comes up (for me in my work) as a performance problem.

A real common example of this is `Enum#values`.

Ideally (IMO) this applies some aggressive COW operations. So perhaps internal to the enum you have a frozen array of the values and for "values()" you return something like `VALUES.unfreeze()` which points to a transparent unfrozen array. On a write action, you'd copy the array but in the general case you'd simply read from the frozen array until someone does something dumb.

You could take it a step further and simply expose the `values` field or add a new "frozenValues" method to not break existing code. In either case, you'd end up with faster performance because the JVM isn't copying the internal array needlessly.


This doesn’t help when the parameter is one of the collection interfaces, or CharSequence, or the like. You always need to defensively copy their contents first to “freeze“ their values.


Java does have immutable collections. It's just not an explicit type.

Lots of common ways to instantiate arrays (i.e. Arrays.asList) generate immutable lists


Arrays.asList doesn’t generate an immutable list, prevent writes to the array, or prevent modifying the array via the List interface.

https://docs.oracle.com/en/java/javase/17/docs/api/java.base......)


That's instantiating a List, not an array.


As you would expect in a vaguely modern language, Java's "List" interface is typically backed by a growable array as the type ArrayList. Only people who have no idea about caches would think List should necessarily be some sort of Linked List type.

You would use Collections.unmodifiableList to make it unmodifiable.


Collections.unmodifiableList doesn’t prevent anyone who has a reference to the original List (or the array behind it) from modifying the List. Calling that in the constructor would not help.


Yes, ArrayLists are array-backed, but Arrays.asList instantiates a List and does not instantiate an array.


As I explained, List is an interface. You can't instantiate it, values are instances of types not interfaces.

All that return type is telling you is, surprise, asList promises what you're getting implements the List interface. See if you can guess how you can implement the List interface using an Array. Still struggling, here's the one line of a typical implementation:

return new Arrays.ArrayList(a);


You should at least try to be correct if you're going to be so insufferable about it.


That's a very interesting finding. Nowadays Java security is a joke, but back in the day, Java security was a serious topic. Users were able to run downloaded applets in their browser, so protecting the sandbox was important. It's very likely that using those kinds of "corrupted" strings would allow to break out of this sandbox, because that protection code definitely relied on strings being sane and correct.

I can't imagine this behaviour to cause much problem with modern Java, nobody runs untrusted code anyway. But good to know.


Every time, without fail, somebody shows a bug about a piece of code that we take for granted (In this case, the String class) the bug is related to concurrent modifications.

Concurrency is so hard that even OpenJDK developers can't prevent these kind of bugs


Go has trouble with this too. You can cause undefined behavior with completely safe code by making concurrent modifications to a fat pointer. The writes won't be atomic, and the pointer can be interpreted as the wrong type. E.g. in this example, the B.foo method will be called with a C value as the receiver, which tricks it into accessing memory at 0x1000 and segfaulting, but you could also arbitrarily access any memory this way without using unsafe.

https://go.dev/play/p/y4z_vs-I1jb


Is not that OpenJDK developers can't prevent these, but there's a forbidding cost for doing so.

The simplest "safe" way of doing this involves defensively copying the input argument. However, the `compress` function will likely make yet another smaller copy, making the constructor very allocation and CPU intensive.

In fact, due to the fixed array size in Java, all thread-safe implementations must either allocate two arrays to hold the two possible encodings, which guarantees one piece of garbage, or iterating the input array twice.

For such a core class like String, this is probably unacceptable cost. And the constructor is not documented to be thread-safe, so no one should expect it to.

In reality, there are much more impactful data structures to abuse in Java.


This is the exact kind of bug that Rust solves with its borrowing system. The problem is that Java has no way to express the concept of "something that nothing else can modify while I'm looking at it".


Arguably, C++ arguably already solved that specific problem with const safety. I was flabbergasted back in 1995 that Java had dropped that notion.


I don't think it does. In C++ taking a const argument just means that you can't change it. But since the caller can pass non-const as const there is nothing stopping them from mutating it.


You can also use const_cast to remove the const-ness. You could then mutate the value.

I believe that's undefined behavior territory (the mutation, at least in most cases), but I'm sure someone is doing it in the wild.


I haven't done much C++ in a few years but IIRC you can remove const as long as the "original" value isn't const. So `const_cast((*const Foo)foo)` is fine if foo is not const.


Isn’t the linker entitled to put a constant object in a read-only page of the binary if it doesn’t require a ctor at runtime?


Yes. That is why global static `const`s (with exceptions like if they have mutable fields) can't legally have their `const` casted away.


If the constructor here would take a const reference, it would have the same problem (the existence of a const reference doesn't preclude non-const references to the same object). If it took an argument by value, it wouldn't make a difference whether it was const or not, because no other thread could have access to the same object, because it would be copied.

In Rust, on the other hand, if there's a mutable reference to an object, there can be no other references to the same object at the same time.


Mutexes etc ... exist in Java.


Yes, but Java will happily accept code that doesn't use them where it needs to, leading to bugs like this one. Rust catches that mistake at compile time instead.


Not if the memory has been allocated on a shared memory segment, Rust has no control over what other processes might do.


Sound Rust code would either make functions touching the shared memory marked unsafe, or would do a defensive copy out of shared memory.


That safe layer around unsafe still has no way to validate the consistency of the data.


It can't proactively validate the data while it's in the shared memory.

If you do your validation during accesses it's fine. If you copy the data out of the shared memory it's fine.

Or you could use a mutex to protect the data between validation and use.

If you're worried about another process editing the memory without taking the mutex, that's equivalent to worrying about other unsafe code editing the memory without taking the mutex. The solution is the same in both place: don't share memory with completely arbitrary code. When people compare languages and techniques, they (rightfully) assume you're not doing that.


Right, but in rust, not using one is a compile time error. In Java (as you can see by the article), not using one is a silent bug at runtime.


Only for in-memory data structures under Rust's control, if it is related to OS IPC, Rust cannot do anything.


This isn't true.


This is a heavily optimized system library - you don’t use mutexes here. Rust wouldn’t help here, if mutexes would be fine, they would have been used. Especially that this is the result of C++ and Java code simultaneously.

Hell, it’s probably one area where rust’s benefits are a “hard sell” — you would have to constantly be in unsafe rust manipulating pointers manually as the compiler can’t reason statically about what a layer built on top does without a huge runtime cost (huge, as in you really don’t want to lock/unlock, or even refcount in these hot paths).


No idea why Thaxll and the other comments are mentioning mutexes.

The equivalent (*) API to this Java API in Rust does exist; it's `String::from_utf8(Vec<u8>) -> String`. And the bug in TFA does not exist there. Since the signature consumes the `Vec<u8>` it's impossible for the caller or any other code to still have access to it to be able to modify it concurrently.

Also consider the similar API `str::from_utf8(&[u8]) -> &str`. The bug in TFA does not exist here either. Since the signature takes a `&` borrow of the slice, it is not possible for anything else in the program to have a `&mut` borrow of that slice to be able to modify it concurrently. After the function returns other parts of the program could mutate the slice, but they would only be able to do after the `&str` that is derived from the slice is dropped. So once again nothing would be able to mutate the slice and observe the effects in the `&str` itself.

All these "unable to do" are enforced at compile-time, because "consuming a value makes it unavailable to other parts of the program" and "cannot get a `&mut` to a value as long as a `&` from that value is still in scope" are all typesystem concepts. No mutexes or other runtime checks are involved.

(*) "Equivalent" in that it's an API to convert a sequence of bytes into a string. The Rust API doesn't have the encoding thing of the Java API because the Rust String / str are required to be utf-8 internally. But if an exact equivalent of the Java API did exist in Rust, the signature would still be the same wrt consuming `Vec<u8>` / borrowing `[u8]`, so it doesn't change the overall point re: concurrent modification. Furthermore, concurrent modification would cause problems even with Rust Strings if it was possible, because it would allow a String / str to become invalid utf-8 after they'd already been checked to be valid utf-8, which Rust considers to be UB.


> No idea why Thaxll and the other comments are mentioning mutexes.

Thaxll mentioned mutexes in a reply to the statement

Java has no way to express the concept of "something that nothing else can modify while I'm looking at it"

Even ignoring the performance aspect that is not the perfect answer, though. AFAIK, the JVM doesn’t have a notion of “you can only modify foo if you hold mutex bar”. That remains something the programmer must enforce.

On the other hand, tooling exists to help them, for example https://www.javadoc.io/doc/com.google.code.findbugs/annotati...


The scenario I was imagining and commenting on was about “implementing a JVM with Java’s semantics in Rust”. Of course if we limit the language itself to safe Rust, we get data race freedom, but at a quite significant price for a high level language (it constraints possibly correct programs down a lot). But Rust would not help with relation to the primitives here at all (implemented in C++/Java).


"Rust wouldn't help"

"This bug can't be implemented in rust"

"I meant that Rust doesn't fix the bug in Java. Even if you write rust code, you can also write buggy java code too so rust didn't fix the java code"

You're the only one here who thought "rust" meant "java semantics implemented in rust" in this context.


Because in case of the problem at hand, this is a complex interplay between Java's standard library's Java code and the underlying JVM. There is not much to discuss regarding "rust would make the code safe", because so does JS as it is single threaded.. That's hardly interesting.

If we put Java on top of Rust, then no, Rust no longer can help about this. That was my whole point.


> That's hardly interesting

Rust and javascript having differences which prevent this class of bugs might not be very interesting, but it's more interesting than your point.

Unless I'm misunderstanding, your point is that a bug in Java cannot be avoided by switching languages to Java.


No, my point is that changing the implementation language of java wouldn't have helped here.


The problem here is that we don't want a mutex. Once you have it the performance cost would apply in runtime. In fact, to write this code in rust you would need to write unsafe code to get around the problem where Rust forces you to write correct but inefficient code.

This code is intentionally not thread-safe. This isn't so much a bug but an interesting thought experiment.


Rust absolutely helps here because in Rust it’s simply impossible for someone else to mutate something concurrently to you holding a reference to it. Code equivalent to that in the article simply won’t compile in Rust. This is, like, the very point of Rust’s borrow system. You can share, xor you can mutate, but not both at the same time. This holds equally for single and multi-threaded code.


In safe Rust, that is. For unsafe Rust, I don't know exactly which bets are off but it's more than none.


In unsafe rust this is a concurrent modification of an object with shared references, which is an UB.


Unless everyone is just holding pointers


Huh? This is exactly where Rust would help. In Rust the caller of the constructor would either have to add mutex if they needed concurrency, or just use the constructor without mutex overhead if they did not.


It's a compile-time cost instead of a runtime cost.


99% of the time, the calling code trivially owns the array. If you are in a situation where the compiler cannot figure that out, then you need to deal with it regardless of what String does, because the exact same problem exists by the caller itself having a reference to the object.


Java compiler have a (not too bad) escape analysis engine. For something as low level as String intern/optimisation, it can be done in compile time


What about Rust’s borrow checker (affine types) enforces the use of mutexes (or other sync prims) here?


As sibling comments point out, mutexes aren't needed here. But to answer your direct question, Rust's type system enforces the use of mutexes to access protected values (if you're using the stdlib Mutex implementation) by only allowing access to protected values through a MutexGuard object which is created by locking the mutex. The borrow checker enforces you can't access the MutexGuard concurrently, so therefore you can't access the protected value concurrently.


Why would it need to? Rust's borrow checker makes it a compile-time error to share a mutable array between threads. No need for run time synchronization.


Right, that’s my understanding. But OP and a sibling thread here seem pretty sure about the mutex thing.

I think there’s some nuance, but not in the general case.

Shared memory, lazy statics in async blocks, and asynchronous constructors might have different initialization order mechanics that would require synchronization — but even then, the borrow checker would at least point it out


Without a mutex, you can’t even write code equivalent to that in the article because you cant mutably share as you pointed out. With a mutex you could – and the mutex would prevent data races (but not race conditions in general) – but indeed mutexes are a red herring here (at least in the specific sense of a runtime synchronization primitive).

In Java you can’t synchronize defensively because synchronization requires that everybody who has access to the shared resource cooperates with you. And even if you could, you wouldn’t want to, not in this sort of a case.

In Rust mutexes own the data they protect, and make it impossible for anyone to access the data without locking the mutex first, but again, an API like this would clearly not bother with dealing with mutexes but rather take a normal compile-time-checked borrow.


Unfortunately I can no longer edit my comment, please have a look at my reply: https://news.ycombinator.com/item?id=36690710


Out of interest, how should this be handled?

Is this a bug in Java which should be fixed (looks like that to me)? My understanding was Java generally doesn't do "you did an undefined behaviour, so it's your fault", except for specifically marked very low-level interfaces.


I can only think of a couple of ways to fix this, none of them ideal from a performance perspective:

- Make a defensive copy of the passed-in character array, which would be immediately discarded when it is encoded to bytes. This basically sucks given how often String creation happens in a typical Java program.

- Dispense with the whole use of the coder to check for non-equality of Strings, and insist on a character-by-character comparison (possibly from different encodings) for every call to String.equals() where the lengths are the same. Again, this would be a lot of wasted cycles in an extremely common JVM op.

I think the right answer here is to add something to the Javadoc about "modifying the input character array concurrently with String construction invites dragons." :-)


Another way of fixing it would be, whichever character caused 'StringUTF16.compress' to fail (we'd need to return it), make sure that character is kept in the final string (could just assign it at the end, remember it's value and location earlier).

By merging in the code of StringUTF16.compress and StringUTF16.toBytes into one function (and they are both very small), this wouldn't even have to slow things down, once you noticed location 'i' has char 'c' which isn't LATIN1, you make the new buffer, copy 0..i-1 over, then your known non-LATIN1 'i', then i+1..len.

This might be very slightly slower, but would fix (I think) the problem of breaking interning, which feels quite painful.


Clever! I like this! I feel a bit jealous that I didn't think of it :)

It's not quite as easy as it seems, because these methods are "intrinsics". That is, the pure Java code you see is not always the one that gets used; instead, the JIT compiler can use a faster implementation that uses vectorized assembly or whatever. (That's why you see "@IntrinsicCandidate" on compress() and toBytes() in StringUTF16.java)

But I think your idea would also be possible to implement in vectorized assembly, so it still works!

The rule for most things (such as ArrayList) in the JDK is: "if you use race conditions to break this thing, that's your problem, not ours". But String is different: it's one of the few things meant to be "rock-solid, can't break it even if you try", so I think this bug in String would qualify as a potential security issue in their mind: there are many places in Java that trust Strings not to act weird, and some of them are even in native code deep in the guts of the VM.

On the other hand, String is used all over the place so having to introduce a performance regression to fix this bug would be quite painful. All of the other proposed solutions in this thread introduce an extra copy, and an extra pass over the string. Your fix is basically no extra cost, and as a bonus, can be tweaked so each char in the array is read only once.

Which means that the bug can now be fixed "guilt-free", if anyone from the JDK team is reading this thread. Though they have some pretty clever people there too, so they might have thought of it eventually for all I know.


Comparing strings is extremely common, sure.

Is comparing two strings of the same length but different encodings all that common?


Java definitely does "you wrote thread dangerous code, so it's your fault" for APIs not marked as being thread safe.


This is yet another way that running untrusted code inside the same JVM is a terrible mess. There's a lot of JVM state that gets "locked in" on first use (e.g. <clinit>) and a malicious bit of code could corrupt a LOT of shared data (like the post's mentioned string internment zone) even if you sanitize all of your inputs and outputs.

I wonder if you could do something nasty with this bug from inside an IntelliJ plugin...


For what it's worth, Java does at least give some guarantees in case of data races -- the observed value will always be one that was explicitly set by one thread. This is different from most other languages, e.g. in C,C++, unsafe Rust it is UB.

Of course it can and still will result in invalid states.


> Of course it can and still will result in invalid states.

While of course they can't stop you from creating "invalid" states of your own types, whether through a data race or just bad coding - Java's own types should not have invalid states which can come into existence this way.

For example, suppose we've got a Goose type, and it can be Happy or Sad, and when it's Sad it has a Reason, when it is Happy there's no Reason. We, as Java, should not design this type so that it's possible for it to get flipped from Happy to Sad without choosing a Reason. As a result, after a race the Goose might be Happy when you expected Sad, or vice versa, but it can't enter the invalid state where it's Sad but for no Reason.


If you read a double or long while writing to it from another thread, you are not guaranteed to read a value that was ever explicitly set because they update 32 bits at a time rather than atomically.


As per the specification, you are right. As per the actual implementation, no :D it is as cheap, if not cheaper to set it atomically on 64-bit.


If I would design JVM from the scratch, I'd introduce frozen arrays. There are so many array copies inside JDK code, it hurts.

Whether it's possible to retrofit JVM today, I don't know.


Not sure if you are referencing this, but there were a proposal: https://openjdk.org/jeps/8261007


I just added solutions to the empty String challenge in the blog post. This includes a very interesting find from Xavier Cooney, that causes the same problem without involving any concurrency. It instead makes StringBuilder misbehave by throwing an exception at an unexpected place: https://gist.github.com/XavierCooney/e9f6235f05479ac6bf962ca...


It is possible to fix this String constructor implementation without creating a defensive copy of the input array or having a TOCTOU vulnerability.

    // Change this implementation to a loop.
    public String(char[] value) {
        while (true) {
            byte[] temp = StringUTF16.compress(value);
            if (temp != null) {
                this.value = temp;
                this.coder = LATIN1;
                break;
            }
            
            temp = StringUTF16.toBytes(value);
            if (temp != null) {
                this.value = temp;
                this.coder = UTF16;
                break;
            }
        }
    }

    // This implementation stays the same.
    static byte[] StringUTF16.compress(char[] value) { ... }

    // Change this contract and implementation so that it returns null
    // if all characters are below 256, otherwise it returns byte[].
    // The difference is that previously, this function would never return null.
    // Now, we make sure that the function succeeds if and only if the
    // char array *requires* UTF-16 as opposed to Latin-1.
    static byte[] StringUTF16.compress(char[] value) { ... }


An unconditional loop with no guarantee of forward progress may loop indefinitely, and hence, is not a sensible solution to the problem.


It only loops if you modify the string in certain ways partway through the loop. Is that a significant problem? As soon as you stop your indefinite loop of race-condition writes, this loop is guaranteed to finish.


Well it's trading "bad code can populate the program's string intern table with invalid string objects" for "bad code can instantly deadlock the program", which is not much of an upgrade. And wouldn't you need to do this in every function that uses more than 2 or more related mutable objects 1 time each, or uses 1 mutable object more than 1 time? Do you know of any systems that work like this?

This is basically a very poor man's version of software transactional memory. Noticing this is one step on the road to realizing that shared memory concurrency needs cooperative synchronization (and locks are just one way to achieve that), and most important of all, you should strictly limit the number of functions that need to synchronize at all, by strictly limiting the number of shared data objects.

I think the article OP and many in the comments here have taken the wrong lessons from this. I think the real lessons are:

1. In a program containing data races, one cannot assume objects obey their stated invariants.

2. Therefore, security/correctness in a shared memory concurrent system cannot be achieved if there is untrusted/unverified code (i.e., code that may introduce data races).

3. Regardless, it may pay dividends to try harder to learn from 1 and do better at validating input, especially in silently pervasively used shared state (e.g., the string intern table). Unfortunately, I think this will always be best effort.


I get it, you find the infinite loop to be distateful. Here's an alternative: Remove the loop wrapper. Add `throw new ConcurrentModificationException();` at the end. Don't modify the two helper functions. This way, the constructor only gets one chance to try doing the encoding. Change the Javadoc to say that if the array contents are modified while the constructor is executing, the function may (but is not guaranteed to) throw ConcurrentModificationException.


> it's trading "bad code can populate the program's string intern table with invalid string objects" for "bad code can instantly deadlock the program", which is not much of an upgrade.

A lot more harm can come from code doing the wrong thing or producing wrong answers than from deadlocks.


It has to be almost deliberately bad to cause a deadlock.

This solution stops you from having random occasional breakage.


Are you really going to put while(true) around every nearly every access of mutable state in your programs? That is the implication I get from your comment. I really think you should think hard before trying to dig your way out of a hole like that.


It's not very different from grabbing a mutex on the string. Except that someone is more likely to hang a mutex forever than to make this code hang forever. So yes that's fine. Why wouldn't it be?

And "every time a string is copied" is pretty far from "nearly every access of mutable state".

Also unless the string-changing code has a very elaborate timing attack, won't it only make the constructor loop again about half the time at most?


This thread is honestly crazy to me. I've never seen so many people advocate for hot spin loops as a patchwork attempted solution to data races. Never mind have I never seen this recommended in any serious literature, I've never even seen it suggested in a random blog post, or in the worst code bases I've ever worked with (I have worked on some not great code in my professional experience in the past).

And the reason I am talking about mutable state, is because these issues are not specific to strings! Why do you think there's something special about these kinds of data races that make them specific to strings?

I really, genuinely hope you take some time to read more deeply into this topic, and talk with your peers about this some more, before you implement any of these ideas.

The fundamental lesson I think is that data races must be eliminated, and there is no solution that tolerates them or works around them. You must eliminate them from your code.

A key related lesson is that there is no unilateral way to do this for shared mutable state. If there is shared mutable state, 2 or more concurrent threads must co-operate in order to avoid data races (all threads must lock, or use atomics, or fences, etc.). "There is no unilateral solution", means there is no way to loop or lock or fence in just one thread, in order to fix or resolve or work around some other thread that does not.


I didn't advocate it. I said the deadlock risk is not meaningful. You're badly misreading me.


I'm genuinely glad you're not advocating for it. If your reasoning for that is not the same as mine, I hope it's equally critical.


I enjoyed the article, but if I may express a peeve of mine... In the code listings, can we please not use a syntax coloring scheme that makes the comments nearly unreadable? Especially in blog posts like this, where the code deliberately contains numerous explanatory comments. Such low-contrast text slows down my tired old eyes.


> Why is "foo!".equals("foo⁉") false?

I don't really understand this question. They...look different? One is an exclamation mark, and the other is an exclamation mark/question mark combo?


This is not a question.

The paragraph after that get into the implementation details on how have know they are different without comparing the bytes.


in the words of linus, java is a horrible language.




Consider applying for YC's W25 batch! Applications are open till Nov 12.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: