This is really an unscientific claim but I ran a hand crafted/hacked benchmark just to get a feeling for the numbers. For 5 to 35 character Strings, == is 20 to 40 times faster than String.equals().
Given that s1.equals(s2) if and only if s1.intern() == s2.intern() (assuming you haven't filled the string table), then this looks like an opportunity for a significant optimization.
Before doing this, I had hoped that String.equals might check if both were "interned" and shortcut the character by character comparison if this was the case by just comparing references. But interpreting the results of my rough benchmark would suggest this isn't what is happening which would agree with the source provided for the String.equals method.
Java String comparison is absolutely ubiquitous so I would have expected that an optimization like this might have been considered?
Having said that, the supplied rt.jar source also suggests that the String.hashCode() computation isn't cached/memoized. This strikes me as odd given that Strings are immutatable and Strings are one of the most common key type for Maps.
public boolean equals(Object anObject) {
if (this == anObject) {
return true;
}
...
So if both strings are interned, this shortcut will work. I guess, your benchmarks were wrong. Microbenchmarks in Java are hard to implement correctly.
hashCode is cached as well in field `hash` and lazily computed:
public int hashCode() {
int h = hash;
if (h == 0 && value.length > 0) {
char val[] = value;
for (int i = 0; i < value.length; i++) {
h = 31 * h + val[i];
}
hash = h;
}
return h;
}
Everything there is optimized till the last cycle, I'm sure about that.
I hadn't spotted that in the hasCode() implementation. Thanks, that makes sense.
Regarding .equals(), indeed that shortcut exists but it didn't help my benchmark because it spent nearly all it's time comparing Strings which were not equal.
I switched the benchmark to compare all equal (interned) Strings and even then there .equals() is about half the speed of the reference equality check. I guess because of the method call overhead?
Yes I would have been sure that everything to do with Strings would have been optimized to the last cycle hence my surprise. I'm not sure about the state of Java intrinsics with the Oracle JDK but I would have thought that most of the String methods would be handled by intrinsics and be very heavily optimized.
Edit: correction - indeed microbenchmarking is tricky and I had a bug. If the (interned) Strings are equal then .equals() and reference comparison run at roughly the same speed.
I don't think that JVM will do magic and execute something different from actual sources, even for String. On the other hand, optimizing those bytecodes to compile to perfect machine code or playing with code to produce bytecode which will be correctly optimized by JIT is more reasonable approach, not only java.lang.String would benefit from those optimizations, but any other developer with similar code.
Method call overhead is real and calling equals method is slower compared to reference comparison, but eventually JIT might decide to inline this method call, so there'll be no overhead after that. Anyway it's unlikely, that those difference will be noticeable in real code, IMO.
You should compare the cost of one (or two) .intern() plus '==' against a single .equals()
For example, if your code needs 10 .intern() and then performs 100 '==', that is possibly going to be faster than performing the equivalent 100 .equals() (or maybe not)
But if you are comparing a single couple of strings, .equals() is always going to be faster.
Not to mention that if you .intern() random strings, you risk running out of PermGen space. Even in the case of 100 comparisons among 10 strings, you may be better off using a custom HashMap.
I'm not suggesting that .equals() should be implemented by 2 inter() operations followed by a reference comparison. But that the implementation check whether the pair of Strings happened to be already .intern()ed, and then use the short circuit of a reference comparison.
A programmer could then decide to intern _some_ non-constant Strings in their application if it made sense (i.e. they were frequently used in comparisons or as hash keys).
And I don't think you can't run out of permgen as the String pool is a fixed size - unless I am misreading that section of the article.
However now that I've thought about a bit more, I think the flaw in the idea is unrelated to your points but that currently it's not cheap to determine if a particular String is interned. The StringTable implementation would have to be changed otherwise it would require adding storage overhead to every String which would not be acceptable. This is also probably why .hashCode() doesn't memoize it's results.
Given that s1.equals(s2) if and only if s1.intern() == s2.intern() (assuming you haven't filled the string table), then this looks like an opportunity for a significant optimization.
Before doing this, I had hoped that String.equals might check if both were "interned" and shortcut the character by character comparison if this was the case by just comparing references. But interpreting the results of my rough benchmark would suggest this isn't what is happening which would agree with the source provided for the String.equals method.
Java String comparison is absolutely ubiquitous so I would have expected that an optimization like this might have been considered?
Having said that, the supplied rt.jar source also suggests that the String.hashCode() computation isn't cached/memoized. This strikes me as odd given that Strings are immutatable and Strings are one of the most common key type for Maps.