Not other than the one linked in the comment above. I have been reaching out to EVERYONE, and nobody can explain this to me, but I'll implement it myself soon so I can explain it.
My intuition tells me the VM provides a layer decoupled from the hardware memory model so that there is less "friction" and the GC is required to reclaim shared memory that C++ would need to "stop the world" to reclaim anyhow! (all concurrent C++ objects leaks memory, see TBB concurrent_hash_map f.ex.) That means the code executes slower BUT the atomics can work better.
As I said; for 5 years I have been searching for answers from EVERYONE on the planet and nobody can answer. My guess is that this is so complicated, only a handfull can even begin to grook it, so nobody wants to explain it because it creates alot of wasted time.
The usual reaction is: Java is written in C, so how can Java be faster than C? Well I don't know how but I know it's true because I use it!
So my answer today is: Java is faster than C if you want to share memory between threads directly efficiently because you need a VM with GC to make the Java memory model (which everyone has copied so I guess it must be good?) work!
But no guarantees... you never get those with C/C++, I stopped downloading C/C++ code from the internet unless it has 100+ proved users! So stb/ttf and kuba/zip are my only dependencies.
> My intuition tells me the VM provides a layer decoupled from the hardware memory model so that there is less "friction" and the GC is required to reclaim shared memory that C++ would need to "stop the world" to reclaim anyhow! (all concurrent C++ objects leaks memory, see TBB concurrent_hash_map f.ex.) That means the code executes slower BUT the atomics can work better.
I dunno about the GC bits; after all object pools are a thing in C++ so you have a consistent place (getting a new object) where reclamation of unused objects can be performed.
I think it might be down to mutex locking. In a native program, a failure to acquire the mutex causes a context-switch by performing a syscall (OS steps in, flushes registers, cache, everything, and runs some other thread).
In a VM language I would expect that a failure to acquire a mutex can be profiled by the VM with simple heuristics (Only one thread waiting for a mutex? Spin on the mutex until its released. More than five threads in the wait queue? Run some other thread).