Not other than the one linked in the comment above. I have been reaching out to ...

lelanthran · on Feb 13, 2022

The links upthread don't actually explain why a VM + GC can do shared-memory concurrency faster[1].

I don't understand what particular piece of magic makes shared-memory concurrency under a VM+GC faster than a CAS implementation.

[1] I'm assuming a shared-memory threaded model of concurrency, not a shared-nothing message passing model of concurrency.

bullen · on Feb 14, 2022

CAS?

Me neither, but I know it does in practice.

My intuition tells me the VM provides a layer decoupled from the hardware memory model so that there is less "friction" and the GC is required to reclaim shared memory that C++ would need to "stop the world" to reclaim anyhow! (all concurrent C++ objects leaks memory, see TBB concurrent_hash_map f.ex.) That means the code executes slower BUT the atomics can work better.

As I said; for 5 years I have been searching for answers from EVERYONE on the planet and nobody can answer. My guess is that this is so complicated, only a handfull can even begin to grook it, so nobody wants to explain it because it creates alot of wasted time.

The usual reaction is: Java is written in C, so how can Java be faster than C? Well I don't know how but I know it's true because I use it!

So my answer today is: Java is faster than C if you want to share memory between threads directly efficiently because you need a VM with GC to make the Java memory model (which everyone has copied so I guess it must be good?) work!

Here is someone who knows his concurrency and made C++ maps that might be better than TBB btw: https://github.com/preshing/junction

But no guarantees... you never get those with C/C++, I stopped downloading C/C++ code from the internet unless it has 100+ proved users! So stb/ttf and kuba/zip are my only dependencies.

lelanthran · on Feb 14, 2022

> CAS?

https://en.wikipedia.org/wiki/Compare-and-swap

> My intuition tells me the VM provides a layer decoupled from the hardware memory model so that there is less "friction" and the GC is required to reclaim shared memory that C++ would need to "stop the world" to reclaim anyhow! (all concurrent C++ objects leaks memory, see TBB concurrent_hash_map f.ex.) That means the code executes slower BUT the atomics can work better.

I dunno about the GC bits; after all object pools are a thing in C++ so you have a consistent place (getting a new object) where reclamation of unused objects can be performed.

I think it might be down to mutex locking. In a native program, a failure to acquire the mutex causes a context-switch by performing a syscall (OS steps in, flushes registers, cache, everything, and runs some other thread).

In a VM language I would expect that a failure to acquire a mutex can be profiled by the VM with simple heuristics (Only one thread waiting for a mutex? Spin on the mutex until its released. More than five threads in the wait queue? Run some other thread).