Agreed, I'm curious as well. We load-tested with faux users on real clients, up to 1 million concurrent, and only stopped at 1 million because the test was becoming cost-prohibitive.
Under Performance: per watt, the fuse/rupy platform completely crushes the competition for real-time action MMOs, for two reasons:
- Event-driven protocol design, averaging about 4 messages/player/second (which means you cannot do spraying or headshots, for example; in my game-design opinion that is a feature, not a limitation).
- Java's memory model, with atomic concurrency over shared memory, which needs a VM and GC to work (C++ copied that memory model in C++11, but it failed completely because C++ lacks both a VM and a GC; that model is still, to this day, the one C++ uses). You can read more about this here: https://github.com/tinspin/rupy/wiki
These keep the server's internal latency below maybe 100 microseconds at saturation, which no C++ server can come close to unless it copies Java's memory model and adds a VM + GC so that all cores can work on the same memory at the same time without locking!
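A minimal sketch of what "all cores working on the same memory without locking" looks like in Java (my own example, not rupy's actual code; the class and key names are invented): many threads mutate one shared map with no explicit locks in user code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class SharedState {
    public static void main(String[] args) throws InterruptedException {
        // One map shared by all worker threads, no synchronized blocks.
        ConcurrentHashMap<String, LongAdder> hits = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) {
                    // computeIfAbsent + LongAdder: lock-free on the hot path.
                    hits.computeIfAbsent("messages", k -> new LongAdder()).increment();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        System.out.println(hits.get("messages").sum()); // prints 800000
    }
}
```

Whether this beats a hand-tuned C++ equivalent per watt is exactly the open question of the thread; the point is only that the JVM makes this style cheap to write.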
You can argue those are bad arguments, but if you look at performance per watt, with some consideration for developer friendliness, I'm pretty sure that in 100 years we will still be writing minimalist Java SE (or some Oracle-free copy) on the server and vanilla C (compiled with a C++ compiler, gcc/cl.exe) on the client to avoid cache misses.
None other than the one linked in the comment above. I have been reaching out to EVERYONE, and nobody can explain this to me, but I'll implement it myself soon so I can explain it.
My intuition tells me the VM provides a layer decoupled from the hardware memory model, so there is less "friction", and the GC is required to reclaim shared memory that C++ would need to "stop the world" to reclaim anyhow (all concurrent C++ containers leak memory; see TBB's concurrent_hash_map, for example). That means the code executes slower, BUT the atomics can work better.
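To make the reclamation point concrete, here is a small illustration of my own (not from TBB or rupy): in Java, a value removed from a shared map stays valid for any thread still holding a reference, and the GC frees it only after the last reference is gone. In C++ the remover has to decide when `delete` is safe, which is the hard problem (hazard pointers, epochs) that concurrent containers struggle with.

```java
import java.util.concurrent.ConcurrentHashMap;

public class ReclaimDemo {
    public static void main(String[] args) {
        ConcurrentHashMap<Integer, int[]> shared = new ConcurrentHashMap<>();
        shared.put(1, new int[]{42});

        int[] stillInUse = shared.get(1); // a "reader" keeps a reference
        shared.remove(1);                 // removal does not invalidate it

        // Safe: the GC reclaims the array only once nobody references it.
        System.out.println(stillInUse[0]); // prints 42
    }
}
```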
As I said: for 5 years I have been searching for answers from EVERYONE on the planet and nobody can answer. My guess is that this is so complicated only a handful can even begin to grok it, so nobody wants to explain it because it would create a lot of wasted time.
The usual reaction is: Java is written in C, so how can Java be faster than C? Well, I don't know how, but I know it's true because I use it!
So my answer today is: Java is faster than C if you want to share memory directly between threads efficiently, because you need a VM with a GC to make the Java memory model (which everyone has copied, so I guess it must be good?) work!
But no guarantees... you never get those with C/C++. I stopped downloading C/C++ code from the internet unless it has 100+ proven users, so stb/ttf and kuba/zip are my only dependencies.
> My intuition tells me the VM provides a layer decoupled from the hardware memory model, so there is less "friction", and the GC is required to reclaim shared memory that C++ would need to "stop the world" to reclaim anyhow (all concurrent C++ containers leak memory; see TBB's concurrent_hash_map, for example). That means the code executes slower, BUT the atomics can work better.
I dunno about the GC bits; after all, object pools are a thing in C++, so you have a consistent place (getting a new object) where reclamation of unused objects can be performed.
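That "consistent place" can be sketched like this (a hypothetical pool of my own naming, written in Java for brevity; the C++ shape with a free list is the same): acquire() is the one spot where unused objects are recycled instead of being freed individually.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Supplier;

class Pool<T> {
    private final ConcurrentLinkedQueue<T> free = new ConcurrentLinkedQueue<>();
    private final Supplier<T> factory;

    Pool(Supplier<T> factory) { this.factory = factory; }

    T acquire() {
        T t = free.poll();                    // reuse a recycled object...
        return t != null ? t : factory.get(); // ...or make a fresh one
    }

    void release(T t) { free.offer(t); }      // hand back for reuse
}

public class PoolDemo {
    public static void main(String[] args) {
        Pool<StringBuilder> pool = new Pool<>(StringBuilder::new);
        StringBuilder a = pool.acquire(); // fresh object
        pool.release(a);
        StringBuilder b = pool.acquire(); // recycled: same instance as a
        System.out.println(a == b);       // prints true
    }
}
```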
I think it might be down to mutex locking. In a native program, a failure to acquire a mutex causes a context switch by performing a syscall (the OS steps in, flushes registers, cache, everything, and runs some other thread).
In a VM language I would expect that a failure to acquire a mutex can be profiled by the VM with simple heuristics (only one thread waiting for the mutex? Spin until it's released. More than five threads in the wait queue? Run some other thread).
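The heuristic I mean can be sketched like this (my own toy version, not HotSpot's actual adaptive locking; the thresholds are arbitrary): spin briefly while contention is light, and back off to the scheduler when the wait queue grows.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class AdaptiveSpinLock {
    private final AtomicBoolean held = new AtomicBoolean(false);
    private final AtomicInteger waiters = new AtomicInteger(0);

    void lock() {
        waiters.incrementAndGet();
        int spins = 0;
        while (!held.compareAndSet(false, true)) {
            if (waiters.get() <= 2 && spins++ < 1_000) {
                Thread.onSpinWait(); // light contention: burn a few cycles
            } else {
                Thread.yield();      // heavy contention: run someone else
            }
        }
        waiters.decrementAndGet();
    }

    void unlock() { held.set(false); }
}

public class LockDemo {
    public static void main(String[] args) throws InterruptedException {
        AdaptiveSpinLock lock = new AdaptiveSpinLock();
        int[] counter = {0};
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) {
                    lock.lock();
                    counter[0]++;    // protected by the lock
                    lock.unlock();
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(counter[0]); // prints 40000
    }
}
```

Real JVMs do something in this spirit (biased/adaptive locking) inside the runtime, where the profiling information lives, which is the advantage I'm pointing at.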