
Atomic instructions are not locks at the instruction level (the lock prefix on x86 is just a historical artefact).

edit: also, lock-free doesn't have much to do with locks.
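
To illustrate, a minimal C++ sketch: an atomic increment is a single lock-prefixed instruction on x86-64, not a mutex acquire/release, and is_lock_free reports the bounded-progress property:

  // Sketch: an atomic increment is one instruction, not a lock.
  // On x86-64, fetch_add on a 64-bit atomic typically compiles to
  // a single `lock xadd`.
  #include <atomic>
  #include <cstdio>

  std::atomic<long> counter{0};

  int main() {
      counter.fetch_add(1, std::memory_order_relaxed);  // lock xadd
      // "lock-free" means the operation finishes in a bounded number
      // of steps regardless of other threads -- no thread holds a
      // lock that could be preempted mid-operation.
      std::printf("lock-free: %d\n", counter.is_lock_free());
      return 0;
  }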



It's a lock associated with the cache line that the atomic operation operates on. Atomics are built on top of the cache coherency mechanism: a synchronous, blocking operation at the hardware level used to implement an asynchronous mechanism at the software level.

The big frustration with today's multi-core CPUs is that there's simply no efficient way to communicate using message-passing mechanisms. I think this is something the hardware guys should focus on :-) Provide an async mechanism to communicate between cores that doesn't rely on cache coherency.


There is no lock associated with the cache line.

The coherency protocol guarantees that a core can own a cache line in exclusive mode for a bounded number of cycles; this guarantees forward progress, and it is different from an unbounded critical section. It also has nothing to do with the lock prefix, and it applies to normal non-atomic writes as well.

What the lock prefix does is delay the load associated with the RMW so that it executes together with the store, before the core has a chance to lose ownership of the line (technically this can also be implemented optimistically with speculation and replays, but you still need a pessimistic fallback to maintain forward progress).
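
In software terms, the optimistic variant looks like a compare-exchange retry loop (a sketch; LL/SC machines like ARM and POWER implement atomic RMWs roughly this way, while the lock prefix gives you the pessimistic guarantee directly):

  // Optimistic RMW: read, compute, try to commit; retry if another
  // core touched the line in between.
  #include <atomic>

  long fetch_add_via_cas(std::atomic<long>& a, long delta) {
      long expected = a.load(std::memory_order_relaxed);
      // compare_exchange_weak may fail spuriously; on failure it
      // reloads `expected` with the current value, so the loop
      // makes progress.
      while (!a.compare_exchange_weak(expected, expected + delta,
                                      std::memory_order_acq_rel,
                                      std::memory_order_relaxed)) {
          // another core owned the cache line between load and store
      }
      return expected;  // value before the add
  }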

A message-passing feature would be nice, but it would necessarily be non-coherent, which means you could only pass serialised values to other threads, not pointers.

An alternative, more usable solution would be to speculate aggressively around memory barriers, as the memory subsystem is already highly asynchronous.


The effect is the same. If someone touches the cache line, it's evicted from all other caches that hold it, triggering a cache miss when other cores touch it. Everyone knows this. I just think it's a bit depressing that if you try to optimize message passing on many-core CPUs, you'll realize you can't make it fast. No one has been able to make it fast (I've checked Intel's message-passing code as well). If you get more than 20 Mevents/s through some kind of shared queue, you are lucky. That is slow compared to how many instructions a CPU can retire.
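
A rough sketch of where that ceiling comes from: two threads ping-ponging a single cache line, which is the round trip every cross-core handoff pays (numbers are illustrative):

  // Two threads bounce one cache line back and forth; every handoff
  // is a coherency miss, which caps the achievable event rate.
  #include <atomic>
  #include <chrono>
  #include <cstdio>
  #include <thread>

  alignas(64) std::atomic<int> token{0};
  constexpr int kRounds = 10'000'000;

  int main() {
      std::thread peer([] {
          for (int i = 0; i < kRounds; ++i) {
              while (token.load(std::memory_order_acquire) != 1) {}
              token.store(0, std::memory_order_release);
          }
      });
      auto t0 = std::chrono::steady_clock::now();
      for (int i = 0; i < kRounds; ++i) {
          token.store(1, std::memory_order_release);
          while (token.load(std::memory_order_acquire) != 0) {}
      }
      peer.join();
      double secs = std::chrono::duration<double>(
          std::chrono::steady_clock::now() - t0).count();
      std::printf("%.1f M round trips/s\n", kRounds / secs / 1e6);
      return 0;
  }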

So all these event-loop frameworks try to implement async behaviour on top of a hardware mechanism that is synchronous at its core. The main issue is that the lowest software construct, the queue used to communicate, is built on this cache-line ping-pong match. What the software wants is still a queue, and you could still send pointers if the memory they point to has been committed to memory by the time the receiving core sees them. Just remove the inefficient atomic-operations way of synchronizing the data between CPUs and send it over some kind of network pipe instead :-) As you say, the memory subsystem is really asynchronous.
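
A minimal single-producer/single-consumer ring, as a sketch of the pattern in question: the release store that publishes the tail is what "commits" the pointed-to data before the receiving core can see the pointer, and head/tail are exactly the cache lines that ping-pong:

  // Minimal SPSC ring. One producer thread calls push, one consumer
  // thread calls pop. N must be a power of two.
  #include <atomic>
  #include <cstddef>

  template <typename T, size_t N>
  struct SpscQueue {
      alignas(64) std::atomic<size_t> head{0};  // consumer side
      alignas(64) std::atomic<size_t> tail{0};  // producer side
      T* slots[N];

      bool push(T* p) {
          size_t t = tail.load(std::memory_order_relaxed);
          if (t - head.load(std::memory_order_acquire) == N)
              return false;                            // full
          slots[t % N] = p;
          tail.store(t + 1, std::memory_order_release); // publish:
          return true;  // data written before pointer becomes visible
      }
      T* pop() {
          size_t h = head.load(std::memory_order_relaxed);
          if (h == tail.load(std::memory_order_acquire))
              return nullptr;                          // empty
          T* p = slots[h % N];
          head.store(h + 1, std::memory_order_release);
          return p;
      }
  };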

I'm convinced this is coming; it should just have arrived 10 years ago...


>What the software wants is still a queue, and you could still send pointers if the memory they point to has been committed to memory by the time the receiving core sees them

Which means that the sender will need to pessimistically flush anything the receiver might need to access, which is likely more expensive than optimistic cache coherency.
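
A sketch of what that pessimistic flush looks like on x86, assuming a hypothetical non-coherent hand-off (notify_receiver is made up for illustration):

  // Write the payload, flush every cache line it spans back to
  // memory, fence, and only then signal the (non-coherent) receiver.
  #include <immintrin.h>  // _mm_clflush, _mm_sfence
  #include <cstddef>

  void notify_receiver(const void* p);  // hypothetical hand-off

  void send_non_coherent(const char* buf, size_t len) {
      if (len == 0) return;
      for (size_t off = 0; off < len; off += 64)  // 64-byte lines
          _mm_clflush(buf + off);
      _mm_clflush(buf + len - 1);  // cover a partial final line
      _mm_sfence();                // order flushes before the signal
      notify_receiver(buf);
  }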

Non-CC systems were a thing in the past, and still are for things like GPUs and distributed systems, but are much harder to program and not necessarily more efficient.

As the saying goes, there are only two hard things in programming...



