
You may presume so, but you would be wrong.

They are technically lock-free as well, so they require atomic operations when announcing additions, but being written by only one process, ownership of cache lines never changes, so hardware locks are not engaged.

Readers do poll, and sleep for as long as the tolerated latency when nothing new is found. Writers batch additions, and announce new entries only so often.
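A minimal sketch of the scheme described above (all names and parameters are mine, not the poster's): one writer, one reader, no locks. The writer stages a batch and announces it with a single release-store of `head`; the reader polls `head` with an acquire-load and can sleep for its latency budget when nothing new has appeared.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Single-producer/single-consumer ring buffer sketch. The writer owns
// `head` (the only shared atomic); the reader owns `tail` privately.
template <typename T, std::size_t N>
struct SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");

    std::array<T, N> buf{};
    alignas(64) std::atomic<std::size_t> head{0};  // written only by the producer
    alignas(64) std::size_t tail{0};               // owned by the consumer

    // Producer side: stage several entries, then announce them all with
    // one atomic release-store. (Sketch only: assumes the writer never
    // outruns the reader by more than N entries.)
    void publish_batch(const std::vector<T>& batch) {
        std::size_t h = head.load(std::memory_order_relaxed);
        for (const T& v : batch) buf[h++ & (N - 1)] = v;
        head.store(h, std::memory_order_release);  // the single announcement
    }

    // Consumer side: returns false when nothing new has been announced,
    // at which point the caller may sleep for its tolerated latency.
    bool try_pop(T& out) {
        if (tail == head.load(std::memory_order_acquire)) return false;
        out = buf[tail++ & (N - 1)];
        return true;
    }
};
```

A reader loop would then be `while (!ring.try_pop(v)) nanosleep(...)`, with the sleep duration bounding worst-case latency as described.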




Interesting, though the overhead of nanosleep vs. a futex wait with a non-NULL timeout should be dwarfed by the context switch. I can see that if you have extreme latency and/or throughput constraints on the writer side of the ring buffer, you couldn't afford the extra atomic read and occasional futex wake on the writer side to wake the consumer.

> but being written by only one process, ownership of cache lines never changes, so hardware locks are not engaged.

Which cache coherency protocol are you referring to where cache lines are "locked"? I'm aware that early multiprocessor systems locked the whole memory bus for atomic operations, but my understanding is that most modern processors use advanced variants on the MESI protocol[0]. In the MOESI variant, an atomic write (or any write, for that matter) would put the cache line in the O state in the writing core's cache. If we had to label cache line states as either "locked" or "unlocked", the M, O, and E states would be "locked", and S and I would be "unlocked".

Your knowledge of cache coherency protocols is probably much better than mine, so I'm probably just misinterpreting your shorthand jargon for "locked" cache lines. By "hardware locks are not engaged" do you just mean that the O state doesn't ping-pong between cores?

[0] http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2...


> *...you just mean that the O state doesn't ping-pong between cores?*

Correct. My terminology was sloppy, sorry. I have no experience with this "context switch" you speak of; my processes are pinned and don't make system calls or get interrupted. :-)

The writing core retains ownership of the cache lines, so no expensive negotiation with other caches occurs. When the writer writes, corresponding cache lines for other cores get invalidated, and when a reader reads, that core's cache requests a current copy of the writer's cache line.

The reader can poll its own cache line as frequently as it likes without generating bus congestion, because no invalidation comes in until a write happens.
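A sketch of that polling loop (my code, not the poster's): each failed load hits the reader's own copy of the line and generates no interconnect traffic; only after the writer's store invalidates the line does one load go out for the fresh copy.

```cpp
#include <atomic>
#include <thread>

// Spin on a shared counter until it moves past last_seen. While the
// value is unchanged, every load is satisfied from the polling core's
// local cache (S state), so the loop causes no bus congestion.
unsigned spin_until_change(const std::atomic<unsigned>& head,
                           unsigned last_seen) {
    unsigned h;
    while ((h = head.load(std::memory_order_acquire)) == last_seen) {
#if defined(__x86_64__) || defined(__i386__)
        __builtin_ia32_pause();  // x86 PAUSE: eases pressure on the sibling hyperthread
#else
        std::this_thread::yield();
#endif
    }
    return h;
}
```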

There is a newish Intel instruction that just sleeps until a cache line is invalidated, without competing for execution resources with the other hyperthread or burning watts. Of course, compilers don't know about it. I don't know whether AMD has adopted it, but I would like to know; I don't recall its name.



