CAS or LL/SC is straightforward enough, although when you use them you have to be aware of the differences between processors: whether they lock cache lines, use exclusive reservation granules, have per-logical-core locks, and so on. There is one other minor complication to consider, the presence or absence of contiguous double-word CAS or LL/SC. Memory ordering behaviour and support, however, varies significantly across processors.

Intel is a pain in that regard, because its atomic operations come with a mandatory, built-in full memory barrier. ARM's do not, and I see the freelist on ARM running about 25% faster (relatively speaking) than on Intel because of it.

If you look at the first two bars in the one-core chart from these two gnuplots (the first bar is the new GCC atomic intrinsics, the second the old GCC sync intrinsics), the first gnuplot being ARM32 and the second a Core i5,

http://liblfds.org/pages/images/liblfds710_freelist_push1_th...

You will see on Intel they're level, and on ARM, the atomic bar (the first bar) is about 25% higher.

The new GCC atomic intrinsics only issue a memory barrier when told to, whereas the old sync intrinsics normally (i.e. on most platforms; the docs are a little nebulous) issue memory barriers.
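
To make the difference concrete, here's a minimal sketch in C of the same compare-and-swap written with both families of GCC builtins. The wrapper names (`cas_sync`, `cas_atomic_relaxed`) are made up for illustration and are not from liblfds, and the relaxed ordering is only there to show what "no barrier requested" looks like, not a claim about what any particular data structure needs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old-style sync builtin: there is no memory-order argument at all,
   and GCC documents these builtins as full barriers. */
static bool cas_sync(uint32_t *target, uint32_t expected, uint32_t desired)
{
    return __sync_bool_compare_and_swap(target, expected, desired);
}

/* New-style atomic builtin: the caller picks the memory order.
   Relaxed asks for atomicity only, with no ordering guarantee.
   On failure, *expected is overwritten with the value actually seen. */
static bool cas_atomic_relaxed(uint32_t *target, uint32_t *expected, uint32_t desired)
{
    return __atomic_compare_exchange_n(target, expected, desired,
                                       false,             /* strong CAS, not weak */
                                       __ATOMIC_RELAXED,  /* order on success */
                                       __ATOMIC_RELAXED); /* order on failure */
}
```

With `__ATOMIC_RELAXED` the compiler is free to emit no barrier on ARM. On x86 the CAS still compiles to a `lock cmpxchg`, which acts as a full barrier whatever ordering you asked for.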

The freelist doesn't need a memory barrier on pop, but on Intel you get one anyway, and on ARM with the sync intrinsics you get one anyway. The atomic intrinsics also issue one on Intel, because Intel forces it to happen, but they do not issue one on ARM. I think this is what gives the 25% performance improvement.