Hacker News new | past | comments | ask | show | jobs | submit login
Myths Programmers Believe about CPU Caches (2018) (rajivprab.com)
366 points by noego on Nov 20, 2019 | hide | past | favorite | 119 comments



If cache coherence is relevant to you, I strongly recommend the book “A Primer on Memory Consistency and Cache Coherence”. It’s much easier to understand the details of coherency from a broader perspective, than an incremental read-a-bunch-of-blogs perspective.

I found that book very readable, and it cleared up most misconceptions I had. It also teaches a universal vocabulary for discussing coherency/consistency, which is useful for conveying the nuances of the topic.

Cache coherence is not super relevant to most programmers though. Every language provides an abstraction on top of caches, and nearly every language uses the “data race free -> sequentially consistent”. Having an understanding of data races and sequential consistency is much more important than understanding caching: the compiler/runtime has more freedom to mess with your code than your CPU (unless you are on something like the DEC Alpha, which you probably aren't).

If you are writing an OS/Hypervisor/Compiler (or any other situation where you touch asm), cache coherence is a subject you probably need a solid grasp of.


“A Primer on Memory Consistency and Cache Coherence” is part of the "Synthesis Lectures on Computer Architecture" which are 50-100 page booklets on topics related to HW components. All the booklet PDF's are available online [1].

edit: only those PDF's with a checkmark are available as PDF to download, the rest can be bought. Quite a few actually available for download.

[1] https://www.morganclaypool.com/toc/cac/1/1


Disagree on that last part. If more programmers understood cache coherency, maybe their programs would not run like a giant turd.


Are the details really helpful for performance work? MOSI, MOESI, MERSI, MESIF etc are 99% irrelevant to having the right metal model. "Dirtying cache lines across different cores/threads is slow" is most of what you need to know. Within the same core the coherency protocols between levels of cache is not really visible at all to software even as varying performance artifacts.

Of course you might end up analyzing assembly level perf traces in a hot path for some game console with a less known CPU architecture and making a cross cpu cache miss slightly less slow could just maybe be helped by the detailed understanding of the machine model, but by that time you're already far in the not-giant-turd territory (at least if you're optimizing the right thing).

Of course computer architecture is fascinating and fun to learn about.


> Of course you might end up analyzing assembly level perf traces in a hot path for some game console with a less known CPU architecture

Or if you work in finance, mining, oil industry, medical, biosciences and countless other fields where you need to get good performance out of your hardware. Yes, even if you use GPUs, they're no magic bullet, they also have their architectural bottlenecks, strengths and weaknesses.

Or if you care about power consumption.

There are a lot of reasons to optimize hot paths and inner loops. CPU single core performance isn't improving much anymore, and we need to make better use of what we have.


I was trying to say that even in those cases you almost never benefit from knowing the details of cache coherency.


That's a very bold statement.

You could just as well for example say that front end Javascript developer almost never needs to understand event callbacks or how DOM works.

If you write multithreaded high performance code, yeah, you do need to know about cache coherency at varying levels of detail. Sometimes rough rules of thumb work, not too often you need to understand all those annoying performance destroying details that leak through cache abstraction.


DOM and event callbacks are basic bread and butter for JS developers and in web apps it's essential to understand how DOM works and how to minimize DOM changes when doing performance work. Can you come up with an example where a oil or finance sector developer needs to understand MESIF?

Or are we having a terminology confusion and you use the coherency term for software visible performance characteristics of caches generally? I do agree that understanding cache effects and their intersection with multiprocessing is generally important in perf work. As is understanding the architected memory model, which tells you what you can and can't rely on semantically.


> Can you come up with an example where a oil or finance sector developer needs to understand MESIF?

Yes. When they're writing high performance multithreaded analysis software and they're deciding which cache lines to write to and which only to read from. Those lines you only read from can be in S (shared) or F (forward) state.

And why is this important? Performance characteristics of a line entering in M (modified) state are pretty bad, if the line is shared between multiple cores.

Perhaps you'll also want to do all reads at once, knowing the line will more likely remain in F state for short periods, instead of bouncing S/F line state between CPU cores?

NUMA comes also to play, making this mistake even more costly. You really want to keep inter-core (especially NUMA socket!) communication to the minimum.

Of course, you could say you don't strictly need to understand MESI (or MESIF), but it really helps understanding why you do things certain way and reasoning about multi-core performance. The thing is, you can say same "you don't need to know this" about a lot of "low level details" in software trade.

Just like you need to understand DOM as a front-end developer to minimize DOM changes, even if you don't access it directly.

In cache coherency case it's analogously about reducing unnecessary multicast and broadcast messages sent between cores.


Thank you for writing a scenario and how it relates to coherency mechanisms.

So now we get to thinking about whether this gives the developer an advantage over just knowing "Dirtying cache lines across different cores/threads is slow". I don't think I would conclude so here.

But yeah I like reading details about microarchitectural details and other computer architecture topics, and am symphatetic to the point of view that knowing the "why" is nice. Just like I find it interesting to read about how DOM APIs are implemented in browsers and why they are hard to make faster...


> So now we get to thinking about whether this gives the developer an advantage over just knowing "Dirtying cache lines across different cores/threads is slow". I don't think I would conclude so here.

The hidden thing behind all this is that even if the data is just read-shared, it can still generate traffic between cores and sockets.

Since these communication links are a shared resource [0], doing things wrong hurts performance in unrelated code and cores. Just because of storm of cache coherency packets is being sent between cores.

So yeah, you really do want to minimize this to maximize performance and scalability across the whole system!

[0]: In Intel's case, this shared resource is ring bus inside CPU socket and QPI between CPU sockets.


True, high validation traffic of read-shared data can indeed have effects in some cases, especially on multi socket systems, and it can sometimes be beneficial to have private copies of even read-only data for different threads if the data is small.


>> Dirtying cache lines across different cores/threads is slow

Very succinct and correct. This is particularly pernicious with atomics, which people use for lock-free stuff such as queues and work stealing thread pools. If atomics used by different threads/cores share a cache line and you mutate them a lot, perf gets instantly fucked, sometimes worse than if you used a mutex in the first place. And if you don't benchmark, you aren't going to notice.


Surely the whole point of good system design is a set of logical abstractions, where you need to understand the logical model and not the internal details - as these are free to be evolved.

Of course performance matters, but surely having performance tests, rather than trying to second guess what the whole stack below you might be doing, is

1. more efficient 2. more accurate 3. more likely to detect changes in a timely way.

That's not to say, you shouldn't be curious and deep understanding isn't a good thing.

Just saying understanding inside-out the abstraction you are working with ( eg Java Memory Model ) it's performance characteristics ( from real world testing ) - is more important than some passing knowledge of real world CPU design.

This app I am using right now is in a webbrowser - not sure how understanding cache coherency helps in a single threaded javascript.


> Surely the whole point of good system design is a set of logical abstractions, where you need to understand the logical model and not the internal details

In my opinion an important part of being a good programmer is understanding - at least at a broad level - how the set of abstractions you're working on top of work.

At the end of the day, our job is making computer hardware operate on data in memory. The more that we forget that, and think about computing as some abstract endeavor performed in Plato's heaven, the more tendency we have for bloat and inefficiency to creep into our various abstractions.

In other words, I think it's better to think about abstractions as a tool for interacting with hardware, not as something to save us from dealing with hardware.


Eh, idk. Most programs that run like a giant turd do it because they load 15 megabytes of Javascript libraries to call two functions, or something to that effect. Computers are so fast now that you really need to be doing something unbelievably stupid for things in consumer programs to not be instantaneous.


Like start Firefox? I have no idea why it takes multiple seconds to bring up a blank window when starting a comparably useful program like Claws Mail is absolutely instantaneous.


To be honest, yes, generally programs that don't start instantaneously do something very unnecessary and stupid that has nothing to do with actual CPU performance. As in they read thousands of tiny files or they decompress files or they wait for some sort of answer from the network, or they make 10k+ expensive API calls or all of the above. Firefox is so big it might fall into the "all of the above" category, but to answer that question definitively you'd have to analyze what Firefox actually does.

And of course, making the decompression 15-20% faster by optimizing the decompression code (which is usually not even written by the developers of said software but just some external library) won't even make a difference because 20% less than 5 seconds is still 4 seconds which is way too long for a program to start. Instead using a different compression algorithm that increases file size by 25% but decompression speed by 10x would actually start solving the problem, with the next step being to ask why the program needs to read so much damn data at the start in the first place.

But since NVMe SSDs and Intel CPUs with very high boost clocks are quickly becoming the norm now even for laptops I don't see much of that happening, because Firefox starts pretty quickly (~1 second) on those machines.


Fwiw, Firefox stores almost everything it will need on startup in one large file (omni.ja) precisely to avoid the "thousands of tiny files" problem. That data is uncompressed, precisely to avoid the decompression problem. The low-hanging fruit has largely been picked.

As for why so much data needs to be read... I just checked, and on Mac the main Firefox library (the executable itself is mostly a stub) is 120MB. So that's going to take a second or three just to read in at typical HDD speeds (faster on a good SSD), and then the dynamic linker has to do its thing on that big library, which is not instantaneous either.


How big is the Claws Mail binary? A large part of the cost of starting up Firefox, or other large programs, is just getting the binary from disk into memory plus all the work the dynamic linker needs to do once it's there.


Most engineers don't write code with hard performance constraints. Game devs probably need to be fighting to get every frame.

For the bulk of the eng I work with the concept of StoreLoad reordering on x86 would be an academic distraction.


> Most engineers don't write code with hard performance constraints.

Only because most software "engineers" don't give two shits about the actual user experience of their glacially slow over-engineered garbage.


99.999% of performance issues in development are related to the abstract model, and not the underlying implementation details. Things such as searching in a big unordered list, repeating the same work, stupid SQL queries ect.


The way I usually follow it is 1) Is this OK? For example It is only called once in the code in an error path? 2) Did I do something silly? For example, I left in some extra debug code. 3) is the code doing something silly? For example extra work, dragging in extra data that is not needed? 4) is the code written in a way that causes extra work, re-init on inner loop, loop hoisting, etc 5) Is there a better algorithm for this? Binary search vs linear? 6) is the code doing too many things at once. De-serialize it re-serialize it 7) Is there a better abstraction for this? Monolith code vs microservice? 8) Is there any compiler switches that are 'easy' wins? Packing, -O2, etc? Usually not. 9) What sort of memory arch does this machine have? It varies between machines even for the x86 world. For example if I rearrange my memory structures do I use less cache and emit less code. The cache line on this box is x bytes and on that box is y bytes. Some languages do not lend themselves well to this one due to the way their VM/ISA is written. 10) is there some asm that would help? usually not.

Usually 1-7 are all you need. If you get all the way to 10 you are in the deep end for most things.

Big O is good for many things. But in reality big O is O(N)+C where the C can get you. That is where the later steps help. But usually you can totally ignore it. Most of the big wins I get are just from flipping out a bad search for a O(log(n)) search, or removing 'extra' code.


On the BigO notation, you have to remember is that it's an upper bound on the runtime. There is no guarantee other than the growth of the function.


Oh I agree. It is just that +C bit. What happens when your O(nLog(n)) basically just trashes the cache because of your data structures? Yet maybe something 'worse' would run better because of the way cache works? That +c bit can sometimes turn into its own big O. It is a nasty little gotcha when it does happen.

It is a good framework to get you in the ballpark of the correct thing. Even usually 99% of the time it is right. But sometimes the arch bites back due to your data.


The worst piece of code I had to optimize was running a database search with an O(n^3) algorithm.

Fixing that did not require knowledge of cache lines.


No, because most software development houses prioritize getting something that works over performance wankery.


Performance is a feature.


But not a feature a substantial number of paying customers care about.


We'll never know if that's true, because they were never given the option.


The problem is you never know when you’ll go from “dev who doesn’t care” to “dev who does”.

While I agree that the details of StoreLoad are likely a distraction the big picture concepts of cache coherence presented in this article are table stakes for performant systems.


The problem is you never know when you’ll go from “dev who doesn’t care” to “dev who does”.

This is true for literally every hard problem in dev though, and the implication that you need to grasp everything just in case you need it is silly. The problem space in compsci is too big to know everything. We have to choose.


Part of the reason this kind of blogs exist is to give you more information to choose properly.


While that might be true in practice, I do think not knowing when you transition between those two indicates a failing by your coworkers/mentors.

I encourage people I work with to read many of the books in this book series. I particularly encourage them to read “Hardware and Software Support for Virtualization”, since it’s basically a book on their job.


Just to be clear, the book series you are referring to is "Synthesis Lectures on Computer Architecture" published by Morgan-Claypool, right?

If you had to rank them in importance for the average engineer, how would you rank them?


Money is another way to measure performance. People that understand these low level principles can write a backend service in C/C++/Java/Erlang/Whatever that can do the work of hundreds of instances of a service written in higher level languages that ignore these concepts. That's not to say that it's never wise to use higher level languages, but these principles can and do save mature established businesses real money.


True, but if my bad python implementation can do the job on one old single core computer with plenty of CPU left over what do I care - in fact because python is generally more productive for small programs it is a lot cheaper. Most programs don't actually handle so much data that they need every bit of performance.

Of course the minority that do need performance are where the real engineers are needed.


I'm writing a little C++/Qt application to visualise a fairly small amount of data.

The lag is already noticeable. If I don't pay attention to performance, I know the end result will be slow and unpleasant to use.


That is true, but you are likely better off using a library than focusing yourself on cache optimization.


Using a library (specifically, taking advantage of Qt's QVariant and view/model framework), is what triggered most of the slowdown.


I was just in a room with a guy who claims he "takes pride in his work" so everything has to be as fast as technically possible., use the least RAM down to KBs, never write to a disk if possible, etc. The problem is the schools. Academics have no concept of who is paying for everything around them its like asking a welfare recipient to teach you about budgets.


I disagree on this. More programmers should understand the performance characteristics of the abstraction layers that they rely on. Else we have the case of jerk programmers who insist on redesigning / rewriting / refactoring to optimise for CPU caches while still running the apps via a bunch of docker containers each based of the centos image to run one simple binary that probably needs only glibc.


Maybe but it feels like most are stuck in environments which will do bad things to their cache coherence. It's fine if you are doing some data processing in C. If you are using .NET with a bunch of magic libraries or Javascript or whatever, sure it will help, but to actually make an impact you have to be very careful.


I agree with you Jonathan but am wondering, will Jai help programmers write programs with better cache coherency even if said programmers don’t understand cache coherency well? Or is that orthogonal to the goals of Jai?


Like any reasonable language, I imagine JAI will allow a developer to write software with the cache in mind. It will not force them to, nor will it prevent them from doing otherwise.


If it still plans on wrapping SOA in something that looks in use like AOS, it could make people less aware of how their code is impacting cache (would now have to look at the definitions instead of seeing the array form at point of usage). But if it is enough more ergonomic to write AOS code it might still be worth it and increase uptake.


The developer would still need to understand enough to be able to choose between them though, even if using them is identical.


Only if he releases it


Could you explain what the DEC Alpha did differently here?

It was before my time :)


It's not what DEC Alpha did, but what it didn't do. ;-)

The issue that comes up on Alpha is this code:

  thread1() {
    x = …; // Store to *p
    release_barrier(); // guarantee global happens-before
    p = &x; // ... and now store the p value.
  }
  
  thread2() {
    r1 = p; // If this reads &x from thread1,
    r2 = *r1; // this doesn't have to read the value of x!
  }
The Alpha's approach to memory was to impose absolutely no constraints on memory unless you asked it to. And each CPU had two cache banks, which means that from the hardware perspective, you can think of it as having two threads reading from memory, each performing their own coherency logic. So you can have one cache bank reading the value of p who, having processed all pending cache traffic, saw both the stores, and then you can turn around and request the other cache bank to read *p who, being behind on the traffic, hasn't seen either store yet.

Architectures with only one cache bank don't have this problem. Other architectures with cache banks feel obligated to solve the issues by adding extra hardware to make sure that the second cache bank has processed sufficient cache coherency traffic to not be behind the first one if there's a dependency chain crossing cache banks.


So on every other platform than alpha thread2 works correctly without any barrier?

Does this mean when you use double-checked locking on p on non-alpha systems, you do not need any kind of synchronization on the fast path where p is initialized?

So this would be correct?

    if (!p) {
      T *x = new T;
      release_barrier();
      if (!compare_and_swap(p, 0, x))
        delete x;
    }


IMO, the various blogs and tutorials out there that help to make sense of the C++11 memory model make better tutorials than the Linux kernel's own shenanigans.

The C++11/C11 memory model added memory_order_consume specifically to support the Alpha.

https://preshing.com/20140709/the-purpose-of-memory_order_co...


> The C++11/C11 memory model added memory_order_consume specifically to support the Alpha.

yes and no. Alpha is relevant because it is the only architecture where consume requires an explicit barrier, but then again, I think the revised C++11 memory model might not even be fully implementable on Alpha;

Consume primarily exist because acquire is very expensive to implement in traditional RISCs like POWER and 32 bit ARM, while consume can be implemented with a data or control dependency. Aarch64 has efficient acquire/release barriers, so it is less of an issue.

/pedantic


No, that's not pedantic, its a fair point.

acq/rel remains efficient on a platform that doesn't need barriers to implement it (ie, x86).

consume remains efficient on a platform that doesn't need barriers to enforce a dataflow dependency (ie, ARMv7).


C++11/C11 are not implementable on Alpha CPUs that don't have BWX extensions, BTW.

This is because it requires individual bytes to be accessible in atomic way between multiple threads, which without BWX is not possible on Alpha, which started out with minimum 32bit read/writes.


yes, but even beyond that, I think think there were changes and/or clarifications to the semantics of releaxed atomics that made the straightforward no-barriers-at-all implementation on alpha non-conformant strictly speaking. But I might be misremembering.


memory-barriers.txt from Linux is a lovely way to be introduced to memory barriers in general, and the Alpha memory model in particular.

https://github.com/torvalds/linux/blob/master/Documentation/...


Man I just LOVE how the linux kernel can have such detailed and important information in a simple TEXT file. Fantastically functional and prove they focusing on the right stuff.

At my previous company. I would have to spend the better part of a day to get my "documentation" in the "correct" Confluence Style And Manner. They were adamant THAT'S were the value is, to have documentation is the most beautiful and absurd style" and double-linked structure. You would have to block out a day or two in your scrum(what nonsense) just to focus on your documentation. And this is not some sort of important* software like Linux or Banking... but a stupid website.


You make it sound like CPU caches are the only caches around. I deal with higher-level caching a lot, and I'm not writing an OS. Is your book recommendation still useful for me?


A good, introductory, high-level overview of what is going on with cache coherence... albeit specific to x86.

ARM systems are more relaxed, and therefore need more barriers than on x86. Memory barriers (which also function as "compiler barriers" for the memory / register thing discussed in the article) are handled as long as you properly use locks (or other synchronization primitives like semaphores or mutexes).

Its good to know how things work "under the covers" for performance reasons at least. Especially if you ever write a lock-free data-structure (not allowed to use... well... mutexes or locks), so you need to place the barriers in the appropriate spot.

------

I think the acquire/release model of consistency will become more important in the coming years. PCIe 4.0 is showing signs of supporting acquire/release... ARM and POWER have added acquire/release model instructions, and even CUDA has acquire/release semantics being built.

As high-performance code demands faster-and-faster systems, the more-and-more relaxed our systems will become. Acquire/release is quickly becoming the standard model.


I don't see anything x86 specific on the article. It focuses on cache coherency, which is applicable to ARM and POWER, and there isn't much about memory model.

Even the description of MOESI is just an introduction and, as the article mentions, actual systems use more complicated protocols.

Edit: if anything, the misconception is that memory barriers have anything to do with cache coherency.


> ARM and POWER have added acquire/release model instructions

They have implemented the acquire-release consistency model since day one (or, the day they started supporting multi-processors). Yes, there are some subtleties there that have in some cases been tightened later on, e.g. multi-copy atomicity.


IIRC, there was a big stink because ARM and POWER historically implemented consume/release semantics, which is very slightly more relaxed than the now de-facto standard acquire/release semantics.

ARM and POWERPC CPU devs worked very hard to get consume/release into C++11, but no compiler writer actually implemented that part of the standard. As such, consume/release can be safely forgotten into the annals of computer history (much like DEC Alpha's fully relaxed semantics)

Then in ARM8, ARM simply added LDAR (Load-acquire) and STLR (Store-release) instructions. https://developer.arm.com/docs/100941/0100/barriers . So the ARM CPU how fully supports the acquire/release model. Apparently IBM's POWER instruction set was similarly strengthened to acquire/release (either POWER8 or POWER9).

ARM / POWER "normal" loads and stores are still consume/release semantics. But compilers can simply emit LDAR (load-acquire) for the stronger guarantee.

----------

I remember at least one talk that showed that consume/release is ideal for things like Linux's RCU or something like that (that acquire/release is actually "too strong" for RCU, and therefore suboptimal). But because compiler-writers found consume/release too hard to reason about in practice, we're stuck with acquire/release.

It seems like the C++ standard continues to evolve to push for memory_order_consume (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p075...), but all the details are still up for discussion.


> ARM and POWERPC CPU devs worked very hard to get consume/release into C++11

AFAIR Paul McKenney was the primus motor, and the motivation was largely RCU. Then again, McKenney also worked for IBM at the time and certainly had an interest in pushing a model that mapped well to POWER.

But it turned out to be both somewhat mis-specified and hard to implement cleanly, so most compilers just implemented it as an acquire.

As you mention, there is ongoing work to fix it.

As for ARM, it seems the big thing they've done since the initial release of ARMv8 is to banish non-multicopy atomicity. See https://www.cl.cam.ac.uk/~pes20/armv8-mca/armv8-mca-draft.pd...


> As high-performance code demands faster-and-faster systems, the more-and-more relaxed our systems will become. Acquire/release is quickly becoming the standard model.

Linux relies heavily on performant RCU for scalability, which a pure acquire/release SW programming model can't support.


There must be some kind of communication error going on. I don't know much about RCU, so I just pulled up this webpage:

https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt

In it is:

> The rcu_read_lock() and rcu_read_unlock() primitive read-acquire and release a global reader-writer lock.

Seems like RCU-operations in the Linux kernel are defined in acquire-barrier and release-barrier terms. I heard a while ago that RCU could be discussed in terms of release-consume semantics (which are slightly faster but harder to understand...) but very few people understand release-consume.

As such, release-acquire is probably the memory model of the future. I'm not really aware of anything aside from: Fully Relaxed (unordered), the obscure release-consume, release-acquire, and finally sequentially consistent (too slow for modern systems)

---------

Are you perhaps confusing "acquire-release" semantics (which is a memory-barrier / cache coherence principle) with spinlocks perchance? Acquire-release seems to be the "Fastest-practical" memory consistency model. (Since Relaxed doesn't work, and release-consume is too confusing)

For more info on acquire-release, Preshing's blogposts are great: https://preshing.com/20130922/acquire-and-release-fences/


> I'm not really aware of anything aside from: Fully Relaxed (unordered), the obscure release-consume, release-acquire, and finally sequentially consistent (too slow for modern systems)

What about Total Store Ordering (TSO), which is what e.g. the obscure and rare x86(-64) architecture implements (and SPARC as well)?


That is a good point: the x86 model is "stronger" than acquire-release. Which is probably why it took so long for acquire-release to become beneficial. Any x86 coder who codes in acquire-release will not see any performance benefit on x86, because x86 implements "stronger guarantees" at the hardware level.

Well, that is until you enable gcc -O3 optimizations, which will move memory around, merge variables together, and other such optimizations that will follow the acquire-release model instead of TSO. Remember that the compiler has to consider the memory-consistency model between registers and RAM (when are registers holding "stale" data and need to be re-read from RAM?)

-------

The thing is, acquire-release is becoming far more popular and is the golden-standard that C++11 has more or less settled upon. C++11, ARM, POWER9, CUDA, OpenCL have moved onto acquire-release semantics for their memory model.

Next generation PCIe 5.0, CXL, OpenCAPI, are all looking at extending cache-coherence out to I/O devices such as NVMe flash and GPUs / Coprocessors. I'm betting that Acquire/release will become more popular in the coming years. TSO is too "strict" in practice, people actually want their reads-and-writes to "float" out of order with each other in most cases, especially when you're talking about a PCIe-pipe that takes 5-microseconds (20,000 clock-ticks!!) to communicate over.


> That is a good point: the x86 model is "stronger" than acquire-release. Which is probably why it took so long for acquire-release to become beneficial. Any x86 coder who codes in acquire-release will not see any performance benefit on x86, because x86 implements "stronger guarantees" at the hardware level.

Yes, in a way it's a race to the bottom; code that works on TSO hw works on acquire-release hw, but not the other way around. There's only two ways to combat this race: education, and using concurrency libraries written by people who know what they're doing.

> acquire-release is becoming far more popular and is the golden-standard that C++11 has more or less settled upon

Hmm, how come? C++11 supports many different models, relaxed, acquire/release, and sequential consistency, with sequential consistency being the default for atomic variables. Now, acquire/release looks like a decent compromise between ease of hw implementation and programming complexity, but AFAICS it's not the anointed one true model.

To some extent I think that's a failing of the C++11 model. Instead of choosing one (sane) model, they made people choose between an array of models with subtle semantics. That's what the recent formal Linux kernel model did, although that's not ideal either, with the requirement to not be too different from the previous informal description and boatloads of legacy code. See http://www0.cs.ucl.ac.uk/staff/j.alglave/papers/asplos18.pdf

In general, it seems to me that progress is being made in formal memory models, and I hope that in some years time there will be some kind of synthesis giving us a model that is both reasonably easy to implement in hw with good performance, easy enough to reason about, as well as formally provable. We'll see.


> Hmm, how come? C++11 supports many different models, relaxed, acquire/release, and sequential consistency, with sequential consistency being the default for atomic variables. Now, acquire/release looks like a decent compromise between ease of hw implementation and programming complexity, but AFAICS it's not the anointed one true model.

Well, nothing will ever be "officially" blessed as the one true model. As the saying goes: we programmers are like cats, we all will be moving off in our own direction, doing our own thing.

Overall, I just think that "programmer culture" is beginning to settle down on Acquire-release semantics. Its just a hunch... but more-and-more languages (C++, CUDA), and systems (ARM, POWER, NVidia GPUs, AMD GPUs) seem to be moving towards Acquire-release.

And in the next few years, we'll have cache-coherency over PCIe 4.0 or PCIe 5.0 in some form (CXL or other protocols on top of it). A unified memory model across CPU, DDR4 RAM, the PCIe-bus, and co-processors (GPUs, FPGAs, or Tensor cores), and high-speed storage (Optane and Flash SSDs over NVMe) is needed.

The community is just a few years out from having a unified memory model + coherent caches across the I/O fabric. Once this "defacto standard" is written, it will be very hard for it to change. That's why I think acquire-release is here to stay for the long term. Its the leading memory model right now.


Keep in mind that even when the underlying hardware implements TSO, what programming languages expose is basically the release/acquire model of memory semantics.

This means that as a programmer, you still have to code against the release/acquire model because the compiler may reorder your memory accesses. Having TSO in hardware is still helpful though, because it means the compiler has to emit fewer explicit barrier instructions at the end. That is, the barriers that you do have in your original code end up being a little bit cheaper (at the cost of having an overall more complex hardware architecture).


In TSO every store is a release and every load is a acquire, so it maps very efficiently to the acquire/release model.


Sure, since it's a stronger model, so code which works on weaker acquire/release hw will work on TSO hw as well. You might as well say that sequential consistency maps very efficiently to a acquire/release model too in the same way.


Isn't TSO the closest practical implementation to a acquire/release model? What are the practical differences?

I know that TSO allows more easily to recover sequential consistency with additional barriers (Intel strengthened their original memory model to TSO for this reason).


No, acquire/release is a much weaker model than TSO. TSO is basically sequential consistency (SEQCST) + a store buffer. I.e. stores don't go immediately to main memory, but rather via a store buffer. Loads first peek into the store buffer of the local CPU core before going to memory. The practical effect is that in contrast to SEQCST stores may be reordered after loads ("store->load" reordering more formally).

Acquire-release consistency allows many more reorderings in addition to store->load (load->load, load->store, store->store).

For more info see e.g. Table 5 in http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2...


One thing not mentioned here (nor in previous discussions of the article, it seems) is that DMA is typically not coherent with the CPU caches. This is kinda visible from the little diagram at the top, with the disk sitting on the other side of the memory, but it should be explicitly spelled out. If you're using a DMA device (memory<->device or memory<->memory copies), you might end up in a state where the DMA and the CPU see different values. This usually means data transfer to/from a Disk or GPU, though other peripherals might use it too.

Your options here are either to manually invalidate your caches and synchronize with the DMA (e.g. via interrupts), or to request from the OS that the given memory section be entirely uncached; or in some cases, you can get away with a write-through cache policy, if the DMA is only ever reading the memory.


I think DPDK does some user-level trickery to achieve per-core caching through DMA, do you happen to know how they go about it?


For those interested (in 2014!) I did a rather simple analysis of CPU caches and for loops to point out some pitfalls:

https://austingwalters.com/the-cache-and-multithreading/

Hope it helps someone, I tend to link it to my co-workers when they ask me why I PR'd re-ordering of loops & functions OR when they ask how I get speedups without changing functionality.


I've never heard of the MESI protocol before so that was really interesting to read, and I liked the comparison to distributed systems.

I'm wondering if the MESI protocol could be used in a networked database manner? I feel like you need master node though to coordinate everything though (like the L2 does in the example).


The essential point of MESI is that it doesn't need a single master. In a sense, anyone who has exclusive ownership of a cache line is the "master" for that one cache line.

The downsides of MESI are that (a) it requires broadcasts, which don't scale very well; and (b) it doesn't tolerate partitions -- which also imposes an effective scaling limit, since large systems are always partitioned (usually with a partition of N-k and k partitions of 1, due to k nodes having failed).


> The downsides of MESI are that (a) it requires broadcasts

No, it can be implemented with directory instead, e.g.

https://en.wikipedia.org/wiki/Directory-based_cache_coherenc...

Or various combinations of snooping and directories ("snoop filters", or directories that act as "bridges" between broadcast domains, etc.).

In current Xeon processors (and presumably AMD EPYC as well, thought I don't yet have first-hand experience with those), you have a couple of directories per CPU with snoop filtering, as with tens of cores broadcasting becomes a scalability bottleneck. In the BIOS you can change the mode how it operates, with slightly different names and semantics depending on the CPU generation.


More than just broadcasts as we think of them in networked systems, it requires serialized/transactional broadcasts. Network distributed systems also have problems with ordering which is something made easier when you have a single bus with all transactions in sequence.


> I'm wondering if the MESI protocol could be used in a networked database manner?

The short answer is yes, it can, though you tend to use key ranges (or a "predicate range" for generality) rather than cache lines and addresses.

Taken to its extreme it can produce extremely fast and versatile distributed databases to the extent that different nodes are accessing non-overlapping key ranges, or only reading from shared key ranges.

Neither a master nor broadcasts are really needed. (Though avoiding masters is quite complicated to do right; and a little trickle of broadcasting is often used, just for nodes to discover each other, and check they are still up, but not for every cache transaction.)

The general MESI-like pattern is probably used more commonly for coherent network filesystems than databases, though.

Many network filesystems uses "leases" (or "oplocks") on files or ranges, which are similar to the states in MESI if the filesystem is one of the good ones that is coherent.

This bit of Documentation for SMB, the Microsoft network filesystem, might remind you of the S and M/E states in MESI:

  · Read-caching lease: allows caching reads and can be *shared* by multiple clients.
  · Write-caching lease: allows caching writes and is *exclusive* to only one client.


Something I don't understand is how to deal with cache coherency when you need the same data in a bunch of different configurations.

Take a typical game loop and assume we have a list of Transforms (e.g. world matrix, translation/rotation/scale, whatever - each Transform is a collection of floats in contiguous memory)

Different systems that run in that loop need those transforms in different orders. Rendering may want to organize it by material (to avoid shader switching), AI may want to organize it by type of state machine, Collision by geometric proximity, Culling for physics and lighting might be different, and the list goes on.

Naive answer is "just duplicate the transforms when they are updated" but that introduces its own complexity and comes at its own cost of fetching data from the cache.

I guess what I'm getting at is:

1) I would love to learn more about how this problem is tackled by robust game engines (I guess games too - but games have more specific knowledge than engines and can have unique code to handle it)

2) Does it all just come out in the wash at the end of the day? Not saying just throw it all on the heap and don't care... but, maybe say optimizing for one path, like "updating world transforms for rendering", is worth it and then whatever the cost is to jump around elsewhere doesn't really matter?

Sorry if my question is a bit vague... any insight is appreciated


Assuming that once determined at the start of the frame (e.g. camera position changes after user input handling), the transform matrices are not written to. They can then be freely shared across multiple cores without causing problems with coherency. The cache lines associated with the transform will be set to 'Shared' across all cores. Cache coherency will start to bite you in the ass in this situation if you start mutating the transforms while other threads are reading it, causing cache invalidations and pipeline flushes across all caches owning those lines.

In short, write a transform once and treat it as immutable. Do not reuse the Transform allocation for a good while for subsequent frames to ensure that its cache lines are no longer in cache. If you do need to reuse right away, you can force invalidate cache lines by addresses, so that the single-writer in the next step is the single (O)wner and no other caches need to invalidate anything.


Thanks - I'll have to do a bit more learning to really understand this, e.g. how mutability relates to cache lines and what "Shared" means in that context, but this gives me some good practical direction and insight to take it further :)



This might be a naive question: how did we decide as an industry that cache should be controlled by the hardware, but registers and main memory by the compiler?


Just a guess here, but I'm thinking it has something to do with the registers being an internal core component of the CPU that have been there in some form the whole time whereas cache was a little more of an afterthought when CPUs and other processing units stopped being the bottleneck and started to significantly outstrip the performance of (practical/affordable) memory and memory controllers. It used to be that the memory sub-systems had artificial wait states as the processors could not keep up otherwise, rather than processors siting around waiting for responses.

Also, cache is essentially optional, and its configuration (not just sizes, but how it is shared amongst cores and other units, speeds relative to other memory levels, even how many levels there are (and each level can have different properties) how cache rows are arranged and mapped, their size, ..., etc.) can and will vary between otherwise identical looking systems. If you are compiling to optimise for cache use you end up either having to JiT compile, or compile several versions of some routines and include them all so which one is used can be chosen at run-time, or have different versions of the whole compiled output for different systems. All of those things happen at times anyway for other reasons, but presumably the overall pay-off of doing the same for cache variances isn't high enough for it to be worthwhile building into general purpose compilers (though the cases for/against this sort of work in domain specific compilers and other tool chains may be quite different).

Some designs argue that we shouldn't need to care about the implementation details of any memory let alone L1/2/3/? cache - just access storage and let the OS & hardware make use of the faster memory levels it has access to as it sees fit to optimise that storage access.


The wonderful thing about caches is that they are invisible to the programmer; the designer can add bigger caches, more levels of caches, different configurations of shared/private/inclusive/exclusive caches... and it Just Works.

However, there are many machines, which instead of using caches, use programmer-visible “scratchpads”. There are many reasons for this, but two big reasons are performance-predictability (real-time systems) or to avoid the hw complexity of cache management.

In general though, scratchpads are hugely painful to program for and terrible for code portability.


What do you mean by 'controlled by hardware'? The registers the compiler chooses for you are an abstraction themselves, one of the first things a CPU does is register renaming. Same for main memory, you are presented mostly an abstraction of main memory, with a ton of layers in-between (load/store buffers, write-combine buffers, coherent caches, etc)

Turning the argument the other way, you as a programmer can control a lot about caches: you can prefetch cache lines, invalidate them and use streaming instructions to bypass them.


What you said makes sense. The question is more from the perspective of the compiler/assembler. At these layers you explicitly say, move this number to this register, or read a number from this address. But you very rarely are able to say, copy this chunk of data into this cache line. Sure there are exceptions (like the GPU), and there are cases where you can hack it to do what you want. But in general you don't get to control the cache in a very specific way.


Anyone else start thinking of Rust's mutable/immutable borrow system when reading the MESI algorithm? It's not quite the same - with Rust, mutable borrows are never shared, and you can never mutate a shared read-only borrow - but the principle seems like a simplification of the full MESI protocol.

It seems like this would be generally applicable for a wide variety of distributed & concurrent applications.


Keeping the directory coherent is the difficult part when translating directory-based cache coherence protocols to other distributed systems problems. The directory is like an oracle that sees every transaction in order. This is hard in most network distributed systems problems where you have to worry about availability, network partitioning, or durability of this node.


Indeed, the evolution will probably in the other direction, with the on chip network adopting algorithms from wider networks to deal with scaling problems. DRAM interfaces use to be pretty simple, now they are being trained almost like a DSL line.


AMD's Rome processors are an interesting case, with 8MB of L3 cache per core. So the 64 core processor has 512MB of L3 cache. It wasn't that long ago that 512MB was a respectable amount of DRAM in a big server. An early 90's fridge sized Sun 690MP maxed out at 1GB of DRAM and had 1MB of L2 cache, no L3.


Half that - 4 MB per core so the 64 core CPU has 256 MB (dual socket is where the 512 number comes from but that's 128 cores and NUMA).

It's also not fully accessible, each core can only directly access the 16 MB in its group of 4. Everything else is the same as a cross cache read.


Ah, yeah. Mixed up their CCD and CCX terms. The 690MP was dual socket though, so still a somewhat valid comparison.


> It wasn't that long ago that 512MB was a respectable amount of DRAM in a big server.

Whereas today, 512MB is a bare minimum amount of DRAM in a general-purpose desktop. Times change.


I don't think you can run a modern Linux or Windows 10 desktop on 512 MB RAM. Even my slim Linux desktop (no DE) consumes about 400 MB of RAM after login. Web-browsing with less than ~2 GB of memory doesn't seem feasible.


> “different cores can have different/stale values in their individual caches”.

Different processes can certainly have different versions of the same state, different values for the same variable, and different values at the same virtual address.

And what about virtual caches? Non-coherent cache lines?

Moreover, even in the face of cache coherency you can still have race conditions.


> Different processes can certainly have different versions of the same state, different values for the same variable, and different values at the same virtual address.

what do you mean? Either two caches agree on the content of a cacheline or one of the cacheline is marked invalid (and the stale content is irrelevant). There are components of a core that might not respect coherency, like load and store buffers and arguably registers, but not caches (on cache-coherent systems of course).

Virtually addressed caches are an issue and that's why they have fallen out of favor.


The one-line summary seems to be that one should never worry about caches themselves introducing concurrency bugs.

I mean after we account for memory operations reordering on each core, the memory address storing a single value that is visible to all cores is a correct model from the concurrency-correctness point of view, right?


On x86, the correct model is that on any core, all reads are in order, all writes are in order, and no write will ever be moved earlier than a read on that core.

Or in other words, the only kind of visible reordering that is allowed to occur is that writes can be delayed past reads.

An example of a situation where this is significant:

    thread 1       thread 2
    mov [X], 0     mov [Y], 0
    mov [X], 1     mov [Y], 1
    mov r1, [Y]    mov r2, [X]
    
After this sequence of code, r1 == r2 == 0 is legal. (As is any other combination of 1 and 0.)

(edit:) And just to add, all this reordering is of course impossible to detect on just one core, as when a read request hits a recent write on the same core, it reads it out of the store queue. This can sometimes be really bad for performance, though, as if you read a value that is partially in the store queue (such as, write 16-bit value to x, the immediately read 32 bits from x), some cpus will stall that read, and all that follow it, until the entire store queue is flushed. Since the store queue can easily take tens if not hundreds of cycles to clear, this can be very expensive.


It's worth pointing out that the L1 cache and its associated logic is the only* way a core talks to the outside world, including all I/O ever. With that in mind it is easy to understand why it is so crucial to performance.

* there might be some minor exceptions


The biggest myth is that any of this matters to anybody but a tiny fraction of niche programmers.

The reason some "app" is slow is not because of cache coherence traffic. It's because somebody chose the wrong data structure, created some stupid architecture, wrote incomprehensible code that the next guy extended in a horribly inefficient way cause they didn't understand the original. My web browsing experience is slow because people include heaps of bloated JS crap and trackers and ad networks so I have to load 15 megabytes of nonsense and wait for JS to render stuff I don't want to see. None of this was any better if anybody involved understood CPU caches better.

Even in the kernel or HPC applications, most code is not in the hot path. Programmers should rather focus on clean architectures and understandable code. How does it help if you hand-optimize some stuff that nobody understands, nobody can fix a bug, nobody can extend it. That's horrible code, even if it's 5% faster than what somebody else wrote.

TL;DR: This is interesting, but likely totally irrelevant to your day job. In the list of things to do to improve your code, it comes so far down that you'll never get there.


Half a decade - woow! Sounds like the author has spent half a century there..


the entire design of unix is realized in a moment when a motherboard is seen as a network.


Should include tl;dr your concurrency fears are real, but for registers and not caches.


If you have two threads reading then writing values in memory, you still need synchronization/atomic changes at the software level.


Note that this is quite specific to x86, on other architectures like Power there are much weaker guarantees that will lead to problems when assuming the same model.


So another myth for this list is "caches are at all similar between architectures"?


which part is x86 specific?


To quote from https://fgiesen.wordpress.com/2014/07/07/cache-coherency/

"Memory models

Different architectures provide different memory models. As of this writing, ARM and POWER architecture machines have comparatively “weak” memory models: the CPU core has considerable leeway in reordering load and store operations in ways that might change the semantics of programs in a multi-core context, along with “memory barrier” instructions that can be used by the program to specify constraints: “do not reorder memory operations across this line”. By contrast, x86 comes with a quite strong memory model."


Yes, I know the difference between the memory model of x86 and, say, ARM. I'm asking what's x86 specific on this article about cache coherency.


The article explicitly mentions two times things that are only true for x86 (grep for it). In addition, the statement at the end is definitely not true for POWER: "As soon as the data is read/written to the L1 cache, the hardware-coherency protocol takes over and provides guaranteed coherency across all global threads. Thus ensuring that if multiple threads are reading/writing to the same variable, they are all kept in sync with one another."


Those two things are not x86 specific (the author only gives x86 an example). And the statement you quote is certainly true for POWER or any other cache coherent architecture.


Sorry you are right, I confused this with reordering.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: