If cache coherence is relevant to you, I strongly recommend the book “A Primer on Memory Consistency and Cache Coherence”. It’s much easier to understand the details of coherency from a broader perspective, than an incremental read-a-bunch-of-blogs perspective.
I found that book very readable, and it cleared up most misconceptions I had. It also teaches a universal vocabulary for discussing coherency/consistency, which is useful for conveying the nuances of the topic.
Cache coherence is not super relevant to most programmers though. Every language provides an abstraction on top of caches, and nearly every language uses the “data race free -> sequentially consistent”. Having an understanding of data races and sequential consistency is much more important than understanding caching: the compiler/runtime has more freedom to mess with your code than your CPU (unless you are on something like the DEC Alpha, which you probably aren't).
If you are writing an OS/Hypervisor/Compiler (or any other situation where you touch asm), cache coherence is a subject you probably need a solid grasp of.
“A Primer on Memory Consistency and Cache Coherence” is part of the "Synthesis Lectures on Computer Architecture" which are 50-100 page booklets on topics related to HW components. All the booklet PDF's are available online [1].
edit: only those PDF's with a checkmark are available as PDF to download, the rest can be bought. Quite a few actually available for download.
Are the details really helpful for performance work? MOSI, MOESI, MERSI, MESIF etc are 99% irrelevant to having the right metal model. "Dirtying cache lines across different cores/threads is slow" is most of what you need to know. Within the same core the coherency protocols between levels of cache is not really visible at all to software even as varying performance artifacts.
Of course you might end up analyzing assembly level perf traces in a hot path for some game console with a less known CPU architecture and making a cross cpu cache miss slightly less slow could just maybe be helped by the detailed understanding of the machine model, but by that time you're already far in the not-giant-turd territory (at least if you're optimizing the right thing).
Of course computer architecture is fascinating and fun to learn about.
> Of course you might end up analyzing assembly level perf traces in a hot path for some game console with a less known CPU architecture
Or if you work in finance, mining, oil industry, medical, biosciences and countless other fields where you need to get good performance out of your hardware. Yes, even if you use GPUs, they're no magic bullet, they also have their architectural bottlenecks, strengths and weaknesses.
Or if you care about power consumption.
There are a lot of reasons to optimize hot paths and inner loops. CPU single core performance isn't improving much anymore, and we need to make better use of what we have.
You could just as well for example say that front end Javascript developer almost never needs to understand event callbacks or how DOM works.
If you write multithreaded high performance code, yeah, you do need to know about cache coherency at varying levels of detail. Sometimes rough rules of thumb work, not too often you need to understand all those annoying performance destroying details that leak through cache abstraction.
DOM and event callbacks are basic bread and butter for JS developers and in web apps it's essential to understand how DOM works and how to minimize DOM changes when doing performance work. Can you come up with an example where a oil or finance sector developer needs to understand MESIF?
Or are we having a terminology confusion and you use the coherency term for software visible performance characteristics of caches generally? I do agree that understanding cache effects and their intersection with multiprocessing is generally important in perf work. As is understanding the architected memory model, which tells you what you can and can't rely on semantically.
> Can you come up with an example where a oil or finance sector developer needs to understand MESIF?
Yes. When they're writing high performance multithreaded analysis software and they're deciding which cache lines to write to and which only to read from. Those lines you only read from can be in S (shared) or F (forward) state.
And why is this important? Performance characteristics of a line entering in M (modified) state are pretty bad, if the line is shared between multiple cores.
Perhaps you'll also want to do all reads at once, knowing the line will more likely remain in F state for short periods, instead of bouncing S/F line state between CPU cores?
NUMA comes also to play, making this mistake even more costly. You really want to keep inter-core (especially NUMA socket!) communication to the minimum.
Of course, you could say you don't strictly need to understand MESI (or MESIF), but it really helps understanding why you do things certain way and reasoning about multi-core performance. The thing is, you can say same "you don't need to know this" about a lot of "low level details" in software trade.
Just like you need to understand DOM as a front-end developer to minimize DOM changes, even if you don't access it directly.
In cache coherency case it's analogously about reducing unnecessary multicast and broadcast messages sent between cores.
Thank you for writing a scenario and how it relates to coherency mechanisms.
So now we get to thinking about whether this gives the developer an advantage over just knowing "Dirtying cache lines across different cores/threads is slow". I don't think I would conclude so here.
But yeah I like reading details about microarchitectural details and other computer architecture topics, and am symphatetic to the point of view that knowing the "why" is nice. Just like I find it interesting to read about how DOM APIs are implemented in browsers and why they are hard to make faster...
> So now we get to thinking about whether this gives the developer an advantage over just knowing "Dirtying cache lines across different cores/threads is slow". I don't think I would conclude so here.
The hidden thing behind all this is that even if the data is just read-shared, it can still generate traffic between cores and sockets.
Since these communication links are a shared resource [0], doing things wrong hurts performance in unrelated code and cores. Just because of storm of cache coherency packets is being sent between cores.
So yeah, you really do want to minimize this to maximize performance and scalability across the whole system!
[0]: In Intel's case, this shared resource is ring bus inside CPU socket and QPI between CPU sockets.
True, high validation traffic of read-shared data can indeed have effects in some cases, especially on multi socket systems, and it can sometimes be beneficial to have private copies of even read-only data for different threads if the data is small.
>> Dirtying cache lines across different cores/threads is slow
Very succinct and correct. This is particularly pernicious with atomics, which people use for lock-free stuff such as queues and work stealing thread pools. If atomics used by different threads/cores share a cache line and you mutate them a lot, perf gets instantly fucked, sometimes worse than if you used a mutex in the first place. And if you don't benchmark, you aren't going to notice.
Surely the whole point of good system design is a set of logical abstractions, where you need to understand the logical model and not the internal details - as these are free to be evolved.
Of course performance matters, but surely having performance tests, rather than trying to second guess what the whole stack below you might be doing, is
1. more efficient
2. more accurate
3. more likely to detect changes in a timely way.
That's not to say, you shouldn't be curious and deep understanding isn't a good thing.
Just saying understanding inside-out the abstraction you are working with ( eg Java Memory Model ) it's performance characteristics ( from real world testing ) - is more important than some passing knowledge of real world CPU design.
This app I am using right now is in a webbrowser - not sure how understanding cache coherency helps in a single threaded javascript.
> Surely the whole point of good system design is a set of logical abstractions, where you need to understand the logical model and not the internal details
In my opinion an important part of being a good programmer is understanding - at least at a broad level - how the set of abstractions you're working on top of work.
At the end of the day, our job is making computer hardware operate on data in memory. The more that we forget that, and think about computing as some abstract endeavor performed in Plato's heaven, the more tendency we have for bloat and inefficiency to creep into our various abstractions.
In other words, I think it's better to think about abstractions as a tool for interacting with hardware, not as something to save us from dealing with hardware.
Eh, idk. Most programs that run like a giant turd do it because they load 15 megabytes of Javascript libraries to call two functions, or something to that effect. Computers are so fast now that you really need to be doing something unbelievably stupid for things in consumer programs to not be instantaneous.
Like start Firefox? I have no idea why it takes multiple seconds to bring up a blank window when starting a comparably useful program like Claws Mail is absolutely instantaneous.
To be honest, yes, generally programs that don't start instantaneously do something very unnecessary and stupid that has nothing to do with actual CPU performance. As in they read thousands of tiny files or they decompress files or they wait for some sort of answer from the network, or they make 10k+ expensive API calls or all of the above. Firefox is so big it might fall into the "all of the above" category, but to answer that question definitively you'd have to analyze what Firefox actually does.
And of course, making the decompression 15-20% faster by optimizing the decompression code (which is usually not even written by the developers of said software but just some external library) won't even make a difference because 20% less than 5 seconds is still 4 seconds which is way too long for a program to start. Instead using a different compression algorithm that increases file size by 25% but decompression speed by 10x would actually start solving the problem, with the next step being to ask why the program needs to read so much damn data at the start in the first place.
But since NVMe SSDs and Intel CPUs with very high boost clocks are quickly becoming the norm now even for laptops I don't see much of that happening, because Firefox starts pretty quickly (~1 second) on those machines.
Fwiw, Firefox stores almost everything it will need on startup in one large file (omni.ja) precisely to avoid the "thousands of tiny files" problem. That data is uncompressed, precisely to avoid the decompression problem. The low-hanging fruit has largely been picked.
As for why so much data needs to be read... I just checked, and on Mac the main Firefox library (the executable itself is mostly a stub) is 120MB. So that's going to take a second or three just to read in at typical HDD speeds (faster on a good SSD), and then the dynamic linker has to do its thing on that big library, which is not instantaneous either.
How big is the Claws Mail binary? A large part of the cost of starting up Firefox, or other large programs, is just getting the binary from disk into memory plus all the work the dynamic linker needs to do once it's there.
99.999% of performance issues in development are related to the abstract model, and not the underlying implementation details. Things such as searching in a big unordered list, repeating the same work, stupid SQL queries ect.
The way I usually follow it is
1) Is this OK? For example It is only called once in the code in an error path?
2) Did I do something silly? For example, I left in some extra debug code.
3) is the code doing something silly? For example extra work, dragging in extra data that is not needed?
4) is the code written in a way that causes extra work, re-init on inner loop, loop hoisting, etc
5) Is there a better algorithm for this? Binary search vs linear?
6) is the code doing too many things at once. De-serialize it re-serialize it
7) Is there a better abstraction for this? Monolith code vs microservice?
8) Is there any compiler switches that are 'easy' wins? Packing, -O2, etc? Usually not.
9) What sort of memory arch does this machine have? It varies between machines even for the x86 world. For example if I rearrange my memory structures do I use less cache and emit less code. The cache line on this box is x bytes and on that box is y bytes. Some languages do not lend themselves well to this one due to the way their VM/ISA is written.
10) is there some asm that would help? usually not.
Usually 1-7 are all you need. If you get all the way to 10 you are in the deep end for most things.
Big O is good for many things. But in reality big O is O(N)+C where the C can get you. That is where the later steps help. But usually you can totally ignore it. Most of the big wins I get are just from flipping out a bad search for a O(log(n)) search, or removing 'extra' code.
Oh I agree. It is just that +C bit. What happens when your O(nLog(n)) basically just trashes the cache because of your data structures? Yet maybe something 'worse' would run better because of the way cache works? That +c bit can sometimes turn into its own big O. It is a nasty little gotcha when it does happen.
It is a good framework to get you in the ballpark of the correct thing. Even usually 99% of the time it is right. But sometimes the arch bites back due to your data.
The problem is you never know when you’ll go from “dev who doesn’t care” to “dev who does”.
While I agree that the details of StoreLoad are likely a distraction the big picture concepts of cache coherence presented in this article are table stakes for performant systems.
The problem is you never know when you’ll go from “dev who doesn’t care” to “dev who does”.
This is true for literally every hard problem in dev though, and the implication that you need to grasp everything just in case you need it is silly. The problem space in compsci is too big to know everything. We have to choose.
While that might be true in practice, I do think not knowing when you transition between those two indicates a failing by your coworkers/mentors.
I encourage people I work with to read many of the books in this book series. I particularly encourage them to read “Hardware and Software Support for Virtualization”, since it’s basically a book on their job.
Money is another way to measure performance. People that understand these low level principles can write a backend service in C/C++/Java/Erlang/Whatever that can do the work of hundreds of instances of a service written in higher level languages that ignore these concepts. That's not to say that it's never wise to use higher level languages, but these principles can and do save mature established businesses real money.
True, but if my bad python implementation can do the job on one old single core computer with plenty of CPU left over what do I care - in fact because python is generally more productive for small programs it is a lot cheaper. Most programs don't actually handle so much data that they need every bit of performance.
Of course the minority that do need performance are where the real engineers are needed.
I was just in a room with a guy who claims he "takes pride in his work" so everything has to be as fast as technically possible., use the least RAM down to KBs, never write to a disk if possible, etc. The problem is the schools. Academics have no concept of who is paying for everything around them its like asking a welfare recipient to teach you about budgets.
I disagree on this. More programmers should understand the performance characteristics of the abstraction layers that they rely on. Else we have the case of jerk programmers who insist on redesigning / rewriting / refactoring to optimise for CPU caches while still running the apps via a bunch of docker containers each based of the centos image to run one simple binary that probably needs only glibc.
Maybe but it feels like most are stuck in environments which will do bad things to their cache coherence. It's fine if you are doing some data processing in C. If you are using .NET with a bunch of magic libraries or Javascript or whatever, sure it will help, but to actually make an impact you have to be very careful.
I agree with you Jonathan but am wondering, will Jai help programmers write programs with better cache coherency even if said programmers don’t understand cache coherency well? Or is that orthogonal to the goals of Jai?
Like any reasonable language, I imagine JAI will allow a developer to write software with the cache in mind. It will not force them to, nor will it prevent them from doing otherwise.
If it still plans on wrapping SOA in something that looks in use like AOS, it could make people less aware of how their code is impacting cache (would now have to look at the definitions instead of seeing the array form at point of usage). But if it is enough more ergonomic to write AOS code it might still be worth it and increase uptake.
It's not what DEC Alpha did, but what it didn't do. ;-)
The issue that comes up on Alpha is this code:
thread1() {
x = …; // Store to *p
release_barrier(); // guarantee global happens-before
p = &x; // ... and now store the p value.
}
thread2() {
r1 = p; // If this reads &x from thread1,
r2 = *r1; // this doesn't have to read the value of x!
}
The Alpha's approach to memory was to impose absolutely no constraints on memory unless you asked it to. And each CPU had two cache banks, which means that from the hardware perspective, you can think of it as having two threads reading from memory, each performing their own coherency logic. So you can have one cache bank reading the value of p who, having processed all pending cache traffic, saw both the stores, and then you can turn around and request the other cache bank to read *p who, being behind on the traffic, hasn't seen either store yet.
Architectures with only one cache bank don't have this problem. Other architectures with cache banks feel obligated to solve the issues by adding extra hardware to make sure that the second cache bank has processed sufficient cache coherency traffic to not be behind the first one if there's a dependency chain crossing cache banks.
So on every other platform than alpha thread2 works correctly without any barrier?
Does this mean when you use double-checked locking on p on non-alpha systems, you do not need any kind of synchronization on the fast path where p is initialized?
So this would be correct?
if (!p) {
T *x = new T;
release_barrier();
if (!compare_and_swap(p, 0, x))
delete x;
}
IMO, the various blogs and tutorials out there that help to make sense of the C++11 memory model make better tutorials than the Linux kernel's own shenanigans.
The C++11/C11 memory model added memory_order_consume specifically to support the Alpha.
> The C++11/C11 memory model added memory_order_consume specifically to support the Alpha.
yes and no. Alpha is relevant because it is the only architecture where consume requires an explicit barrier, but then again, I think the revised C++11 memory model might not even be fully implementable on Alpha;
Consume primarily exist because acquire is very expensive to implement in traditional RISCs like POWER and 32 bit ARM, while consume can be implemented with a data or control dependency. Aarch64 has efficient acquire/release barriers, so it is less of an issue.
C++11/C11 are not implementable on Alpha CPUs that don't have BWX extensions, BTW.
This is because it requires individual bytes to be accessible in atomic way between multiple threads, which without BWX is not possible on Alpha, which started out with minimum 32bit read/writes.
yes, but even beyond that, I think think there were changes and/or clarifications to the semantics of releaxed atomics that made the straightforward no-barriers-at-all implementation on alpha non-conformant strictly speaking. But I might be misremembering.
Man I just LOVE how the linux kernel can have such detailed and important information in a simple TEXT file. Fantastically functional and prove they focusing on the right stuff.
At my previous company. I would have to spend the better part of a day to get my "documentation" in the "correct" Confluence Style And Manner. They were adamant THAT'S were the value is, to have documentation is the most beautiful and absurd style" and double-linked structure. You would have to block out a day or two in your scrum(what nonsense) just to focus on your documentation. And this is not some sort of important* software like Linux or Banking... but a stupid website.
You make it sound like CPU caches are the only caches around. I deal with higher-level caching a lot, and I'm not writing an OS. Is your book recommendation still useful for me?
I found that book very readable, and it cleared up most misconceptions I had. It also teaches a universal vocabulary for discussing coherency/consistency, which is useful for conveying the nuances of the topic.
Cache coherence is not super relevant to most programmers though. Every language provides an abstraction on top of caches, and nearly every language uses the “data race free -> sequentially consistent”. Having an understanding of data races and sequential consistency is much more important than understanding caching: the compiler/runtime has more freedom to mess with your code than your CPU (unless you are on something like the DEC Alpha, which you probably aren't).
If you are writing an OS/Hypervisor/Compiler (or any other situation where you touch asm), cache coherence is a subject you probably need a solid grasp of.