That last part is important.
I have worked with many engineers whom I would even classify as hard-working, but who spent little to no time understanding the hardware they were running on and the possibilities it provided them.
I have heard "that's slow" or "that's good" too many times in performance talks that have completely ignored the underlying machine and what was possible.
Learning about how the CPU cache works is probably the most useful thing you can do if you write anything that's not I/O limited. There are definitely a ton of experienced programmers who don't quite understand how often the CPU is just waiting around for data from RAM.
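For a rough feel of how quickly memory stalls can dominate, here is a back-of-the-envelope sketch in Python. It assumes a deliberately simple model in which every cache miss stalls the core for the full penalty with no overlap, which real out-of-order CPUs partly hide. The function name `stall_fraction` and all the numbers are made up for illustration, not measurements from any real machine.

```python
def stall_fraction(loads_per_instr, miss_rate, miss_penalty_cycles, base_cpi=1.0):
    """Estimate the fraction of cycles spent stalled on memory.

    Simplified model (an assumption): each cache miss costs the full
    penalty, and misses never overlap with other work.
    """
    # Extra cycles per instruction caused by misses.
    stall_cpi = loads_per_instr * miss_rate * miss_penalty_cycles
    return stall_cpi / (base_cpi + stall_cpi)


# Hypothetical inputs: 30% of instructions are loads, a miss costs 200 cycles.
for miss_rate in (0.01, 0.05, 0.20):
    frac = stall_fraction(loads_per_instr=0.3,
                          miss_rate=miss_rate,
                          miss_penalty_cycles=200)
    print(f"miss rate {miss_rate:.0%}: roughly {frac:.0%} of cycles stalled")
```

Even under these toy assumptions, a miss rate of a few percent is enough for the stall term to outweigh the useful work, which is the point about the CPU "just waiting around".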
It is a shame that there are not better monitoring tools that surface this. When I use Activity Monitor on macOS, it would be useful to see how much of “% CPU” is just waiting on memory. I know I can drill down with various profilers, but having it more accessible is way overdue.
Digging around in Instruments is the opposite of accessible.
Every OS has always had easy ways to tell if a process is waiting on disk or network (e.g., top, Activity Monitor). The mechanisms for measuring how often a process is waiting on memory exist, but you have to use a profiler to get at them. We are overdue to have them more accessible. Think of a column after “% CPU” that shows the percentage of time blocked on memory.
I would do the same thing with the information I get from top and Activity Monitor: use that to guide me to what needs investigating.
I am often developing small one-off programs to process data. I then keep some of these running in various workflows for years. Currently, I might notice a process taking an enormous amount of CPU according to top, but it might really be just waiting on memory. Surfacing that would tell me where to spend my time with a profiler.
I’m having a very hard time imagining how you would go from a “percent time waiting on memory” to something productive without doing more work in between. Even assuming you’re dealing with your own, native code, the number tells you almost nothing about where the problem is. The only process I’ve ever seen working is “hmm I have a CPU-bound performance problem (as reported by e.g. Activity Monitor)” → “I used a profiler and the problem is here or it’s spread out” → “I used a specialized tool”.
My point is that this isn't how performance work is done. You first have to establish that the issue is CPU-bound before the question of whether it is memory-bound can even enter the picture. Time spent waiting for memory is accounted the same as any other CPU work, so it goes under that metric.
To make an analogy, this would be like adding a metric for function calls into Activity Monitor and using it to diagnose quadratic performance. You can't just take that number and immediately figure out the problem; you need to go look at the code and see what it's doing first and then go "oh ok this number is too high". The same applies to waiting for memory. What are you going to do with a number that says the program is spending 30% of its time stalled on loads? Is that too high? A good number? You need to analyze it in more detail elsewhere first.
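To make the analogy concrete, here is a minimal Python sketch with two duplicate checks that each report a raw operation count. The function names and input sizes are hypothetical, picked only for illustration. The count by itself does not tell you which one has the problem; you only see it once you relate the number to the input size and the code that produced it.

```python
def has_duplicates_quadratic(items):
    """Compare every pair; returns (found, comparisons made)."""
    comparisons = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            comparisons += 1
            if items[i] == items[j]:
                return True, comparisons
    return False, comparisons


def has_duplicates_linear(items):
    """Remember seen values in a set; returns (found, lookups made)."""
    seen = set()
    lookups = 0
    for item in items:
        lookups += 1
        if item in seen:
            return True, lookups
        seen.add(item)
    return False, lookups


for n in (100, 1000):
    data = list(range(n))  # no duplicates, so both do their maximum work
    _, quad = has_duplicates_quadratic(data)
    _, lin = has_duplicates_linear(data)
    print(f"n={n}: quadratic={quad} comparisons, linear={lin} lookups")
```

A bare counter reading means nothing until you know `n` and how the count grows as `n` changes, and that is the kind of context a profiler gives you and a single system-wide column does not.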
You’re really just making a case for firing up a profiler more often. That’s fine, I do that a lot. But what you’re looking for has no meaning outside of that context.
Instruments is not nearly good enough for any serious performance work. Instruments only tells me what percent of time is spent in which part of the code. This is fine for a first pass, but it doesn’t tell me _why_ something is slow. I really need a VTune-like profiler on macOS.
I’ve tried to use Instruments professionally, but always end up switching to my x86 desktop to profile my code, just so I can use VTune.
It’s missing any kind of deeper statistics such as memory bandwidth, cache misses, branch mispredictions, etc. I think fundamentally Apple is geared towards application development, whereas I’m working on more HPC-like things.
Have you tried using the performance counters? They've been useful in my experience, although I don't touch them often. Instruments is definitely not geared towards this since most application developers rarely need to do profiling at this level, but it has some level of this built in when you need it.
It’s only useful once you understand how algorithmic complexity works, how to profile your code, and how your language runtime does things. Before that, your CPU cache is largely opaque, and trying to peer into it is probably counterproductive.