Honestly, I'd rather programmers know how to _measure_ these numbers than just have them memorized.
I mean, if I told them that their machine had L3 cache now, what would they do to find out how that changes things? (This comment is also a shameless plug for the fantastic CS:APP book out of CMU).
Honest question: how would you measure an L2 cache lookup? (What program would I need to write which ensures that a value is stored in L2, such that when I read it later, I would know the lookup time to indeed be the L2 time?)
At that, how would one measure this kind of thing at the nanosecond level? Would using POSIX's clock_gettime() be good enough? Is there any other facility to count the cycles that have passed between two operations?
cracks knuckles Let's see if I remember enough of what I'm supposed to know for an exam in a couple of months.
You need to know the line size and capacity of your L1 cache. Then you can write a program that walks a chunk of memory big enough that it predictably overflows the L1 cache, but still fits in the L2 cache.
That way you know that whenever your program misses in L1 and has to evict a line, the replacement is served from L2.
The caveat being that you can probably only do this reliably in some sort of assembly code (for your architecture), and that you would have to be running a single-process system with interrupts disabled. Otherwise all sorts of things can mess up your cache measurements.
That's actually not good enough, because modern CPUs have prefetchers, so if you just access a 33kB chunk sequentially, it will still almost always be served from L1 (the prefetcher probably uses misses to train, so you'll see a few misses before it catches up), even if the L1 is just 32kB.
To predictably miss caches you need to have a random access pattern.
Well, back in the early 90's, HP RISC was the only cache controller we found that had cache line prefetch. If you kept increasing your stride, you eventually reached one big enough that you were skipping enough cache lines that it stopped prefetching.
We switched to going backwards, that fooled them for a while but now most cache controllers seem to detect forwards, backwards, and uniform strides, which is actually pretty neat.
So yeah, today it would seem that random is what you need.
What you could do is write a program that uses increasing amounts of memory. Measure the time it takes to complete a few million iterations of some function that accesses that memory. As the total usage increases, you'll see steps in the timing. For example, you might see a jump at ~32K, then ~2MB, then ~8MB.
I think you'd just need to access memory in increasingly larger increments (eg, sequential reads of 32k, 64k, ... up to 2x the size of the L2 cache). But there's more to it, as the L2 cache is often shared amongst cores in a multi-cpu system (whereas the L1 is per-core). Also, the organization of the cache (eg, n-way set associative) means that you'd want to vary the size of the jump (stride) for your reads/writes and see how that also affects throughput.
>But there's more to it, as the L2 cache is often shared amongst cores in a multi-cpu system (whereas the L1 is per-core)
As comatose_kid notes, this is highly dependent on CPU architecture. Intel's architectures from Nehalem to the current Ivy Bridge have smaller per-core L2 caches, and a shared L3 cache across all cores.
You can't just access memory in that pattern, because an out-of-order processor will just issue the next N loads in parallel. Instead, lmbench sets up a linked list in memory and walks the list. That way the address for each load is not known until the previous load completes. -Wayne
I don't know how you actually measure how long a lookup takes, but just FYI, if you're interested in the number of lookups you're doing from the various caches (and in things like the number of branch mispredictions you're causing), you can use OProfile or Cachegrind. OProfile is a bit more complicated to set up because the flags you need to set are processor-specific, but it causes much less slowdown.
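The Cachegrind route is just two commands (./myprog and the output file's pid suffix are placeholders here):

```shell
# Run the program under Cachegrind's cache simulator (expect a large
# slowdown, since every access is simulated):
valgrind --tool=cachegrind ./myprog

# Annotate the results with per-function hit/miss counts; Cachegrind
# writes its data to cachegrind.out.<pid>:
cg_annotate cachegrind.out.<pid>
```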
I love performance counters and good profiling tools. But the whole point of knowing this stuff is to get a handle on expected performance before you have written a line of code. Then you can make good decisions about what to do. Later, the profiling tools can help you fix what is broken when your code doesn't match expectations.
+1 to CS: APP (Computer Systems: A Programmer's Perspective, by Bryant and O'Hallaron). I thought Computer Architecture: A Quantitative Approach by Hennessy was dry and better suited to EEs. CS: APP, on the other hand, really sucked me in - and I'm not the kind of guy that typically gets enthralled by CS texts.
My problem with the memorization approach is that the technology is still moving quickly enough that the orders of magnitude still "wiggle," particularly in the middle numbers.
For example, at a glance, the disk seek numbers are off by orders of magnitude for SSD drives. And the 1MB of RAM read numbers appear not to have used the SSE streaming load/store instructions. I don't know anything about datacenters, but I'd certainly want to measure that one myself before building a system assuming that number was in its correct place in the hierarchy.
I'd love to be able to either measure or use all this lovely information in my running Linux desktop GUI app. But I happen to know that it would probably not be worth my time except in some rather specific cases. Plus, the direction of PC architecture seems to be toward more and more levels of cache, and that implies more and more work if you actually take these into account.
Just finding the time that library calls take is quite a pain in Linux.
Given all this, I'm inclined to think that I'd rather architect cache-oblivious data structures than root around endlessly when background threads and other stuff suddenly get slow...