Most of these latencies were measured and written up for a bunch of systems by Carl Staelin and me back in the 1990s. There is a USENIX paper that describes how it was done, and the benchmarks are open source; you can apt-get them.
lmbench was trying to give you both the bandwidths and the latencies of everything in a computer system (network, disks, memory, etc.), not just memory. This is one place where it is actually worth memorizing the rough numbers; they help immensely when you are sketching out a design.
Do you know, to the nearest order of magnitude:
round trip time over the network
bandwidth over your data center's network
disk seek time end to end (not the silly 1/3-stroke seek with no rotational delay)
disk bandwidth at the fast (outer) end
disk bandwidth at the slow (inner) end
memory read / write / bcopy bandwidth
random read from memory
I've sat in hundreds of design meetings where someone was claiming they'd build an XYZ server that did $BIG_NUMBER of ops/sec and I'd just stare at the ceiling, think about the network, and go "Maybe" or "not possible" based on the numbers. There must have been some time that I was wrong but I don't remember it.
It's somewhat harder today because all the big machines have multiple cores, so it's not as simple as knowing what one core with one network interface can do. But you should at least know that; you can assume linear scaling as an upper bound and get some idea of the capacity of the machine.
It's always amazed me how easy it is to get a feel for these numbers and yet many people in the biz don't take the time to do so. I can only guess they think it is harder than it actually is.
Latency does not prevent X ops/second. It's not that hard to build a server that does a simple transaction 100,000 times a second. Keeping state synchronized across several machines at that throughput level can be next to impossible.
Pedantic quip: I have a hard time believing you guys were measuring half nanosecond cache latencies on a machine with a 100MHz clock. :)
And actually the cache numbers seem optimistic, if anything. My memory is that an L1 cache hit on SNB is 5 cycles, which is 2-3x as long as that table shows.
We didn't believe it either until we put a logic analyzer on the bus and found that the numbers were spot on with respect to the number of cycles. I don't remember how far off they were but it wasn't much; all the hardware dudes were amazed that software could get that close.
tl;dr: the numbers were accurate to the # of cycles, might have been as much as 1/2 of 1 cycle off.
Edit: I should add this was almost 20 years ago, I dunno how well it works today. Sec, lemme go test on a local machine.
OK, I ran on an Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz (I think that's a Sandy Bridge) that is overclocked to 4289 MHz according to mhz, and it looks to me like that machine takes 4 cycles to do an L1 load. Does that sound right? lmbench says 4.05 cycles.
That mem_lat3 is pretty interesting, is there more info about it anywhere? Do you know what "Latency" refers to? Is that for Sandy Bridge consumer quad core? (hence ending at 8MB since that is L3$ size...)?
Wasn't aware of it, so thanks for the question. I took a quick look and it looks like Agner is using performance counters, which, while being more accurate, are platform specific. One of the goals of lmbench was to be accurate across different platforms and CPU architectures. So it's all userland software only.
You can get pretty far that way. Carl wrote a userland tool called mhz that has been working without modification since around 1998 (when the fastest clock rate was 600 MHz, on an Alpha). It's still pretty accurate to this day, when CPUs have clock rates 7x faster and fairly different architectures.
http://www.bitmover.com/lmbench/lmbench-usenix.pdf
If you look at the memory latency results carefully, you can easily read off L1, L2, L3, main memory, memory + TLB miss latencies.
If you look at them harder, you can read off cache sizes and associativity, cache line sizes, and page size.
Here is a 3-D graph that Wayne Scott did at Intel from a tweaked version of the memory latency test.
http://www.bitmover.com/lmbench/mem_lat3.pdf
His standard interview question is to show the candidate that graph and say "tell me everything you can about this processor and memory system". It's usually a 2 hour conversation if the candidate is good.