
Most of these latencies were measured and written up for a bunch of systems by Carl Staelin and me back in the 1990s. There is a Usenix paper that describes how it was done, and the benchmarks are open source; you can apt-get them.

http://www.bitmover.com/lmbench/lmbench-usenix.pdf

If you look at the memory latency results carefully, you can easily read off L1, L2, L3, main memory, memory + TLB miss latencies.

If you look at them harder, you can read off cache sizes and associativity, cache line sizes, and page size.

Here is a 3-D graph that Wayne Scott did at Intel from a tweaked version of the memory latency test.

http://www.bitmover.com/lmbench/mem_lat3.pdf

His standard interview question is to show the candidate that graph and say "tell me everything you can about this processor and memory system". It's usually a 2 hour conversation if the candidate is good.



One other thing that is worth mentioning:

lmbench was trying to give you both bandwidths and latencies of everything in a computer system, not just memory. This is one place where it is actually worth memorizing the rough numbers; they help immensely when you are sketching out a design. lmbench tried to give you insight into the latency/bandwidth of network, disks, memory, etc.

Do you know, to the nearest order of magnitude:

- round trip time over the network

- bandwidth over your data center's network

- seek time end to end (not the silly 1/3rd seek / no rotational delay)

- disk bandwidth at the fast end

- disk bandwidth at the slow end

- memory read / write / bcopy bandwidth

- random read from memory

I've sat in hundreds of design meetings where someone was claiming they'd build an XYZ server that did $BIG_NUMBER of ops/sec and I'd just stare at the ceiling, think about the network, and go "Maybe" or "not possible" based on the numbers. There must have been some time that I was wrong but I don't remember it.

It's somewhat harder today because all the big machines have multiple cores, so it's not as simple as knowing what one core with one network interface can do. But you should at least know that; you can assume linear scaling as an upper bound and get some idea of the capacity of the machine.

It's always amazed me how easy it is to get a feel for these numbers and yet many people in the biz don't take the time to do so. I can only guess they think it is harder than it actually is.


Latency does not prevent X ops/second. It's not that hard to build a server that does a simple transaction 100,000 times a second. Staying synchronized across several machines at that throughput level, though, can be next to impossible.


Pedantic quip: I have a hard time believing you guys were measuring half nanosecond cache latencies on a machine with a 100MHz clock. :)

And actually the cache numbers seem optimistic, if anything. My memory is that a L1 cache hit on SNB is 5 cycles, which is 2-3x as long as that table shows.


We didn't believe it either until we put a logic analyzer on the bus and found that the numbers were spot on with respect to the number of cycles. I don't remember how far off they were, but it wasn't much; all the hardware dudes were amazed that software could get that close.

tl;dr: the numbers were accurate to the # of cycles, might have been as much as 1/2 of 1 cycle off.

Edit: I should add this was almost 20 years ago, I dunno how well it works today. Sec, lemme go test on a local machine.

OK, I ran on an Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz (I think that's a Sandy Bridge) that is overclocked to 4289 MHz according to mhz, and it looks to me like that machine takes 4 cycles to do a L1 load. That sound right? lmbench says 4.05 cycles.

I poked a little more and I get

L1: 4 cycles, ~48K

L2: 12 cycles, ~256K

L3: 16 cycles, ~6M

Off to google and see how far off I am. Whoops, work is calling, will check back later.


You realize he is measuring averages, right? Set up 1 million L1 cache hits, divide by total time...


That mem_lat3 is pretty interesting, is there more info about it anywhere? Do you know what "Latency" refers to? Is that for Sandy Bridge consumer quad core? (hence ending at 8MB since that is L3$ size...)?


That data is old; it's from when Wayne was still working at Intel, and he's been working here since 2001, so I'm guessing that is a Pentium Pro or similar.

I'll ping him and he'll probably answer tomorrow, it's late in his day in Indiana.


This is the original Pentium Pro with the full speed L2. Don't forget the TLB when parsing the data.


Interesting. What do you think of Agner Fog's work?


Wasn't aware of it, so thanks for the question. I took a quick look and it looks like Agner is using performance counters, which, while being more accurate, are platform specific. One of the goals of lmbench was to be accurate across different platforms and CPU architectures. So it's all userland software only.

You can get pretty far that way. Carl wrote a userland tool called mhz that has been working without modification since around 1998 (when the fastest clock rate was 600 MHz, on an Alpha). It's still pretty accurate to this day, when CPUs have clock rates 7x faster and fairly different architectures.

Carl wrote up how it works, might be interesting to people who think about CPUs: http://www.bitmover.com/lmbench/mhz-usenix.pdf



