
Most of these latencies were measured and written up for a bunch of systems by Carl Staelin and me back in the 1990s. There is a Usenix paper that describes how it was done, and the benchmarks are open source; you can apt-get them.

http://www.bitmover.com/lmbench/lmbench-usenix.pdf

If you look at the memory latency results carefully, you can easily read off L1, L2, L3, main memory, memory + TLB miss latencies.

If you look at them harder, you can read off cache sizes and associativity, cache line sizes, and page size.

Here is a 3-D graph that Wayne Scott did at Intel from a tweaked version of the memory latency test.

http://www.bitmover.com/lmbench/mem_lat3.pdf

His standard interview question is to show the candidate that graph and say "tell me everything you can about this processor and memory system". It's usually a 2 hour conversation if the candidate is good.



One other thing that is worth mentioning:

lmbench was trying to give you both bandwidths and latencies of everything in a computer system, not just memory. This is one place where it is actually worth memorizing the rough numbers; they help immensely when you are sketching out a design. lmbench tried to give you insight into the latency/bandwidth of network, disks, memory, etc.

Do you know, to the nearest order of magnitude:

- round trip time over the network

- bandwidth over your data center's network

- seek time end to end (not the silly 1/3rd seek / no rotational delay)

- disk bandwidth at the fast end

- disk bandwidth at the slow end

- memory read / write / bcopy bandwidth

- random read from memory

I've sat in hundreds of design meetings where someone was claiming they'd build an XYZ server that did $BIG_NUMBER of ops/sec and I'd just stare at the ceiling, think about the network, and go "Maybe" or "not possible" based on the numbers. There must have been some time that I was wrong but I don't remember it.

It's somewhat harder today because all the big machines have multiple cores, so it's not as simple as knowing what one core with one network interface can do. But you should at least know that; you can assume linear scaling as an upper bound and get some idea of the capacity of the machine.

It's always amazed me how easy it is to get a feel for these numbers and yet many people in the biz don't take the time to do so. I can only guess they think it is harder than it actually is.


Latency does not prevent X ops/second. It's not that hard to build a server that does a simple transaction 100,000 times a second. Staying synchronized across several machines at that throughput level, though, can be next to impossible.


Pedantic quip: I have a hard time believing you guys were measuring half nanosecond cache latencies on a machine with a 100MHz clock. :)

And actually the cache numbers seem optimistic, if anything. My memory is that a L1 cache hit on SNB is 5 cycles, which is 2-3x as long as that table shows.


We didn't believe it either until we put a logic analyzer on the bus and found that the numbers were spot on with respect to the number of cycles. I don't remember how far off they were, but it wasn't much; all the hardware dudes were amazed that software could get that close.

tl;dr: the numbers were accurate to the # of cycles, might have been as much as 1/2 of 1 cycle off.

Edit: I should add this was almost 20 years ago, I dunno how well it works today. Sec, lemme go test on a local machine.

OK, I ran on an Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz (I think that's a Sandy Bridge) that is overclocked to 4289 MHz according to mhz, and it looks to me like that machine takes 4 cycles to do a L1 load. That sound right? lmbench says 4.05 cycles.

I poked a little more and I get

L1: 4 cycles, ~48K

L2: 12 cycles, ~256K

L3: 16 cycles, ~6M

Off to google and see how far off I am. Whoops, work is calling, will check back later.


You realize he is measuring averages, right? Set up 1 million L1 cache hits, divide by total time...


That mem_lat3 is pretty interesting, is there more info about it anywhere? Do you know what "Latency" refers to? Is that for Sandy Bridge consumer quad core? (hence ending at 8MB since that is L3$ size...)?


That data is old; it's from when Wayne was still working at Intel, and he's been working here since 2001, so I'm guessing that is a Pentium Pro or similar.

I'll ping him and he'll probably answer tomorrow, it's late in his day in Indiana.


This is the original Pentium Pro with the full speed L2. Don't forget the TLB when parsing the data.


Interesting. What do you think of Agner Fog's work?


Wasn't aware of it, so thanks for the question. I took a quick look and it looks like Agner is using performance counters, which, while being more accurate, are platform specific. One of the goals of lmbench was to be accurate across different platforms and CPU architectures. So it's all userland software only.

You can get pretty far that way. Carl wrote a userland tool called mhz that has been working without modification since around 1998 (when the fastest clock rate was 600 MHz, on an Alpha). It's still pretty accurate to this day, when CPUs have clock rates 7x faster and fairly different architectures.

Carl wrote up how it works, might be interesting to people who think about CPUs: http://www.bitmover.com/lmbench/mhz-usenix.pdf



