Intel DDIO, LLC cache, buffer alignment, prefetching, and packet rates

ori_b · on April 12, 2015

One thing that I've always wondered was why it seems like most cache line mappings seem to collide at powers of two -- since so much software uses power of two sized buffers, it seems like it would be a big win to bucket by some non power of two size.

dfox · on April 12, 2015

The point of having less-than-fully-associative cache is that it is both cheaper and faster than fully-associative one. Using some kind of complex mapping would require significant additional circuitry, which would certainly introduce significant additional delay (or even additional clock cycle).

ori_b · on April 12, 2015

I'm not saying that the cache should be fully associative, I'm saying that instead of each bucket being a power of two, make it a prime number -- eg, 67 byte cache lines instead of 64, for example, seem to me like they would prevent quite a bit of aliasing with common data sizes.

There would be more circuitry to divide the addresses. I haven't done too much with hardware (only some light FPGA experience), so I don't know how much extra, but it doesn't feel like it would be prohibitive. I could be wrong.

(If you're aware of someone doing analysis on that, I'd be curious to see it -- I haven't seen any.)

jkloosterman · on April 12, 2015

Computer architecture researchers have been working at this for a long time. Here's a recent paper on a technique to do the division and modulus operations for arbitrary numbers of cache banks and set sizes.

http://www.cs.utexas.edu/~skeckler/pubs/MICRO_2014_Modulus_I...

They evaluate the technique on a GPU because bank and set conflicts can lead to bad performance more easily on a GPU than a CPU.

dfox · on April 12, 2015

What I meant is that making the cache fully-associative would result in faster implementation than N-way associative with non-power-of-2 cache line size or bucket count because of the additional circuitry involved in selecting right bucket (integer modulus and so on).

Interesting trick is using different values for cache index and tag (logical address for index and physical address as tag is common, but that's mostly about doing cache lookup concurrently with TLB lookup). In theory one could use result of some hash function as index, which would certainly change the aliasing behavior (although it is not exactly certain that such change would be for the better) and again, there is the additional cost of calculating said hash function (in combinatorial logic and with reasonably small delay).

MarkSweep · on April 12, 2015

I don't know much about this either, but I think if cache line size is not a multiple of the RAM transfer size, bandwidth will be wasted.

zerohp · on April 12, 2015

It would be prohibitively slow. Division and modulus are the slowest integer operations.

TheLoneWolfling · on April 13, 2015

Correction: Division by arbitrary integer valued variables are the slowest integer operations.

You can accomplish division by a constant / modulus by a constant without a full division, however. For example, (x == x % 256 - x * 256), mod 257. That sort of thing.

ramubhai · on April 12, 2015

Processor cache optimisations feel like true fine-art. :)

userbinator · on April 13, 2015

Incidentally, unaligned data accesses on x86 CPUs have been basically as fast as aligned ones in most cases for the past few years:

http://lemire.me/blog/archives/2012/05/31/data-alignment-for...