According to the Stephan Brumme website you linked to, the slice-by-8 lookup tab...

brongondwana · on Dec 4, 2015

Most of the time we're iterating through a cyrus.index, where there are 104 bytes per record and we're doing a CRC32 over 100 of them. Otherwise we're reading through a twoskip database, where we CRC the header (24+ bytes, average 32) and then do a couple of mmap lookups and memcmp operations before jumping to the next header, which is normally only within a hundred bytes forwards on a big and mostly sorted database. The mmap lookahead will also have been in that close-by range.
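To make that access pattern concrete, here is a rough Python sketch of walking fixed 104-byte records and CRCing the first 100 bytes of each. The record layout is an assumption for illustration only: I'm pretending the last 4 bytes hold a big-endian CRC32 of the first 100, and `make_record` / `bad_records` are made-up helper names, not anything from Cyrus.

```python
import struct
import zlib

RECORD_SIZE = 104  # bytes per record, as described above
CRC_COVERED = 100  # bytes of each record covered by the CRC32


def make_record(payload):
    """Build a hypothetical record: 100 payload bytes + 4-byte big-endian CRC32.

    The trailer layout is an assumption for this sketch, not the real format.
    """
    assert len(payload) == CRC_COVERED
    return payload + struct.pack(">I", zlib.crc32(payload))


def bad_records(buf):
    """Return the indexes of records whose stored CRC does not match."""
    bad = []
    view = memoryview(buf)  # avoid copying each record out of the buffer
    for i in range(len(buf) // RECORD_SIZE):
        rec = view[i * RECORD_SIZE:(i + 1) * RECORD_SIZE]
        (stored,) = struct.unpack(">I", rec[CRC_COVERED:])
        if zlib.crc32(rec[:CRC_COVERED]) != stored:
            bad.append(i)
    return bad
```

The point is just that each CRC call sees about 100 bytes and the next record starts right after it, so the data itself is tiny and sequential; whatever lookup tables the CRC uses are the bigger cache consumer.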

Also, the oldest CPU on our production servers seems to be the E5520 right now, which has 128 kB of L1 data cache.

vardump · on Dec 4, 2015

I'm fairly sure E5520 has 32 kB L1 data cache, not 128 kB. L1 caches are core local, not shared like L3.
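For scale against a 32 kB L1d, here is a small Python sketch that builds slice-by-8 style tables and counts their footprint. It follows the usual construction (eight 256-entry tables of 32-bit values, reflected polynomial 0xEDB88320); the function names are mine, and the byte count assumes 4 bytes per entry as in a C `uint32_t` array, not Python's actual object sizes.

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial, the same one zlib uses


def build_tables(slices=8):
    """Build `slices` lookup tables of 256 entries each."""
    base = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
        base.append(crc)
    tables = [base]
    for _ in range(1, slices):
        prev = tables[-1]
        # Each further table advances the previous one by one more zero byte.
        tables.append([(v >> 8) ^ base[v & 0xFF] for v in prev])
    return tables


def table_bytes(tables):
    """Footprint if stored as 32-bit integers (an assumption, as in C)."""
    return sum(len(t) for t in tables) * 4


def crc32_slice8(data, tables):
    """CRC-32 of `data`, consuming 8 bytes per step, then a bytewise tail."""
    crc = 0xFFFFFFFF
    n = len(data) - len(data) % 8
    for off in range(0, n, 8):
        one = int.from_bytes(data[off:off + 4], "little") ^ crc
        two = int.from_bytes(data[off + 4:off + 8], "little")
        crc = (tables[7][one & 0xFF] ^ tables[6][(one >> 8) & 0xFF]
               ^ tables[5][(one >> 16) & 0xFF] ^ tables[4][one >> 24]
               ^ tables[3][two & 0xFF] ^ tables[2][(two >> 8) & 0xFF]
               ^ tables[1][(two >> 16) & 0xFF] ^ tables[0][two >> 24])
    for b in data[n:]:
        crc = (crc >> 8) ^ tables[0][(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF
```

With eight tables that comes to 8 × 256 × 4 = 8192 bytes, so roughly a quarter of a 32 kB L1 data cache if every table line stays hot.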

vidarh · on Dec 4, 2015

This is what the datasheet says [1]:

    - Instruction Cache = 32kB, per core
    - Data Cache = 32kB, per core
    - 256kB Mid-Level cache, per core
    - 8MB shared among cores (up to 4)

So I guess the confusion is that Intel moved the L2 cache onto each core (from Nehalem onwards, I think?) and used that opportunity to substantially lower latency for it.

[1] http://www.intel.com/content/www/us/en/processors/xeon/xeon-...