I only wish that at $DAY_JOB I ever encountered a performance problem that was as subtle as "oh you're not properly using the L2 cache". It's more like, "why did you choose an N^3 algorithm when you can do it in N log N" or "why is your design to write the same file on disk 1,000 times to perform one operation?" sigh
I wouldn't wish that on anyone, to put it generously. After a long, hard battle, only those with perverse perseverance may find such a solution, and only a few find joy in doing it.
bcantrill (who worked with Bonwick @ Sun) is always touting that paper as a classic of systems research. I finally gave in and bought a copy of the proceedings from Usenix where it was presented. Probably the best paper in a first-rate class of them.
This whole article is some sort of nonsense straw man. Of course if you spread small amounts of data out so that they are all far apart, you will get bad performance. When people talk about 'alignment' they mean 32, 64, or 128 bits for atomics and 128, 256, or 512 bits for SIMD. Putting 8 bytes at the beginning of a 4096-byte region and the next 8 bytes right next to it isn't 'unaligned access'.
A single byte access pulls in a whole 64-byte cache line (and often the adjacent line too, via the spatial prefetcher), so of course it is faster to pack data together, but who thought otherwise? It also saves TLB misses, and data accessed linearly will be prefetched.
This isn't about scanning a table whose elements are allocated contiguously; it's about scanning a linked list whose elements could be anywhere, so the next element is almost never already in the cache. That's bad enough on its own (a cache miss on every step down the list), but if the items are also all page-aligned (each sits at the same offset within its 4K page), then their addresses share a lot of bits, they map to the same cache sets, and they evict each other, slowing your scans down further. If you only scan the list once then it's no big deal, but if this is some frequently accessed kernel data structure (the list of processes or threads or GDI objects or what have you), then it's a big problem.
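To see why page alignment funnels everything into the same cache sets, here's a minimal sketch of the set-index arithmetic. The geometry is an assumption (a typical L1d: 64-byte lines, 64 sets, 8 ways); real CPUs vary, but the effect is the same whenever the set index comes straight from address bits.

```python
# Sketch: with one page-aligned node per 4 KiB page, every node
# lands in the same cache set and they fight over its few ways.
LINE_BITS = 6   # 64-byte cache lines (assumed)
SETS = 64       # assumed: 64 sets x 8 ways x 64 B = 32 KiB L1d

def cache_set(addr):
    # Set index is taken from the address bits just above the line offset.
    return (addr >> LINE_BITS) % SETS

nodes = [n * 4096 for n in range(100)]   # 100 page-aligned list nodes
print({cache_set(a) for a in nodes})     # -> {0}: all 100 nodes compete
                                         # for a single set's 8 ways
```

With 100 nodes chasing 8 ways of one set, a second pass over the list misses on nearly every step even though 99% of the cache sits idle.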
One fix is to store these objects at addresses that don't share so many bits; another is for the CPU to hash the address when choosing a cache set, rather than using the address bits as-is.
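The first fix is what's usually called cache coloring (it's one of the tricks in the Bonwick slab paper mentioned upthread): the allocator staggers each page's objects by a multiple of the line size. A sketch, under the same assumed geometry as above:

```python
# Sketch of cache coloring: offset each page's object by one extra
# cache line so successive nodes use different sets.
LINE = 64       # assumed line size
SETS = 64       # assumed number of L1d sets

def cache_set(addr):
    return (addr >> 6) % SETS

# n-th node sits (n mod SETS) lines into its page instead of at offset 0.
colored = [n * 4096 + (n % SETS) * LINE for n in range(100)]
print(len({cache_set(a) for a in colored}))  # -> 64: nodes now spread
                                             # over every set
```

Same nodes, same one-per-page layout, but the scan now has the whole cache to work with instead of a single set.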