How misaligning data can increase performance by reducing cache misses (2014) (danluu.com)
94 points by signa11 on July 19, 2017 | 15 comments


I only wish that at $DAY_JOB I ever encountered a performance problem that was as subtle as "oh you're not properly using the L2 cache". It's more like, "why did you choose an N^3 algorithm when you can do it in N log N" or "why is your design to write the same file on disk 1,000 times to perform one operation?" sigh


I wouldn't wish that on anyone, to put it generously. After a long, hard battle, only those with perverse perseverance will find such a solution, and only a few can find joy in doing it.


Funnily enough, I'm only here because I'm waiting for just such an operation to complete. Yeah, I'm on here a lot, it seems. <sigh>


This is also discussed in section 4 of Jeff Bonwick's 1994 paper "The Slab Allocator: An Object-Caching Kernel Memory Allocator".


bcantrill (who worked with Bonwick @ Sun) is always touting that paper as a classic of systems research. I finally gave in and bought a copy of the proceedings from Usenix where it was presented. Probably the best paper in a first-rate class of them.


And if you really want to see me tout it, check out my Systems We Love talk on the slab allocator![1] ;)

[1] https://systemswe.love/videos/down-memory-lane


Old Cray computers had the best memory bandwidth when the fast dimension of an array was of length 64*n +- 1.
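A toy model of why that works, assuming a banked, interleaved memory where consecutive words live in consecutive banks and a busy bank stalls the next access (illustrative parameters, not a real Cray's):

```python
# Sketch: bank conflicts on interleaved memory with a power-of-two stride.
# Assume 64 interleaved banks, one word per bank per cycle (hypothetical
# numbers chosen to mirror the 64*n +- 1 rule of thumb).

NUM_BANKS = 64

def banks_hit(stride_words, accesses):
    """Bank index touched by each access when walking memory at a fixed stride."""
    return [(i * stride_words) % NUM_BANKS for i in range(accesses)]

# Walking down a column of a 64-wide array: every access hammers bank 0.
assert banks_hit(64, 8) == [0] * 8

# With a 65-wide array (64*1 + 1), the same walk cycles through all 64 banks.
assert len(set(banks_hit(65, 64))) == NUM_BANKS
```

With the stride coprime to the bank count, successive accesses land on different banks and the memory system can keep every bank busy.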



It would actually be a nice feature to have in HN, if it gave you previous submissions automatically.


There's a button which does that: "past", up right under the post link.


Oh! I don't know why I never noticed that.

You know what would make it better is if it added the number of past submissions just like total comments.

2 past

5 comments


Like "previous discussions" on reddit


This whole article is some sort of nonsense straw man. Of course if you spread small amounts of data out so that they are all far apart, you will get bad performance. When people talk about 'alignment' they mean 32, 64, or 128 bits for atomics and 128, 256, or 512 bits for SIMD. Putting 8 bytes at the beginning of a 4096-byte section and the next 8 bytes right next to it isn't 'unaligned access'.

Cache lines of 64 bytes are filled on even a single-byte access, so of course it is faster to pack data together, but who thought differently? It saves TLB misses as well, and if accessed linearly the data will be prefetched.
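The packing point can be made concrete with arithmetic alone. A rough sketch, assuming 64-byte lines and arbitrary example addresses:

```python
# Toy model: the 64-byte cache line an address falls in is addr // 64.
LINE = 64

def lines_touched(addrs):
    """Set of cache lines covered by the given byte addresses."""
    return {a // LINE for a in addrs}

packed = [0x10000 + i * 8 for i in range(8)]     # eight 8-byte values, adjacent
spread = [0x10000 + i * 4096 for i in range(8)]  # same values, one per 4 KiB page

assert len(lines_touched(packed)) == 1   # one line fill brings in all eight
assert len(lines_touched(spread)) == 8   # a separate line (and TLB entry) each
```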


This isn't about scanning a table where the elements are allocated contiguously, it's about scanning a linked list where the elements could be anywhere; the next element is almost never going to already be in the cache. That's already bad enough (a cache miss on every step down the list), but if those items are all page-aligned (all at the same position relative to the 4k memory page they're in), then they all have a lot of memory address bits in common and end up evicting each other from the cache, slowing your scans down. If you only scan the list once then it's no big deal, but if this is some frequently-accessed kernel data structure (list of processes or threads or GDI objects or what have you), then it's a big problem.

One fix is to ensure that these objects are stored at locations that don't share so many common bits, or for the CPU to actually have a hash function that it uses on the address, rather than using the address as-is.
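The eviction effect is easy to see in a toy set-indexed cache. A sketch assuming a hypothetical 32 KiB, 8-way L1 with 64-byte lines (real geometries vary, and some CPUs do hash the index bits as suggested above):

```python
# Toy L1: 32 KiB, 8-way associative, 64-byte lines -> 64 sets.
# The set index is the address bits just above the line offset.
LINE, WAYS, SIZE = 64, 8, 32 * 1024
NUM_SETS = SIZE // (LINE * WAYS)  # 64

def set_index(addr):
    return (addr // LINE) % NUM_SETS

# 100 page-aligned list nodes: every one lands in set 0, so only 8 of
# them (the associativity) can live in the cache at once.
page_aligned = [n * 4096 for n in range(100)]
assert {set_index(a) for a in page_aligned} == {0}

# Sliding each node by a different multiple of the line size (one way to
# avoid the shared low bits) spreads the nodes over all the sets.
offset = [n * 4096 + (n % NUM_SETS) * LINE for n in range(100)]
assert {set_index(a) for a in offset} == set(range(NUM_SETS))
```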


This must be somewhat related to an idea I read about recently: indexing a binary heap other than in the natural way can avoid cache misses and increase priority queue performance: http://playfulprogramming.blogspot.co.uk/2015/08/cache-optim...
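One flavor of that idea (a sketch of the general technique, not necessarily the exact scheme in the linked post): with 8-byte keys and 64-byte cache lines, an implicit 8-ary heap puts all of a node's children in a single line, so the "find smallest child" step in a sift-down costs one line fill instead of several.

```python
# Implicit d-ary heap indexing (0-based): node i's children are
# d*i + 1 .. d*i + d, stored contiguously in the backing array.

def children(i, d):
    return range(d * i + 1, d * i + d + 1)

def parent(i, d):
    return (i - 1) // d

assert list(children(0, 2)) == [1, 2]             # classic binary heap
assert list(children(0, 8)) == list(range(1, 9))  # 8 children, one cache line
assert all(parent(c, 8) == 0 for c in children(0, 8))
```

The trade-off is more key comparisons per level against fewer levels and fewer cache misses, which often wins when the heap spills out of cache.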



