
I've written a benchmark to measure such things, and from what I can tell:

Each fast core has an L1D of 128KB.

The fast cores share a cluster with a 12MB L2; cache misses go to main memory.

The slow cores have a 4MB L2.

The cache misses from the fast L2 can't quite saturate the main memory system (I believe it's 8 channels of 16 bits). So when all cores are busy, you keep 12MB of L2 for the fast cores and 4MB of L2 for the slow cores, and you end up getting better throughput from the memory system, since you are keeping all 8 channels busy.
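For anyone wanting to try this kind of measurement: a common technique (not necessarily what my benchmark does internally) is pointer chasing over a random single-cycle permutation, so every load depends on the previous one and the prefetcher can't help. Below is a minimal Python sketch of the idea. The names `sattolo_cycle` and `chase` are made up for illustration, and in Python the interpreter overhead dominates the absolute numbers, so treat it as a demonstration of the access pattern; a serious version would be written in C or similar.

```python
import random
import time


def sattolo_cycle(n, rng):
    """Return a random permutation of range(n) that forms one single cycle."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        # Picking j strictly below i (Sattolo's algorithm) guarantees
        # the result is one cycle that visits every slot.
        j = rng.randrange(i)
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def chase(perm, steps):
    """Follow the permutation for `steps` dependent loads.

    Returns (average nanoseconds per step, final index).
    """
    idx = 0
    start = time.perf_counter()
    for _ in range(steps):
        idx = perm[idx]  # the next load depends on the previous one
    elapsed = time.perf_counter() - start
    return elapsed / steps * 1e9, idx


if __name__ == "__main__":
    rng = random.Random(0)
    # Growing working sets: per-step time tends to rise as the set
    # stops fitting in each cache level (heavily blurred in Python).
    for n in (1 << 10, 1 << 14, 1 << 18):
        perm = sattolo_cycle(n, rng)
        ns_per_step, _ = chase(perm, 200_000)
        print(f"{n:>8} entries: {ns_per_step:.1f} ns/step")
```

Plotting time per step against working-set size is what usually reveals the cache capacities as steps in the curve.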



Wonder if the SLC is mostly used for coherency purposes and the other blocks then...

And yeah, it's 128-bit wide LPDDR4X-4266, pretty quick imo.
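Quick back-of-the-envelope on what that works out to, as a sketch. It assumes "4266" means 4266 MT/s per pin and that the 128 bits are split as 8 channels of 16 bits, as suggested upthread; the function names are just for illustration.

```python
CHANNELS = 8                # assumed from the thread: 8 channels
BITS_PER_CHANNEL = 16       # assumed from the thread: 16 bits each
TRANSFERS_PER_SEC = 4266e6  # LPDDR4X-4266, taken as 4266 MT/s


def bus_width_bits(channels, bits_per_channel):
    """Total data bus width in bits."""
    return channels * bits_per_channel


def peak_bandwidth_gb_s(width_bits, transfers_per_sec):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return width_bits / 8 * transfers_per_sec / 1e9


width = bus_width_bits(CHANNELS, BITS_PER_CHANNEL)
total = peak_bandwidth_gb_s(width, TRANSFERS_PER_SEC)
per_channel = peak_bandwidth_gb_s(BITS_PER_CHANNEL, TRANSFERS_PER_SEC)
print(f"bus width: {width} bits")
print(f"peak: {total:.3f} GB/s total, {per_channel:.3f} GB/s per channel")
```

That is a theoretical peak of about 68 GB/s; sustained numbers will be lower.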


Not just 128 bits wide (standard on high-end laptops and most desktops), but 8 channels. The latency is halved, whereas over the last decades I've only seen very modest improvements in main-memory latency, on the order of 3-5% a year.
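To put a halving in perspective, here is a small sketch of how long it takes to get there at 3-5% a year. It assumes the improvement compounds, i.e. latency is multiplied by (1 - rate) each year; `years_to_halve` is just an illustrative helper.

```python
import math


def years_to_halve(annual_improvement):
    """Years until latency halves if it shrinks by this fraction each year.

    Solves (1 - r) ** years == 0.5 for years.
    """
    return math.log(0.5) / math.log(1.0 - annual_improvement)


for rate in (0.03, 0.05):
    print(f"{rate:.0%} per year: {years_to_halve(rate):.1f} years to halve")
```

Under that assumption a halving is somewhere between roughly 13.5 and 23 years of typical progress.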



