I might be somehow miscounting, but it seems to me that for the slow implementat...

ascar · on May 30, 2022

oh you're right, my bad. Sloppy on my part!

I somehow got tripped up by the 4xSIMD. I was assuming you meant it's using 4x 64bit SIMD there which it doesn't. mulpd and addpd are 2x 64bit, also visible by the xmm instead of ymm registers.

I got sloppy on the difference between all instructions including the loop logic vs just the instructions necessary to do the main computation. Obviously the first is the correct measure and I was sloppy.

tomxor · on May 30, 2022

I think the confusion might be coming from mixing "instructions" with "operations", which are not equivalent especially since we are discussing SIMD which packs in multiple operations (single instruction operating on multiple data). If you re-read this thread and replace instructions with operations it makes more sense.