I mean, did you see the very complicated extremely optimized C and C++ code lowe...

saagarjha · on Oct 29, 2021

The speed of this program is partly because it's written in assembly, but mostly because it's written by someone who is quite clever and clearly put a large amount of time into this problem. None of the other solutions spend much time trying to fit their data into the CPU cache, nor do they drop to using splicing for zero copies, and not one is doing anything nearly as clever as this program is to generate its numbers. All of this would be possible to mostly translate to C++ with AVX intrinsics, but the real accelerator here is not the choice of language, it's the person behind the code.

cormacrelf · on Oct 29, 2021

Now that I have seen the power of madvise + huge pages, everything looks like a nail. Author reckons 30% from less page table juggling. There are techniques here that apply outside assembly.
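
For anyone who wants to try the madvise + huge pages idea outside assembly, here is a minimal C sketch, assuming Linux with transparent huge pages available and a 2 MiB huge page size (typical on x86-64, but an assumption here). The names `HUGE_PAGE_SIZE` and `alloc_hugepage_hinted` are made up for illustration and are not from the original program.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

/* Assumed huge page size: 2 MiB is common on x86-64, but check your system. */
#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)

/*
 * Allocate `size` bytes aligned to HUGE_PAGE_SIZE and ask the kernel to back
 * the range with transparent huge pages. `size` should be a multiple of
 * HUGE_PAGE_SIZE so whole huge pages can be used.
 * Returns NULL if the allocation fails. Release the buffer with free().
 */
static void *alloc_hugepage_hinted(size_t size) {
    void *buf = NULL;
    if (posix_memalign(&buf, HUGE_PAGE_SIZE, size) != 0) {
        return NULL;
    }
#ifdef MADV_HUGEPAGE
    /* Only a hint: it can fail (for example when THP is disabled), and the
     * buffer is still usable with normal pages in that case. */
    (void)madvise(buf, size, MADV_HUGEPAGE);
#endif
    return buf;
}
```

Whether this buys anything close to the 30% mentioned above depends on the workload; it mainly helps when a large buffer is touched over and over.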

DeathArrow · on Oct 29, 2021

It's not ASM that makes the code fast, it's the way he laid out the data and code. C/C++ should be able to approach 90% of the speed of this.

gpderetta · on Nov 1, 2021

Most other implementations do not use vmsplice, which is a huge win for this specific problem.

Of course all the AVX and cache optimizations are also exceedingly clever.
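
To show roughly what using vmsplice looks like from C, here is a minimal sketch, assuming Linux and that the output file descriptor is a pipe (vmsplice fails otherwise). The helper name `splice_all` is hypothetical, and this is not the original program's code.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hand `len` bytes starting at `buf` to the pipe `pipe_fd` with vmsplice(),
 * which maps the user pages into the pipe instead of copying them as
 * write() would. Returns 0 on success, -1 on error with errno set.
 *
 * Caveat: the pipe refers to the caller's pages, so the buffer should not be
 * overwritten until the reader has consumed the data.
 */
static int splice_all(int pipe_fd, char *buf, size_t len) {
    while (len > 0) {
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        ssize_t n = vmsplice(pipe_fd, &iov, 1, 0);
        if (n < 0) {
            if (errno == EINTR) {
                continue; /* interrupted by a signal, retry */
            }
            return -1;
        }
        /* vmsplice may accept only part of the range when the pipe fills. */
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

A program would call this with `STDOUT_FILENO` when stdout is a pipe, as in `./fizzbuzz | pv > /dev/null`, and fall back to plain `write()` when it is not.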