Sorry to correct your math here, but going from 6 hours (6\*60=360 minutes) to 4...

ath92 · on Oct 13, 2020

Interestingly, a speedup of 2 orders of magnitude seems to be about what you'd expect if you only apply the algorithmic optimization (going from N^2 to Nlog(N) with N being in the range of a few thousand, like OP said)

mynegation · on Oct 13, 2020

Like I mentioned - my memory is pretty hazy about the specific speedups.

Yes, N^2 to NlogN feels like it should be a bigger factor, but remember that constants in front of this matter. You are replacing 2-level nested but very straightforward loops that rip through cached memory at blazing speed without any disruption to CPU pipelines with a single loop that involves quite a bit of condition checking and array element swaps (cache is busted, pipelines are busted - conditions are very hard to predict). You gain some, you lose some.