Interestingly, a speedup of 2 orders of magnitude seems to be about what you'd expect if you only apply the algorithmic optimization (going from N^2 to Nlog(N) with N being in the range of a few thousand, like OP said)
Like I mentioned - my memory is pretty hazy about the specific speedups.
Yes, N^2 to NlogN feels like it should be a bigger factor, but remember that constants in front of this matter. You are replacing 2-level nested but very straightforward loops that rip through cached memory at blazing speed without any disruption to CPU pipelines with a single loop that involves quite a bit of condition checking and array element swaps (cache is busted, pipelines are busted - conditions are very hard to predict). You gain some, you lose some.