At the time, yes, but I do not have those numbers anymore. Trying my best to remember - I would estimate that the speedup of 360 / 4 = 90x is approximately: 4x multithreading (it was pretty linear speedup), 7x the algorithm and 3x the memory inefficiencies.
Interestingly, a speedup of 2 orders of magnitude seems to be about what you'd expect if you only apply the algorithmic optimization (going from N^2 to Nlog(N) with N being in the range of a few thousand, like OP said)
Like I mentioned - my memory is pretty hazy about the specific speedups.
Yes, N^2 to NlogN feels like it should be a bigger factor, but remember that constants in front of this matter. You are replacing 2-level nested but very straightforward loops that rip through cached memory at blazing speed without any disruption to CPU pipelines with a single loop that involves quite a bit of condition checking and array element swaps (cache is busted, pipelines are busted - conditions are very hard to predict). You gain some, you lose some.
EDIT: fixed my math, thank you andruby