I am unfamiliar with the memory patterns in the application, but a 600x improvement in performance does not have to come from an increase in processing power.

If the algorithms have a lot of data reuse in their matrix computations, I can see achieving a 600x improvement over a hardware-cache-based architecture. If the CPU implementation doesn't do tiling (http://en.wikipedia.org/wiki/Loop_tiling) effectively (or can't), then it's going to shuttle a lot of data back and forth between cache and RAM.
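
To make the tiling idea concrete, here is a minimal sketch of a naive versus a tiled matrix multiply. I don't know what the application actually computes, so the function names, the square n x n shape, and the tile size of 32 are all assumptions picked for illustration. It is written in Python only to show the loop structure; the cache benefit really shows up in a compiled language, where interpreter overhead doesn't dominate.

```python
def matmul_naive(a, b, n):
    """Plain triple loop: C = A * B for n x n matrices stored as lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # b[k][j] walks down a column, so for large n each access
                # tends to touch a different cache line.
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile=32):
    """Same product, computed block by block (tile size is an assumed value)."""
    c = [[0.0] * n for _ in range(n)]
    # The three outer loops pick a tile of C, A and B.
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # The three inner loops stay inside that tile, so the same
                # small blocks are reused many times before moving on.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

Both functions do the same arithmetic; only the order of memory accesses changes. The tile size would normally be tuned so that the three active blocks fit in the fastest level of memory available.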

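The accepted term for this effect is super-linear speedup: the speedup is larger than the increase in raw processing power alone would predict, typically because the working set now fits in faster memory.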