This (2020) reports "Despite the fact that we send the entire data array to the video card and back, sorting on GPU of 800 MB of data is performed about 25-fold faster than on the processor."
Thanks for the example! Sounds like 1.6 GB/s on an entire Tesla K80 (300W TDP).
This is in fact several times slower than our results on Skylake (with half the TDP), but note that the K80 is from 2014.
The "25-fold speedup", as is often the case for such reports, comes from not optimizing the CPU side.
For 64-bit keys, we sort about 1 GB/s per (5-year-old) Skylake core, and perhaps 5-6 GB/s when running in parallel.
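To make the comparison concrete, here is a back-of-the-envelope sketch in Python that normalizes the figures quoted above by TDP. It assumes "half the TDP" means about 150 W for the Skylake system and reads the parallel figure as roughly 5-6 GB/s in total. Both are my interpretations rather than measurements, and the function and constant names are made up for illustration.

```python
# Rough throughput-per-watt comparison using only figures quoted in this thread.
# Assumptions: Skylake TDP is half of the K80's 300 W, and the parallel
# CPU sort reaches roughly 5-6 GB/s in total.

def gb_per_s_per_watt(throughput_gb_s: float, tdp_watts: float) -> float:
    """Sort throughput (GB/s) divided by thermal design power (W)."""
    return throughput_gb_s / tdp_watts

K80_TDP_W = 300.0
SKYLAKE_TDP_W = K80_TDP_W / 2  # "half the TDP"

k80 = gb_per_s_per_watt(1.6, K80_TDP_W)
skylake_low = gb_per_s_per_watt(5.0, SKYLAKE_TDP_W)
skylake_high = gb_per_s_per_watt(6.0, SKYLAKE_TDP_W)

# Ratios of CPU to GPU efficiency under the assumptions above.
ratio_low = skylake_low / k80
ratio_high = skylake_high / k80

print(f"K80:     {k80 * 1000:.1f} MB/s per watt")
print(f"Skylake: {skylake_low * 1000:.1f} to {skylake_high * 1000:.1f} MB/s per watt")
print(f"CPU/GPU efficiency ratio: {ratio_low:.2f}x to {ratio_high:.2f}x")
```

TDP is only a crude proxy for actual power draw, so treat the ratio as an order-of-magnitude hint rather than a benchmark result.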
This (2018) reports 3.5 GB/s: https://benkarsin.files.wordpress.com/2018/10/dissertation.p... And a 6-year-old GPU radix sort reports 2.1 GB/s: https://github.com/Bulat-Ziganshin/Compression-Research/tree...
BTW, I've worked on a product that used GPUs. That typically requires moving everything to the GPU, which is not always desirable or feasible.