I'm not sure what disappointment you're predicting. Unless your GPU is connected to your CPU through a cache-coherent protocol like CXL, you are unlikely to make your code run faster by transferring data to the CPU and then back to the GPU. The 4090 has 128 compute units; even at a lower clock frequency and with higher memory latency, you will probably not end up too far from the performance of an 8-core CPU running at 4.5 GHz. Nobody is running millions of CPU threads in the first place, so you seem to be completely misunderstanding the workload here. Nobody wants to speed up their CPU code by running it on the GPU; they want to stop slowing down their GPU code by waiting for data transfers to and from the CPU.
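
To make the trade-off concrete, here is a rough back-of-envelope model in Python. It is only a sketch: the function names and every number in it (transfer latency, link bandwidth, buffer size, and how much slower the code runs on the GPU) are placeholder assumptions, not measurements of a 4090 or any real system.

```python
def round_trip_seconds(num_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to copy num_bytes from GPU to CPU and back again (two transfers)."""
    one_way = latency_s + num_bytes / bandwidth_bytes_per_s
    return 2 * one_way


def staying_on_gpu_wins(cpu_seconds, gpu_slowdown, num_bytes,
                        latency_s, bandwidth_bytes_per_s):
    """Compare running a step on the GPU against shipping it to the CPU.

    cpu_seconds:  time the step would take on the CPU (assumed)
    gpu_slowdown: how many times slower the same step is on the GPU (assumed)
    """
    gpu_seconds = cpu_seconds * gpu_slowdown
    offload_seconds = cpu_seconds + round_trip_seconds(
        num_bytes, latency_s, bandwidth_bytes_per_s)
    return gpu_seconds < offload_seconds


if __name__ == "__main__":
    # All values below are hypothetical placeholders, not benchmarks.
    wins = staying_on_gpu_wins(
        cpu_seconds=1e-3,             # step takes 1 ms on the CPU
        gpu_slowdown=4.0,             # same step is 4x slower on the GPU
        num_bytes=64 * 1024 * 1024,   # 64 MiB working set
        latency_s=10e-6,              # per-transfer latency
        bandwidth_bytes_per_s=25e9,   # link bandwidth
    )
    print("stay on GPU" if wins else "offload to CPU")
```

Plug in your own measurements. The point is only that the GPU version does not have to beat the CPU; it just has to lose by less than the round trip costs.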