Everything vardump said is true, but additionally there's the consideration of what isn't there as well as what is. By not having an out-of-order execution setup or a broad forwarding network, a GPU core can be much smaller than a CPU core of equivalent throughput. By "core" here I mean something capable of independently issuing instructions and executing them, not calling each execution unit a "core" the way Nvidia does (though with SIMT that isn't as crazy as it might look at first). That means you can pack more cores on a chip and use less power for a given amount of work, since power consumption is roughly proportional to the number of transistors used. If we eventually start doing graphics on CPUs it will be because we move to ray tracing or other sorts of algorithms that GPUs are bad at.
One more thing. GPUs don't have to worry about precise interrupts, like when you try to load a piece of memory only to find that you have to page it in or some such. If they're in a memory-protected setup where that's even possible, it'll be the CPU that has to deal with the mess; the GPU can just halt while that's taken care of.
This makes SIMT easy. SIMT is "Single Instruction Multiple Threads", which is sort of like SIMD but better. An instruction comes in and is distributed to multiple execution lanes, each of which has a hopper of instructions it can draw from. Each lane then executes those instructions independently, and if a lane notices that an instruction has been predicated out it just ignores it. Or maybe if one lane is behind, it can give the instruction to a neighbor which isn't. The fact that you don't have to be able to drop everything in a single cycle and pretend you were executing perfectly in order gives the hardware a lot of flexibility, and all of this complexity is only O(n) in the number of execution units instead of O(n^2) as with a typical OoO setup. The need to have precise exceptions involving SIMD instructions means that this isn't a simple thing for a regular CPU core to add to its SIMD units.
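To make the predication part concrete, here's a toy software model of it. The names (Warp, predicated_add, a warp width of 32) are mine for illustration, not anything the hardware actually exposes: the point is just that the instruction is issued once and each lane decides on its own whether to act on it.

    #include <array>
    #include <cstdio>

    constexpr int kLanes = 32;             // warp width; 32 is what Nvidia uses

    struct Warp {
        std::array<float, kLanes> reg{};   // one register per lane
        std::array<bool, kLanes>  pred{};  // per-lane predicate bit

        // The "instruction" is issued once; each lane applies it independently.
        void predicated_add(float operand) {
            for (int lane = 0; lane < kLanes; ++lane) {
                if (pred[lane])            // predicated out? the lane just ignores it
                    reg[lane] += operand;
            }
        }
    };

    int main() {
        Warp w;
        for (int i = 0; i < kLanes; ++i) {
            w.reg[i]  = static_cast<float>(i);
            w.pred[i] = (i % 2 == 0);      // e.g. the outcome of a divergent branch
        }
        w.predicated_add(100.0f);          // one issue, per-lane execution
        std::printf("lane 0: %g, lane 1: %g\n", w.reg[0], w.reg[1]);  // 100 vs 1
    }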
The fact that each lane makes decisions about when to execute the instructions it has been issued is why some people refer to the lanes as "cores". I don't, because what they're doing isn't any more complicated than what the 8 reservation stations in a typical Tomasulo-style OoO CPU core would be doing, even if they are smarter than a plain SIMD lane. With GPUs it makes more sense to break down "cores" by instruction issue, in which case a high-end GPU would have dozens rather than thousands of cores.
I think GPUs do have faulting mechanisms, if that's what you meant by "precise interrupts". How else would they bus master page memory over PCIe?
> Each lane then executes those instructions independently and if the lane notices that the instruction has been predicated out it just ignores it.
This is exactly what you do on CPUs as well. In SSE/AVX you often mask unwanted results instead of branching, just like on GPUs. AVX has 8 single-precision lanes, 16 with AVX-512 on Skylake server parts.
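For the curious, this is roughly what that masking looks like with AVX intrinsics. The example itself is mine (an if/else on x > 0 turned into compute-both-and-blend), but the intrinsics are the standard ones from immintrin.h; build with something like -mavx:

    #include <immintrin.h>
    #include <cstdio>

    int main() {
        float in[8] = {-3, -2, -1, 0, 1, 2, 3, 4};
        float out[8];

        __m256 x    = _mm256_loadu_ps(in);
        __m256 zero = _mm256_setzero_ps();

        // mask is all-ones in lanes where x > 0, all-zeros elsewhere
        __m256 mask = _mm256_cmp_ps(x, zero, _CMP_GT_OQ);

        __m256 doubled = _mm256_mul_ps(x, _mm256_set1_ps(2.0f));  // "then" result
        __m256 negated = _mm256_sub_ps(zero, x);                  // "else" result

        // keep "doubled" where the mask is set, "negated" where it isn't
        __m256 result = _mm256_blendv_ps(negated, doubled, mask);

        _mm256_storeu_ps(out, result);
        for (int i = 0; i < 8; ++i)
            std::printf("%g ", out[i]);   // prints: 3 2 1 0 2 4 6 8
        std::printf("\n");
    }

Both "branches" get computed for every lane; the blend just throws away the results you didn't want.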
Regarding faulting mechanisms, if you've got a discrete GPU on a PCIe bus, then it's a separate piece of silicon that handles the bus protocol. The important point is that I don't believe the GPU cores have to be able to halt execution and save their state at an arbitrary instruction.
It's certainly true that SIMD instructions on CPUs have predication, which saves you a lot of trouble. The difference is that if you have two instructions which are predicated in disjoint ways, a SIMT machine can execute them both in the same cycle, while a SIMD machine has to spend one cycle on each. You can look at Dylan16807's link for all the details.
GPUs don't support precise exceptions. For example, you can't take a GPU program that contains a segfault, run it as a standard program (as in, not in a debug mode), and be presented with the exact instruction that generated the fault.
> If we eventually start doing graphics on CPUs it will be because we move to ray tracing or other sorts of algorithms that GPUs are bad at.
Even for ray tracing, I think you'll get better results on a modern GPU. Each individual processing element may not be anywhere near as good at dealing with the constant branch prediction failures inherent in ray tracing algorithms as a modern CPU core is, but there are thousands of them for every CPU core you have, and millions of completely independent samples to compute in a ray-traced image. But I suspect that by the time realtime ray tracing becomes viable, the HSA approach I outlined above will be viable as well, and that will be the way to go.
I usually call Nvidia's SMX a core. Nvidia's warp/hardware-thread architecture is certainly interesting, but one has to remember it's mostly there to get around memory bottlenecks. All those contexts also need a lot of gates. GDDR5 has high throughput but an order of magnitude higher latency.
Ray-tracing doesn't really buy much unless you're doing some specific subset of effects. Rasterization covers 99% of cases with higher efficiency.