I had a pretty big post going, but vardump and Symmetry conveyed the gist of what I was going for much more succinctly, so I'll summarize:[1]
GPU cores are tiny because the problems they deal with are "embarrassingly parallel", trivially solved by throwing more cores at the problem. You make the cores as simple as possible so you can have thousands of them on a chip. Modern GPUs don't even have SIMD units per core any more; both NVidia and AMD are completely scalar now. You'd think that graphics would be the perfect scenario for SIMD, since shaders spend so much time dealing with 3- and 4-component vectors, transformation matrices, colors, and so on, but it worked out that the gain in throughput and instruction density per core was outweighed by the power, heat, and die-area cost of having parts of those thousands of SIMD units sit idle whenever the data doesn't fill a whole SIMD register. And because context switches are much rarer on GPUs than on CPUs, they can have extremely deep pipelines that push compute efficiency even further, at the expense of context-switch latency.
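To make that concrete, here's a minimal sketch of the execution model in CUDA (purely illustrative, not vendor documentation): each scalar thread shades one pixel, and the 4 color components are just four ordinary scalar multiplies. The lockstep "SIMD" happens across the threads of a warp, not inside any one thread, so nothing sits idle when the data isn't a neat 4-vector.

    // Minimal sketch of the SIMT model described above: one scalar thread
    // per pixel; the 4 color components are plain scalar math. Utilization
    // comes from having millions of pixels, not from per-core vector units.
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void scale_colors(float4 *pixels, float gain, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one scalar thread per pixel
        if (i < n) {
            float4 c = pixels[i];
            c.x *= gain;   // four independent scalar multiplies; the lockstep
            c.y *= gain;   // execution happens across the warp's threads,
            c.z *= gain;   // not inside this one thread
            c.w *= gain;
            pixels[i] = c;
        }
    }

    int main() {
        const int n = 1 << 20;
        std::vector<float4> host(n, make_float4(0.5f, 0.5f, 0.5f, 1.0f));
        float4 *dev;
        cudaMalloc(&dev, n * sizeof(float4));
        cudaMemcpy(dev, host.data(), n * sizeof(float4), cudaMemcpyHostToDevice);
        scale_colors<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);
        cudaMemcpy(host.data(), dev, n * sizeof(float4), cudaMemcpyDeviceToHost);
        printf("pixel 0 red: %f\n", host[0].x);  // 1.0
        cudaFree(dev);
        return 0;
    }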
CPU cores are, if I may, "fuckhuge", because by and large they can't solve their problems by throwing more cores at them. They take on problems that are inherently serial and branch-heavy, like compiling a program or optimally compressing a large file, and throw bigger cores at them (out-of-order execution, branch prediction, speculative execution, multiple ALUs per core, instruction schedulers that exploit the parallelism hidden in the serial instruction stream, large register files only visible to the microarchitecture, etc.) while maintaining a fairly short pipeline so that branch prediction failures and context switches don't take too long to recover from. SIMD fits in well here, because the cost of bigger and bigger ALUs is pretty much insignificant compared to all the other hardware that goes into a high-end CPU core. It can be a pain to optimize for, but it's great for middle-ground tasks that need both heavy parallel and serial/branch-heavy computing resources with little latency between the two, like video compression.
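To illustrate that middle ground (a made-up example, not anything lifted from a real encoder): motion estimation in video compression boils down to sums of absolute differences over small pixel blocks, wrapped in very branchy search logic, and x86 has had an instruction for that inner piece since SSE2.

    // Sketch: sum of absolute differences over a 16-byte row of pixels, the
    // inner loop of block-matching motion estimation. The branchy search
    // strategy around it stays ordinary scalar code on the same core.
    #include <cstdint>
    #include <cstdio>
    #include <emmintrin.h>   // SSE2 intrinsics

    // SAD of two 16-pixel rows using PSADBW (_mm_sad_epu8).
    static uint32_t row_sad16(const uint8_t *a, const uint8_t *b) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));
        __m128i sad = _mm_sad_epu8(va, vb);      // two partial sums (words 0 and 4)
        return _mm_extract_epi16(sad, 0) + _mm_extract_epi16(sad, 4);
    }

    int main() {
        uint8_t a[16], b[16];
        for (int i = 0; i < 16; ++i) { a[i] = i; b[i] = 2 * i; }
        printf("SAD = %u\n", row_sad16(a, b));   // 0 + 1 + ... + 15 = 120
        return 0;
    }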
So you can't make one as good at solving the tasks of the other without making it worse at the task it's already intended for. Intel tried to do this with Larrabee, which was essentially a few dozen simple Pentium-derived x86 cores on a chip. Very interesting, but unsuccessful in the market, because it was an expensive niche product that was worse than either a CPU or a GPU at the majority of their respective workloads. What's really interesting, however, is the direction that AMD is taking. Rather than pushing SIMD like Intel, they're trying with their "Heterogeneous System Architecture" effort to break down the communication barrier between CPUs and GPUs so that sharing data between the two is as easy as passing a pointer. SIMD is still great on CPUs for dealing with higher-level primitives like vectors, matrices, or image macroblocks, but I can easily see the aforementioned middle ground shifting to CPU threads working in close concert with GPU worker threads, as opposed to the mostly hands-off, one-way street common in e.g. 3D rendering today.
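Here's roughly what "as easy as passing a pointer" looks like, using CUDA managed memory as a stand-in because it's the closest thing I can show in a few lines; HSA's shared virtual memory has the same flavor, but this is not AMD's API.

    // Sketch of the "just pass a pointer" model: the CPU and GPU touch the
    // same allocation through the same pointer, with no explicit copies.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void gpu_increment(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;                   // GPU side of the shared work
    }

    int main() {
        const int n = 1024;
        int *data;
        cudaMallocManaged(&data, n * sizeof(int)); // one pointer, visible to both
        for (int i = 0; i < n; ++i) data[i] = i;   // CPU writes...
        gpu_increment<<<(n + 255) / 256, 256>>>(data, n);  // ...GPU updates...
        cudaDeviceSynchronize();
        printf("data[10] = %d\n", data[10]);       // ...CPU reads: 11
        cudaFree(data);
        return 0;
    }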
[1] On second thought, this didn't turn out to be much of a summary, did it?
>>Modern GPUs don't even have SIMD units per core any more; both NVidia and AMD are completely scalar now.
This is factually incorrect. Let me quote Vasily Volkov [1]: "Earlier GPUs had a 2-level SIMD architecture — an SIMD array of processors, each operating on 4-component vectors. Modern GPUs have a 1-level SIMD architecture — an SIMD array of scalar processors. Despite this change, it is the overall SIMD architecture that is important to understand."
I'm aware of AMD's terminology here, and I disagree with it. They say "SIMD array of scalar processors" where "SIMT" makes much more logical sense to me. I believe the only reason they maintain this "2-level vs. 1-level SIMD" distinction is that NVidia coined the term SIMT, so they refuse to use it out of marketing concerns, possible trademark or other IP liability, or pride.
...and now I read your paper, and I need to eat humble pie, because it apparently predates NVidia's use of the term "SIMT." Apparently, even if it makes more logical sense, SIMT is the buzzword!
Either way, I think it's a stretch to say I'm "factually incorrect" just because I used different terminology. Whether you think of it as an "SIMD array of scalar processors" or (to use NVidia terminology) an "SIMT warp of scalar threads", the important thing here is that the execution units in the bundle sharing an instruction pointer now operate on scalar values, rather than 4-element vectors.
What GPU vendors call cores aren't really cores. It's like calling the pins of a hairbrush "hairbrushes".
I think something should only be called a "core" if it can branch its own instruction stream. Predication doesn't count. A single core can still run multiple instruction streams (GPUs, Intel Hyper-Threading).
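To see the distinction concretely (my own sketch, not any vendor's definition): give each lane a data-dependent branch. A real core simply follows one path; lanes that share an instruction pointer with the rest of their warp end up walking through both arms with the inactive lanes masked off.

    // Sketch: a data-dependent branch inside a kernel. Lanes within one warp
    // take different directions, so the warp executes BOTH arms, masking off
    // the lanes that didn't take each one -- predication rather than
    // independent branching, which is the distinction drawn above.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void divergent(int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)           // even lanes run this arm...
            out[i] = i * 10;
        else                      // ...then the same warp runs this arm for
            out[i] = i * 100;     // the odd lanes, with the even ones masked
    }

    int main() {
        const int n = 64;
        int *out;
        cudaMallocManaged(&out, n * sizeof(int));
        divergent<<<1, 64>>>(out, n);
        cudaDeviceSynchronize();
        printf("out[2] = %d, out[3] = %d\n", out[2], out[3]);  // 20, 300
        cudaFree(out);
        return 0;
    }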
You can compile different files in parallel. You can probably even compile a single file in parallel. Consider this: If you start reading in the middle of a source file, can you make sense of what you see, or is it incomprehensible garbage unless you start at the top?
The answer is that it mostly makes sense, except that you may be tricked by #defines, for example. Thus you could probably "optimistically parse" small chunks of the source code, all in parallel, with the understanding that you may have to throw out some intermediate results in light of new information.
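Something like this toy sketch, which is only meant to show the shape of the idea and has nothing to do with any real compiler:

    // Toy "optimistic parsing" sketch: scan chunks of a source buffer in
    // parallel and treat each chunk's result as provisional. A real
    // implementation would re-scan any chunk whose assumptions (macro
    // definitions, an open comment, a token split at a chunk boundary, ...)
    // turn out to be wrong once the earlier chunks are fully understood.
    #include <algorithm>
    #include <cctype>
    #include <cstdio>
    #include <future>
    #include <string>
    #include <vector>

    struct ChunkResult {
        int identifiers = 0;     // toy metric standing in for real parse output
        bool provisional = true; // may be thrown away and redone later
    };

    static ChunkResult scan_chunk(const std::string &text, size_t begin, size_t end) {
        ChunkResult r;
        bool in_word = false;
        for (size_t i = begin; i < end; ++i) {
            bool w = std::isalpha(static_cast<unsigned char>(text[i])) || text[i] == '_';
            if (w && !in_word) ++r.identifiers;
            in_word = w;
        }
        return r;
    }

    int main() {
        std::string src = "#define N 4\nint add(int a, int b) { return a + b; }\n";
        const size_t chunks = 4, step = (src.size() + chunks - 1) / chunks;
        std::vector<std::future<ChunkResult>> jobs;
        for (size_t c = 0; c < chunks; ++c) {
            size_t b = c * step, e = std::min(src.size(), b + step);
            jobs.push_back(std::async(std::launch::async,
                                      [&src, b, e] { return scan_chunk(src, b, e); }));
        }
        int total = 0;                                // serial "reconcile" step
        for (auto &j : jobs) total += j.get().identifiers;
        printf("identifiers found (provisionally): %d\n", total);
        return 0;
    }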
Yeah, of course there are tasks that can be executed with coarse-grained parallelism and take advantage of multiple threads. It's not embarrassingly parallel, though, the way graphics is, where you have millions of pixels that can be shaded independently of all the other pixels in an image, per rendering pass. In an optimizing compiler, compiled functions are not independent of each other, because data flow analysis will affect code generation, so you can't just compile each function on a different thread and combine the results. At least on a per-file basis, compilation has an inherently serial bottleneck, and that's before considering that 1. not all languages have separate compilation and linking like the C family, and 2. even with whole-program optimization disabled, the linking phase can be fairly complicated in modern linkers, so you have yet another serial bottleneck. And if you're implementing a non-optimizing compiler, then even a naive, completely serial implementation won't take long enough for this to be worth the trouble.
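To put a rough number on that serial bottleneck (Amdahl's law with made-up proportions, purely illustrative): if the link/codegen tail is 20% of a build, sixteen cores can never get you much past a 4x speedup.

    // Amdahl's law for a build with a serial tail (illustrative numbers only):
    // speedup(N) = 1 / (s + (1 - s) / N), where s is the serial fraction.
    #include <cstdio>

    int main() {
        const double s = 0.20;                       // assumed serial fraction (linking etc.)
        for (int cores = 1; cores <= 64; cores *= 4) {
            double speedup = 1.0 / (s + (1.0 - s) / cores);
            printf("%2d cores -> %.2fx\n", cores, speedup);
        }
        return 0;                                    // tops out near 1/s = 5x
    }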
Glitch works this way. It will simply redo the tree when better data is available (via dependency tracing), so you can parse and type-check various parts within and between files using what is essentially optimistic parallelism.
This is quite useful for irregular problems that are otherwise difficult to parallelize.
I should have been clear, I'm talking about Intel and AMD's monster desktop and server-class CPUs here, not mobile. For that matter, mobile GPUs are very different from desktop/console/HPC-class GPUs as well.