I had a pretty big post going, but vardump and Symmetry conveyed the gist of wha...

Marat_Dukhan · on Jan 12, 2015

>>Modern GPUs don't even have SIMD units per core any more; both NVidia and AMD are completely scalar now.

This is factually incorrect. Let me quote Vasily Volkov [1]: "Earlier GPUs had a 2-level SIMD architecture — an SIMD array of processors, each operating on 4-component vectors. Modern GPUs have a 1-level SIMD architecture — an SIMD array of scalar processors. Despite this change, it is the overall SIMD architecture that is important to understand."

[1] http://parlab.eecs.berkeley.edu/sites/all/parlab/files/LU,%2...

ANTSANTS · on Jan 12, 2015

I'm aware of AMD's terminology here, and I disagree with it. They say "SIMD array of scalar processors" where "SIMT" makes much more logical sense to me. I believe the only reason that they maintain this "2-level vs. 1-level SIMD" distinction is because NVidia coined the term SIMT, so they refuse to use it due to marketing concerns/some possible trademark or other IP law liability/pride.

...and now I read your paper, and I need to eat humble pie, because it apparently predates NVidia's use of the term "SIMT." Apparently, even if it makes more logical sense, SIMT is the buzzword!

Either way, I think it's a stretch to say I'm "factually incorrect" just because I used different terminology. Whether you think of it as an "SIMD array of scalar processors", or (to use NVidia terminology) an "SIMT warp of scalar threads" the important thing here is that the execution units in the bundle with the shared instruction pointer operate on scalar values now, rather than 4-element vectors.

vardump · on Jan 12, 2015

What GPU vendors call cores aren't. It's like calling hair brush pins as hair brushes.

I think something should be called "core" if it can branch instruction stream. Predication doesn't count. A single core can run multiple instruction streams (GPUs, Intel Hyperthreading).

ANTSANTS · on Jan 12, 2015

> What GPU vendors call cores aren't. It's like calling hair brush pins as hair brushes.

I agree. In the past I posted a decent (if I do say so myself) introduction to hyper threading and SIMT here: https://news.ycombinator.com/item?id=8245360

I simplified and just said "core" in this post because it was getting a bit long in the tooth already.

im2w1l · on Jan 12, 2015

You can compile different files in parallel. You can probably even compile a single file in parallel. Consider this: If you start reading in the middle of a source file, can you make sense of what you see, or is it incomprehensible garbage unless you start at the top?

The answer is that it mostly makes sense except you may be tricked by #defines for example. Thus you could probably "optimistically parse" small chunks of the source code, all in parallel, with the understanding that you may have to throw out some intermediate results in light of new understandings.

ANTSANTS · on Jan 12, 2015

Yeah, of course there are tasks that can be executed with coarse-grained parallelism, that can take advantage of multiple threads. It's not embarrassingly parallel though like with graphics, where you have millions of pixels that can be shaded independently of all the other pixels in an image, per rendering pass. In an optimizing compiler, compiled functions are not independent of each other, because data flow analysis will affect code generation, so you can't just compile each function on a different thread and combine the results. At least on a per-file basis, compilation has an inherently serial bottleneck, and that's not considering that 1. not all languages have separate compilation and linking like the C family, and 2. even with whole program optimization disabled, the linking phase can be fairly complicated in modern linker, so you have yet another serial bottleneck. And if you're implementing an non-optimizing compiler, then even with the naive, completely serial compiler, the whole process won't take long enough for it to be worth going to this trouble.

seanmcdirmid · on Jan 13, 2015

Glitch works this way. It will simply redo the tree when better data is available (via dependency tracing); so you can parse, type check various parts within and between files using what is essentially optimistic parallelism.

This is quite useful for non regular problems that are otherwise difficult to parallelize.

esmi · on Jan 12, 2015

I'm not so sure this is true in general. If you believe chip works, it's definitely false for Apple's A8. http://www.anandtech.com/show/8562/chipworks-a8

ANTSANTS · on Jan 12, 2015

I should have been clear, I'm talking about Intel and AMD's monster desktop and server-class CPUs here, not mobile. For that matter, mobile GPUs are very different from desktop/console/HPC-class GPUs as well.