I tried CUDA programming as well and found that the threading model and such is an utter nightmare. Certain tasks are quite easy, such as upscaling or filtering an image, but those are exactly the kind of tasks where you'll end up bandwidth-limited anyway. The complex tasks where you actually make full use of the GPU's processors are often the exact kind of situations where the API will work against you every step of the way.
The "960 cores" moniker is also very misleading, as last I recall there were really only (960/8) cores, with each core being able to run 8 instructions at the same if all the instructions were exactly the same.
What are some of these complex tasks you had in mind? I think that almost any computation-bound task, however complex, could benefit from GPU acceleration. But I would like to hear about exceptions.
A motion search, for video compression, is what I was trying.
It has some huge disadvantages:
1. The threading model is completely unsuited to a search that takes a different number of iterations per block (since it wants all the threads doing the same thing).
2. CPUs already have the PSADBW instruction, which allows an absurd effective throughput: it's literally an instruction dedicated to this kind of task. It's also a purely 8-bit integer problem, so it doesn't benefit from the high-performance floating-point units on the GPU.
Due to 1), you're pretty much restricted to crippling your performance, using a very simplified search, or using an exhaustive search. And if you're using an exhaustive search, it turns out that there's a mathematically equivalent and vastly faster way to do it, called "sequential elimination"... which is also completely impractical to implement on a GPU due to its linear nature, and which allows a Core 2 to vastly outperform a GPU and possibly even be competitive with a dedicated FPGA doing a normal exhaustive search.
Hmmm... OK. Mind posting a link to the pseudo-code of the algorithm? The problem of a different number of iterations per thread is quite common, but it can often be fixed.
If all threads in a block have the same number of iterations, you're fine; different blocks can take different numbers of iterations. As long as the number of blocks is much higher than the number of SMs, the machine will dynamically schedule the blocks onto available SMs.
If each thread takes a different number of iterations, it's more difficult, but you can still do dynamic allocation yourself, by running as many blocks as you have SMs, and having each thread pick up work as it needs. You can also randomize the assignment of work to threads, so the expected amount of work is roughly the same. This all depends on the particular problem and data, though...
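To illustrate, here's a minimal CUDA sketch of that kind of self-managed allocation (the work_items array and the per-item processing are stand-ins for whatever the real task is): launch as many blocks as there are SMs, and let each block pull items off a global counter, so blocks whose items finish quickly simply grab more of them.

    // Hypothetical sketch: persistent blocks pulling work off a global
    // counter. next_item must be zeroed (cudaMemcpyToSymbol) before launch.
    __device__ unsigned int next_item;

    __global__ void persistent_kernel(const int *work_items,
                                      unsigned int num_items)
    {
        __shared__ unsigned int item;
        for (;;) {
            // One thread per block claims the next unit of work.
            if (threadIdx.x == 0)
                item = atomicAdd(&next_item, 1u);
            __syncthreads();
            if (item >= num_items)
                return;  // no work left; the whole block exits together
            // ... all threads in the block cooperate on work_items[item],
            // iterating however many times this particular item needs ...
            __syncthreads();  // don't reclaim 'item' until everyone is done
        }
    }

The launch would then be persistent_kernel<<<num_sms, threads_per_block>>>(items, n); the same idea works per-warp or per-thread if the work units are small enough.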
As for the 8-bit nature of the problem - this is true, you're not utilizing the floating point units that are the biggest advantage of the GPU. How large are the vectors you need to do PSADBW over?
> If all threads in a block have the same number of iterations, you're fine; different blocks can take different numbers of iterations. As long as the number of blocks is much higher than the number of SMs, the machine will dynamically schedule the blocks onto available SMs.
At this point the programmer's head has already exploded. MIMD systems (like OpenMP on Larrabee) don't have any of this BS.
No, Larrabee will have exactly the same issues. Except threads will be called fibers, and blocks will be called threads. Read the Larrabee paper from Siggraph 08.
The way to get maximum performance out of a given chip area is to use SIMD, and that's here to stay, with all the associated issues.
The Larrabee SIMD width will be 16 and NVIDIA's is currently 32, so that's almost the same.
Not just the rasterizer, but any application that wants to take full advantage of Larrabee will have to use the SIMD vector units to the max.
Larrabee might turn out to have great performance (which I hope), but if it does, the reason will not be black magic or breaking laws of physics. The reason will be SIMD.
For some reason it's easier for me to wrap my head around one thread driving a 16-wide SIMD unit than 32 threads that execute in lockstep. I know it ends up being equivalent but it feels different.
Also, on Larrabee you can execute a different kernel on each core, while on GPUs you can't.
A relatively simple motion search, known as "EPZS" or "Diamond", is as follows.
Let us define the SAD function as follows: it takes the source block to be compared, and a candidate block location in a reference frame. Each block is a 16x16 array of 8-bit unsigned values with a large stride (since it's smack in the middle of a much larger reference image). SAD(mx,my) means: compare the current block to the reference block located at <mx,my>, where <0,0> is the colocated block in the reference frame. The function itself sums up, for each 8-bit value (x from 0 to 255), abs(source[x]-ref[x]).
Note that SAD is almost always run on unaligned data (well, 15/16 of the time). For speed purposes, SADs are usually done in groups of 4. With this method, on modern CPUs, assuming the data is in cache, an unaligned 16x16 SAD takes about 38 clocks on a Core 2. (Don't have the Phenom numbers on me, but it's around the high 30s there too.)
Also note that in practice, the SAD function adds, at the end, the approximate bit cost of the motion vector to the candidate score. This is very important, because that bit cost depends on the vector's difference from the predicted motion vector.
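For reference, the core of such a SAD function, written with the SSE2 intrinsic for PSADBW (_mm_sad_epu8), looks roughly like the sketch below; the motion-vector bit cost would be added by the caller, and the stride arguments are whatever the frame layout dictates:

    #include <emmintrin.h>
    #include <stdint.h>

    // Sketch of an unaligned 16x16 SAD built on PSADBW (_mm_sad_epu8).
    static int sad_16x16(const uint8_t *src, int src_stride,
                         const uint8_t *ref, int ref_stride)
    {
        __m128i acc = _mm_setzero_si128();
        for (int y = 0; y < 16; y++) {
            __m128i s = _mm_loadu_si128((const __m128i *)(src + y * src_stride));
            __m128i r = _mm_loadu_si128((const __m128i *)(ref + y * ref_stride));
            // PSADBW: sums |s[i]-r[i]| into two 64-bit halves of acc.
            acc = _mm_add_epi64(acc, _mm_sad_epu8(s, r));
        }
        // Combine the two partial sums.
        return _mm_cvtsi128_si32(acc) +
               _mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
    }

Doing four of these at once mostly amounts to interleaving the loads and keeping four accumulators, which is what hides the latency well enough to reach those clock counts.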
Diamond search:
For the current macroblock:
1. Start at the predicted motion vector. Let this value be <mx,my>. Set bsad equal to SAD(mx,my).
2. Set bsad equal to the lowest of SAD(mx-1,my), SAD(mx+1,my), SAD(mx,my-1), SAD(mx,my+1), and the previous bsad.
3. Set the new mx and my equal to the ones which gave the lowest SAD value.
4. If mx and my did not change in 3), terminate. Otherwise, GOTO 2.
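In code, the loop looks something like the following sketch, where sad_at() is a hypothetical wrapper around a 16x16 SAD (like the one above) that also adds the motion-vector bit cost:

    // Sketch of the diamond search described above. sad_at(mx, my) is
    // assumed to return SAD(mx,my) plus the motion-vector bit cost.
    int sad_at(int mx, int my);

    void diamond_search(int pred_mx, int pred_my, int *out_mx, int *out_my)
    {
        static const int dx[4] = { -1, 1, 0, 0 };
        static const int dy[4] = { 0, 0, -1, 1 };
        int mx = pred_mx, my = pred_my;
        int bsad = sad_at(mx, my);                  // step 1
        for (;;) {
            int best = -1;
            for (int i = 0; i < 4; i++) {           // step 2
                int s = sad_at(mx + dx[i], my + dy[i]);
                if (s < bsad) { bsad = s; best = i; }
            }
            if (best < 0)                           // step 4: no change
                break;
            mx += dx[best];                         // step 3
            my += dy[best];
        }
        *out_mx = mx;
        *out_my = my;
    }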
This has two non-CUDA-friendly portions to it:
1. You must know the predicted motion vector: you can't simply search all blocks independently, as the predicted motion vector depends on the top, left, top-right, and top-left blocks relative to the current one.
2. Some blocks might need 1 iteration, some might need 15.
It gets worse when you realize how damn fast the CPU can do SADs (38 clocks to process 512 pixels, on a single core!). Note that using the exhaustive search I mentioned above, sequential elimination, a modern CPU can get this value down to an "effective" 5-6 clocks per 16x16 block SAD.
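In case it isn't obvious how sequential elimination works (assuming I have the right algorithm in mind; it's usually published as "successive elimination"): by the triangle inequality, SAD(mx,my) can never be smaller than |sum(source block) - sum(candidate block)|, and the candidate block sums can be precomputed over the whole reference frame very cheaply. So most candidates die on a single compare, roughly like this (motion-vector bit cost ignored for simplicity):

    #include <limits.h>
    #include <stdlib.h>

    int sad_at(int mx, int my);        // real 16x16 SAD, as before
    int block_sum_at(int mx, int my);  // hypothetical: precomputed sum of
                                       // the candidate 16x16 block

    // src_sum is the sum of the current block's 256 pixels.
    void sea_search(int src_sum, int range, int *out_mx, int *out_my)
    {
        int bsad = INT_MAX, bmx = 0, bmy = 0;
        for (int my = -range; my <= range; my++) {
            for (int mx = -range; mx <= range; mx++) {
                // Lower bound: SAD(mx,my) >= |src_sum - candidate sum|.
                if (abs(src_sum - block_sum_at(mx, my)) >= bsad)
                    continue;  // rejected without computing a real SAD
                int s = sad_at(mx, my);
                if (s < bsad) { bsad = s; bmx = mx; bmy = my; }
            }
        }
        *out_mx = bmx;
        *out_my = bmy;
    }

Note that the rejection test depends on the running best score, so later candidates are only cheap because earlier ones have already been scanned; that serial dependence is exactly what makes it so hostile to a GPU.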
Now, from my reverse engineering based on output bitstreams, I know approximately how Badaboom, the (rather awful) nVidia CUDA GPU encoder, does its motion search.
1. Set n = 16.
2. Downscale the image by a factor of n.
3. Do the above algorithm on this image.
4. Divide n by 2.
5. Downscale the image by a factor of n.
6. Each motion vector from before now represents the vector on 4 blocks.
7. Refine the motion vectors on these four blocks by a constant X number of iterations of diamond search (I'm guessing one or two).
8. If n isn't 1, GOTO 4.
Note how much more GPU-friendly this algorithm is. It's called a pyramidal search. It has many disadvantages, though, such as producing many false motion vectors: the highly-downscaled search can move a whole bunch of blocks in a given direction due to some small motion in the center of that bunch, and that vector never gets corrected back to normal in the lower-level searches.
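For clarity, here's that reconstruction written out as a loop; since this is reverse-engineered, every name below is my own stand-in for whatever Badaboom actually does internally:

    // Sketch of the reconstructed pyramidal search. image, downscale(),
    // diamond_search_full() and refine_diamond() are all hypothetical.
    typedef struct image image;
    image *downscale(const image *frame, int factor);
    void diamond_search_full(image *level);            // full diamond search
    void refine_diamond(image *level, int iterations); // fixed-count refine

    void pyramidal_search(const image *frame)
    {
        for (int n = 16; n >= 1; n /= 2) {
            image *level = downscale(frame, n);   // steps 2 and 5
            if (n == 16)
                diamond_search_full(level);       // step 3, coarsest level
            else
                // Steps 6-7: each coarser vector seeds 4 blocks here,
                // each refined by a fixed 1-2 diamond iterations.
                refine_diamond(level, 2);
        }
    }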
The "960 cores" moniker is also very misleading, as last I recall there were really only (960/8) cores, with each core being able to run 8 instructions at the same if all the instructions were exactly the same.