NVIDIA Announces Tesla Personal Supercomputer (hpcwire.com)
21 points by jaydub on Nov 18, 2008 | 28 comments



I took a look at the Tesla after watching an Nvidia demo with Mythbusters' Adam & Jamie doing a simple CPU vs. GPU comparison, which you can see here: http://www.youtube.com/watch?v=fKK933KK6Gg

Firstly, you can run the thing as either a card (cheaper, slower) or a standalone machine (expensive; nobody lists the price). The CUDA toolkit, which is C-based, is Win/Linux 32/64-bit compatible and available for most mainstream distros: http://www.nvidia.com/object/cuda_learn.html

"... The Tesla architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs) ..."

If you think this is similar to mainstream development, think again. Programming this machine is right down to the metal and reminds me of programming the PS2 and other specialised consoles. You need to spend some time reading the hardware manuals and understanding the architecture to use the machine to its full capability.


I had some friends who tried to port the Linpack benchmark to a small cluster of computers, each with 8 NVIDIA GPUs, using CUDA, and they found that the biggest bottleneck was the bandwidth to and from the GPUs. It's just hard to keep the GPUs fed with enough data. They confirmed that the GPUs are both hard to program and incredibly powerful.


I tried CUDA programming as well and found the threading model and such to be an utter nightmare. For certain tasks it's quite easy, such as upscaling or filtering an image. However, those are exactly the kinds of tasks where you'll end up bandwidth-limited anyway. The complex tasks where you actually make full use of the GPU's processors are often exactly the situations where the API works against you every step of the way.

The "960 cores" moniker is also very misleading, as last I recall there were really only (960/8) cores, with each core being able to run 8 instructions at the same if all the instructions were exactly the same.


What are some of these complex tasks you had in mind? I think that almost any computation-bound task, however complex, could benefit from GPU acceleration. But I would like to hear about exceptions.


A motion search, for video compression, is what I was trying.

It has some huge disadvantages:

1. The threading model is completely unsuited to a search that takes a different number of iterations per block (since it wants all the threads doing the same thing).

2. CPUs already have the PSADBW instruction, which allows an absurd effective throughput: it's literally an instruction dedicated to this kind of task. It's also a purely 8-bit integer problem, so it doesn't benefit from the high-performance floating point units on the GPU.

Due to 1), you're pretty much restricted to crippling your performance, using a very simplified search, or using an exhaustive search. And if you use an exhaustive search, it turns out there's a mathematically equivalent and vastly faster way to do it called "sequential elimination"... which is also completely impractical to implement on a GPU due to its inherently sequential nature, and which lets a Core 2 vastly outperform a GPU and possibly even be competitive with a dedicated FPGA doing a normal exhaustive search.


Hmmm... OK. Would you mind posting a link to the pseudo-code of the algorithm? The problem of a different number of iterations per thread is quite common, but it can often be fixed.

If all threads in a block have the same number of iterations, you're fine; different blocks can take different numbers of iterations. As long as the number of blocks is much higher than the number of SMs, the machine will dynamically schedule the blocks to available SMs.

If each thread takes a different number of iterations, it's more difficult, but you can still do the dynamic allocation yourself, by running as many blocks as you have SMs and having each thread pick up work as needed. You can also randomize the assignment of work to threads, so the expected amount of work is roughly the same. This all depends on the particular problem and data, though...
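
For what it's worth, here's a minimal sketch of that "pick up work as needed" idea in CUDA (a persistent-threads/work-queue pattern). The per-item work is just squaring one row of a matrix, purely as a placeholder; the point is the atomicAdd counter that lets blocks which finish early claim more rows:

  #include <cstdio>

  // Each block repeatedly claims the next row off a global counter, so
  // blocks that finish early simply grab more work.
  __global__ void worker(float *data, int n_rows, int row_len, int *next_row)
  {
      __shared__ int row;
      for (;;) {
          if (threadIdx.x == 0)
              row = atomicAdd(next_row, 1);   // claim the next work item
          __syncthreads();
          if (row >= n_rows)
              break;                          // whole block exits together
          for (int i = threadIdx.x; i < row_len; i += blockDim.x)
              data[row * row_len + i] *= data[row * row_len + i];
          __syncthreads();                    // finish before 'row' is reused
      }
  }

  int main(void)
  {
      const int n_rows = 1024, row_len = 256;
      float *d_data; int *d_next;
      cudaMalloc(&d_data, n_rows * row_len * sizeof(float));
      cudaMemset(d_data, 0, n_rows * row_len * sizeof(float));
      cudaMalloc(&d_next, sizeof(int));
      cudaMemset(d_next, 0, sizeof(int));
      // Launch only about as many blocks as the chip has SMs.
      worker<<<30, 128>>>(d_data, n_rows, row_len, d_next);
      cudaDeviceSynchronize();
      printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
      cudaFree(d_data); cudaFree(d_next);
      return 0;
  }

Launching only about as many blocks as there are SMs keeps every SM busy while the counter takes care of the load balancing.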

As for the 8-bit nature of the problem - this is true, you're not utilizing the floating point units that are the biggest advantage of the GPU. How large are the vectors you need to do PSADBW over?


If all threads in a block have the same number of iterations, you're fine; different blocks can take different numbers of iterations. As long as the number of blocks is much higher than the number of SMs, the machine will dynamically schedule the blocks to available SMs.

At this point the programmer's head has already exploded. MIMD systems (like OpenMP on Larrabee) don't have any of this BS.


No, Larrabee will have exactly the same issues. Except threads will be called fibers, and blocks will be called threads. Read the Larrabee paper from Siggraph 08.

The way to get maximum performance out of a given chip area is to use SIMD, and that's here to stay, with all the associated issues.


Except threads will be called fibers, and blocks will be called threads.

That's how their rasterizer works, but I'm talking about using it as a regular x86.

The way to get maximum performance out of a given chip area is to use SIMD, and that's here to stay

There's a big difference between MIMD+narrow SIMD and super wide SIMD.


The Larrabee SIMD width will be 16 and NVIDIA's is currently 32, so that's almost the same.

Not just the rasterizer, but any application that wants to take full advantage of Larrabee will have to use the SIMD vector units to the max.

Larrabee might turn out to have great performance (which I hope), but if it does, the reason will not be black magic or breaking laws of physics. The reason will be SIMD.


For some reason it's easier for me to wrap my head around one thread driving a 16-wide SIMD unit than 32 threads that execute in lockstep. I know it ends up being equivalent but it feels different.

Also, on Larrabee you can execute a different kernel on each core, while on GPUs you can't.


A relatively simple motion search, known as "EPZS" or "Diamond", is as follows.

Let us define the SAD function as follows: it takes the source block to be compared to, and a candidate block location in a reference frame. Each block is a 16x16 array of 8-bit unsigned values with a large stride (since it's smack in the middle of a much larger reference image). SAD(mx,my) means compare the current block to the reference block located at <mx,my>, where <0,0> is the colocated block in the reference frame. The function itself sums up, for each 8-bit value (x from 0 to 255), abs(source[x]-ref[x]).

Note that SAD is always run on unaligned data (well, 15/16 of the time). For speed purposes, SADs are usually done in groups of 4. With this method, on modern CPUs, assuming the data is in cache, an unaligned 16x16 SAD takes:

  Nehalem: 38 clocks
  Penryn: 49 clocks
  Conroe: 52 clocks
(Don't have the Phenom numbers on me, but it's around the high 30s.)

Also note that in practice, the SAD function, at the end, adds the approximate bit cost of the motion vector to the candidate score. This is very important, because this bit cost depends on its difference from the predicted motion vector.
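
For reference, here's roughly what that looks like on the CPU side with SSE2 intrinsics (PSADBW is exposed as _mm_sad_epu8). This is a generic sketch, not anyone's actual encoder code, and the motion-vector cost model is a made-up placeholder:

  #include <emmintrin.h>   /* SSE2: PSADBW = _mm_sad_epu8 */
  #include <stdint.h>
  #include <stdlib.h>

  /* Unaligned 16x16 SAD.  src and ref point at the top-left pixel of each
   * block inside the full frame; stride is the frame width in bytes. */
  static int sad_16x16(const uint8_t *src, const uint8_t *ref, int stride)
  {
      __m128i acc = _mm_setzero_si128();
      for (int y = 0; y < 16; y++) {
          __m128i s = _mm_loadu_si128((const __m128i *)(src + y * stride));
          __m128i r = _mm_loadu_si128((const __m128i *)(ref + y * stride));
          /* PSADBW: per-byte |s-r|, summed into two 64-bit halves. */
          acc = _mm_add_epi64(acc, _mm_sad_epu8(s, r));
      }
      return _mm_cvtsi128_si32(acc)
           + _mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
  }

  /* Placeholder bit-cost term: penalise the difference between the
   * candidate <mx,my> and the predicted vector <pmx,pmy>. */
  static int mv_cost(int mx, int my, int pmx, int pmy, int lambda)
  {
      return lambda * (abs(mx - pmx) + abs(my - pmy));
  }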

Diamond search: For the current macroblock:

  1.  Start at the predicted motion vector.  Let this value be <mx,my>.  Set bsad equal to SAD(mx,my).
  2.  Set bsad equal to the lowest of SAD(mx-1,my), SAD(mx+1,my), SAD(mx,my-1), SAD(mx,my+1), and the previous bsad.
  3.  Set the new mx and my equal to the ones which gave the lowest SAD value.
  4.  If mx and my did not change in 2), terminate.  Otherwise, GOTO 2.
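In code, a sketch of those four steps (reusing sad_16x16() and mv_cost() from the snippet above; ref points at the colocated block, and the caller is assumed to keep the search inside a padded reference frame) looks something like this:

  /* Candidate score at <mx,my>: SAD plus motion-vector bit cost. */
  static int score(const uint8_t *src, const uint8_t *ref, int stride,
                   int mx, int my, int pmx, int pmy, int lambda)
  {
      return sad_16x16(src, ref + my * stride + mx, stride)
           + mv_cost(mx, my, pmx, pmy, lambda);
  }

  static void diamond_search(const uint8_t *src, const uint8_t *ref,
                             int stride, int pmx, int pmy, int lambda,
                             int *out_mx, int *out_my)
  {
      int mx = pmx, my = pmy;                           /* step 1 */
      int bsad = score(src, ref, stride, mx, my, pmx, pmy, lambda);

      for (;;) {
          static const int dx[4] = { -1, 1, 0, 0 };
          static const int dy[4] = { 0, 0, -1, 1 };
          int best = -1;
          for (int i = 0; i < 4; i++) {                 /* step 2 */
              int s = score(src, ref, stride, mx + dx[i], my + dy[i],
                            pmx, pmy, lambda);
              if (s < bsad) { bsad = s; best = i; }
          }
          if (best < 0)                                 /* step 4 */
              break;
          mx += dx[best];                               /* step 3 */
          my += dy[best];
      }
      *out_mx = mx;
      *out_my = my;
  }
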
This algorithm has two non-CUDA-friendly portions to it:

1. You must know the predicted motion vector: you can't simply search all blocks independently, as the predicted motion vector depends on the top, left, top-right, and top-left blocks relative to the current one.

2. Some blocks might need 1 iteration, some might need 15.

It gets worse when you realize how damn fast the CPU can do SADs (38 clocks to process 512 pixels, on a single core!). Note that using the exhaustive search I mentioned above, sequential elimination, a modern CPU can get this value down to an "effective" 5-6 clocks per 16x16 block SAD.
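
For the curious, the core trick in sequential (successive) elimination, as I understand it, is a cheap lower bound: the SAD of two blocks can never be smaller than the absolute difference of their pixel sums, so any candidate whose block-sum difference already exceeds the best SAD so far can be rejected without computing the full SAD, and the reference-frame block sums can be maintained incrementally. A hedged sketch, reusing sad_16x16() from above:

  #include <limits.h>

  /* Reject a candidate using the block-sum lower bound; otherwise fall
   * back to the full SAD.  src_sum/ref_sum are precomputed 16x16 sums. */
  static int sea_sad(const uint8_t *src, const uint8_t *ref, int stride,
                     int src_sum, int ref_sum, int bsad)
  {
      if (abs(src_sum - ref_sum) >= bsad)
          return INT_MAX;          /* can't possibly beat the current best */
      return sad_16x16(src, ref, stride);
  }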

Now, from my reverse engineering based on output bitstreams, I know approximately how Badaboom, the (rather awful) nVidia CUDA GPU encoder, does its motion search.

  1.  Set n = 16.
  2.  Downscale the image by a factor of n.
  3.  Do the above algorithm on this image.
  4.  Divide n by 2.
  5.  Downscale the image by a factor of n.
  6.  Each motion vector from before now represents the vector on 4 blocks.
  7.  Refine the motion vectors on these four blocks by a constant X number of iterations of diamond search (I'm guessing one or two).
  8.  If n isn't 1, GOTO 4.
Note how much more GPU-friendly this algorithm is. It's called a pyramidal search. It has many disadvantages, though, such as the fact that it produces many false motion vectors (the highly-downscaled search can move a whole group of blocks in a given direction due to some small motion in the center of that group, and that vector never gets corrected back to normal in the lower-level searches).
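
Step 6 is the part that gives the search its GPU-friendly shape; in isolation (and purely as an illustrative sketch, not Badaboom's actual code) it's just:

  typedef struct { int x, y; } mv_t;

  /* When the downscale factor is halved, each coarse-level vector is
   * doubled and copied to the four finer-level blocks it covers; the
   * finer grid is 2*cw by 2*ch blocks. */
  static void inherit_vectors(const mv_t *coarse, int cw, int ch, mv_t *fine)
  {
      for (int y = 0; y < ch; y++)
          for (int x = 0; x < cw; x++) {
              mv_t v = { coarse[y * cw + x].x * 2, coarse[y * cw + x].y * 2 };
              fine[(2 * y)     * (2 * cw) + 2 * x]     = v;
              fine[(2 * y)     * (2 * cw) + 2 * x + 1] = v;
              fine[(2 * y + 1) * (2 * cw) + 2 * x]     = v;
              fine[(2 * y + 1) * (2 * cw) + 2 * x + 1] = v;
          }
  }

Every finer-level block then gets the same fixed number of refinement iterations (step 7), which is exactly the uniformity the CUDA threading model wants.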


OK, this will take me some time to think about, but I'll get back to you...


Protein sequence alignment implemented (open source):

http://www.biomedcentral.com/1471-2105/9/S2/S10


You mention the PS2 - is the PS3 similar? (Although it's PowerPC, there are 7 cores to use - so is it still as tricky?)


The Cell is a whole different type of nightmare, with various interesting properties:

1. Bitshifts by variable amounts take ~7 clocks each on the main PPU.

2. The SPUs have no cache whatsoever; all caching has to be done explicitly, and all memory access is DMA'd.

3. The SPUs do scalar math no faster than SIMD (in fact, from what I know, scalar math is just calling an SIMD function on a single value).

4. There is no instruction reordering on SPUs, and everything has to be synced exactly for max performance (certain instructions run on odd clock cycles, others on even cycles).

5. The SPUs are not Altivec chips. The SPU SIMD instruction set is similar to Altivec, but more versatile. Of course, this means you can't just run existing Altivec code on them.

6. Overall, the integer SIMD on SPUs is much slower than that on modern Intel processors. Not sure about float, as I have no experience in that arena.


Wow that sounds like a world of fun. For some definition of fun. The one that shocked me:

"1. Bitshifts by variable amounts take ~7 clocks each on the main PPU" If you didn't know that, that could really bite hard.


The PS3 uses the Cell processor, which is also tricky to program.


I like Tesla. He was under-appreciated in his time, brilliant, and did things just for the joy of the science.

I just wish they'd stop naming products, rock bands, breakfast cereals, etc. after him. It's starting to wear thin.


This sounds cool, but the page linked to is only a standard press release and the link to the interesting stuff is buried in the marketing-speak. The fun stuff is here: http://www.nvidia.com/object/personal_computing.html


It'll be interesting to see how much these will cost. 4 teraflops of single-precision calculation is very impressive; however, if you need double precision, the speed is not quite as impressive at 400 Gflops - although that's still very good compared to my laptop's 20 Gflops :)

The main drawback with these is that you have to use CUDA, which certainly takes a while to wrap your mind around. I played with CUDA for a while but decided it was too much effort for something that is very specialised and might not become mainstream. Still, there seem to be plenty of people using CUDA, and lots of research roles - see the CUDA forums.


It says "Available from VARs worldwide for under $10,000" in here http://www.nvidia.com/object/personal_computing.html

Now I know how it felt in the '80s when people were looking at the mainframes/desktops of the time and wondering... can I afford this $10K machine? :)


I guess I am a little skeptical about the market. First, I have to believe it is rare for an individual to need a supercomputer. Besides, any individual who actually needs one is probably going to build his/her own. Businesses might want these on a large scale, but then why market it as a personal computer?


Lots of scientists, engineers, and finance people could use these. It's personal in the sense that it's used by one person, not that people would buy it for personal use.

Also, it doesn't really matter whether you build or buy; either way all the cost is in the Tesla card and NVidia gets their money.


"They'll never sell more than 5 of those pc computers"


Reading the title, I initially thought it was something related to the electric car ...

Anyway, I guess the problem is that most programmers have no clue how to program a GPU. Programming multi-core CPUs is already very hard, and now comes the GPU stuff...


960 cores. Wow. I thought that I was in over my head trying to program 2 at a time!


What was it Joel Spolsky said about programmers counting: "It must work for 1. Oh, there's more than 1? Then it must work for any number."

;)



