Nvidia Tesla Supercomputer (960 cores) (nvidia.co.uk)
38 points by wesleyd on March 27, 2009 | 27 comments



(I can't see any of the 11 comments, so forgive me if this post is redundant in some way.)

Visual effects production often needs lots of computing resources for rendering. This is a task that can be heavily parallelized. Usually it is done on a frame-per-core basis, with memory being the limiting factor. I've started using EC2 recently, which made me realize a different approach to rendering tasks is possible.

Normally a facility has a fixed number of CPUs available to run jobs, say 100. Let's say the average shot is around 200 frames. So if each frame takes an hour to render and yours is the only job in the queue, it will take 2 hours to see the complete shot. And it's much worse in the real world, with many artists working on many shots. Whereas with a resource like EC2 you can always render your entire shot in 1 hour, regardless of the number of frames. You're only limited by the time it takes 1 frame to render, and the cost is the same whether you use 1 CPU or 200.

In other words, one can trade depth for breadth. Now this may seem obvious to anyone familiar with EC2, but for me it means that tools like Tesla, which were once in high demand for this type of work, are now much less valuable. I would expect its price/performance is much better than EC2's, but where is the cutoff? How many hours do you have to run that Tesla to come out ahead? And if you're running 1 Tesla for that many hours, might it not be worth a premium to get your answer sooner by running more massively parallel (but for less wall clock time) on an EC2-like service?
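
To put rough numbers on that cutoff question, here's a back-of-the-envelope sketch. Every figure in it is a hypothetical placeholder, not a quote, so plug in real prices before drawing conclusions:

    /* Back-of-the-envelope: how many hours of equivalent EC2 capacity
       would pay for a dedicated Tesla box?  All numbers are hypothetical
       placeholders. */
    #include <stdio.h>

    int main(void)
    {
        double tesla_price_usd    = 8000.0;  /* hypothetical purchase price         */
        double ec2_price_per_hour = 0.80;    /* hypothetical cost per instance-hour */
        double instances_to_match = 25.0;    /* hypothetical EC2 instances needed
                                                to match the Tesla's throughput     */

        double ec2_cost_per_hour = ec2_price_per_hour * instances_to_match;
        double breakeven_hours   = tesla_price_usd / ec2_cost_per_hour;

        printf("equivalent EC2 capacity: $%.2f/hour\n", ec2_cost_per_hour);
        printf("break-even after ~%.0f hours of sustained use\n", breakeven_hours);
        return 0;
    }

Before the break-even point EC2's elasticity wins on wall-clock time; after it, the dedicated box wins on cost (ignoring power, admin and utilization).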

I suspect these types of tools will only stay relevant for real-time applications (due to reduced latency vs. EC2) and/or nonstop computing (assuming there's a big price/performance win).


Pretty soon someone will rent you GPUs by the hour, at which point you'd have the advantages of elasticity and price/performance.


That actually sounds like a neat business.


It's easy to imagine EC2 expanding beyond 1U servers and databases to GPGPUs, FPGAs or quantum cores. There's a lot of potential for a GPU cloud - from quick rendering (xtraNormal/nVidia Gelato) to scientific computing (Larrabee?).


Why stop at one frame per core? Once you have easily scalable renderfarms, you could switch to the somewhat slower (though currently less heavily optimized) raytracing approach, and run almost as many cores per frame as you like. That way the only thing limiting your render time is your budget (not that that hasn't always been true, but your setup makes it sound like there is a "rock bottom" limit of one hour per frame that you hit before the end of your budget).


Most modern high-end offline renderers aren't forward renderers, or deferred renderers, _or_ ray tracers. No, they are typically radiosity renderers: http://en.wikipedia.org/wiki/Radiosity

Put simply, ray tracing shoots a ray per screen pixel, which is quite good at glossy surfaces with predictable lighting and a plastic appearance. Radiosity, on the other hand, simulates actual light photons/waves. It is slower and more computationally intensive, but it produces far more realistic results. In particular, it is good at lighting/shadowing, and is more directly applicable to rendering non-plastic surfaces, including sub-surface scattering (like flesh or hair).

All that said, REAL modern renderers are a wacky hybrid of every technique :-P


You're right, I forgot about radiosity - but couldn't that still be split across multiple cores per frame? i.e. simulate one "bundle" of photons per core, and then, presuming you're not too badly I/O bound, you should be able to do some sort of map/reduce. Each core would generate a separate scene of light, and you would add them all together to make the final frame.


Yes, it can be made data-parallel, but not nearly as easily as ray tracing. Ray tracing is easily parallelized by dividing the screen into sub-renders: each box renders a different NxM segment of the final image. This approach doesn't work for radiosity because each patch is dependent on the computations of other patches.
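
As a concrete sketch of that screen-space split (the image size, tile size, node count and render_tile are all made-up placeholders):

    /* Minimal sketch of splitting a frame into NxM tiles and handing
       them out round-robin to render nodes.  render_tile() is a stand-in
       for dispatching the actual render job. */
    #include <stdio.h>

    #define IMG_W   1920
    #define IMG_H   1080
    #define TILE_W  128
    #define TILE_H  128
    #define NODES   30     /* hypothetical number of render boxes */

    static void render_tile(int node, int x0, int y0, int x1, int y1)
    {
        /* placeholder: in reality this would submit the tile to 'node' */
        printf("node %2d renders tile (%d,%d)-(%d,%d)\n", node, x0, y0, x1, y1);
    }

    int main(void)
    {
        int tile = 0;
        for (int y = 0; y < IMG_H; y += TILE_H)
            for (int x = 0; x < IMG_W; x += TILE_W) {
                int x1 = x + TILE_W < IMG_W ? x + TILE_W : IMG_W;
                int y1 = y + TILE_H < IMG_H ? y + TILE_H : IMG_H;
                render_tile(tile % NODES, x, y, x1, y1);
                tile++;
            }
        return 0;
    }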

The primary trick used to parallelize radiosity is to partition the scene spatially. Treat the virtual polygons which separate the rooms as light absorbers. When a room is finished with all of the available light, send its light absorbers across the bus as light emitters to the other processes rendering the other rooms. Iterate back and forth until the light absorbers absorb less than a threshold of light.
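
Here's a toy sketch of that absorber/emitter handoff between two partitions, with each room collapsed to a single scalar so only the iteration structure is visible. All constants are invented, and a real implementation would run each room in its own process and exchange patches, not scalars:

    /* Toy sketch of the absorber/emitter iteration between two rooms.
       Light that hits the virtual boundary in one room is re-emitted into
       the other, until the amount handed back and forth drops below a
       threshold. */
    #include <stdio.h>

    int main(void)
    {
        double unshot[2]   = { 100.0, 0.0 };  /* unshot light in each room      */
        double gathered[2] = { 0.0, 0.0 };    /* light settled in each room     */
        double leak        = 0.2;             /* fraction escaping through the
                                                 virtual boundary polygon       */
        double threshold   = 1e-3;

        for (int pass = 0; unshot[0] + unshot[1] > threshold; pass++) {
            for (int room = 0; room < 2; room++) {
                double absorbed = unshot[room] * leak;   /* hits the boundary   */
                gathered[room] += unshot[room] - absorbed;
                unshot[room]    = 0.0;
                unshot[1 - room] += absorbed;            /* handed to the other
                                                            room as an emitter  */
            }
            printf("pass %d: gathered = %.3f / %.3f\n",
                   pass, gathered[0], gathered[1]);
        }
        return 0;
    }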


It's not quite 960 cores. There are 960 main ALUs across the system, but they come in groups of 8 per item you'd maybe call a core (the stream multiprocessor), all running the same program instruction per clock. There are multiple partitions to execution on top of that, too: 3 SMs share a memory space, 10 of those clusters go to a chip, each chip has access to its own board memory, and there are four chips in the complete Tesla system.

Those 960 ALUs are just the main FP32 and INT32 hardware, too. There are others (FP32 MUL and special function units, and an FP64 ALU too, per core).
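
For what it's worth, part of that hierarchy is visible from CUDA itself; a minimal sketch that just enumerates the devices and their SM counts (the 240-ALUs-per-chip and 4-chip figures are the GT200-class case assumed here):

    /* Each GPU ("chip") shows up as a separate CUDA device;
       multiProcessorCount is the number of SMs on it.  On a GT200-class
       part that's 30 SMs x 8 main ALUs = 240, so a 4-GPU Tesla system
       reports four such devices for 960 ALUs total. */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);

        for (int d = 0; d < count; d++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("device %d: %s, %d multiprocessors\n",
                   d, prop.name, prop.multiProcessorCount);
        }
        return 0;
    }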


This is probably very cool for people doing CG work. It's still too limited for most scientific work, though, because only single-precision floating point is supported. If they made a version of this that supports double-precision arithmetic, and supported it in the BLAS library so that parallelization of matrix operations happens automatically without having to change existing MATLAB code, these things would sell like crazy in the scientific community.


CUDA now supports double precision and BLAS.
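
For anyone wondering what that looks like in practice, here's a minimal double-precision matrix multiply through cuBLAS. This is a sketch using the handle-based cublas_v2 API (newer than what shipped at the time), with arbitrary sizes and fill values, and error checking omitted for brevity:

    /* Minimal double-precision GEMM via cuBLAS: C = A * B, all in FP64. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void)
    {
        const int n = 512;                       /* square matrices for brevity */
        size_t bytes = (size_t)n * n * sizeof(double);

        double *hA = (double *)malloc(bytes);
        double *hB = (double *)malloc(bytes);
        double *hC = (double *)malloc(bytes);
        for (int i = 0; i < n * n; i++) { hA[i] = 1.0; hB[i] = 2.0; }

        double *dA, *dB, *dC;
        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        const double alpha = 1.0, beta = 0.0;
        /* column-major: C = alpha*A*B + beta*C */
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
        printf("C[0] = %f (expect %f)\n", hC[0], 2.0 * n);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(hA); free(hB); free(hC);
        return 0;
    }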


I'd love me 2 of those. Has anyone had experience with CUDA?


I haven't yet, but my summer / fall is going to be spent working with one of my profs to set up a lab with CUDA workstations and a Tesla. If you're interested in GPGPU programming, CUDA works with any system with a GeForce 8XXX or higher GPU. So ~$150 for a 9800GTX will get you 128 equivalent cores, which you can easily double at any time by adding a second card (CUDA addresses each GPU separately, so SLI isn't even needed).

What I'm interested in is seeing how OpenCL plays out, as Khronos seems to have the entire industry behind them (with the exception of Microsoft).


Yes.

It's pretty nice. As nice as C can get, that is... The biggest hurdle for me was thinking data-parallel as opposed to sequentially or even task-parallel.
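
A minimal example of what that shift in thinking looks like: instead of writing a loop, you write the loop body once and launch a thread per element. The SAXPY sketch below is self-contained but the sizes and values are arbitrary:

    /* "Think data-parallel": one thread per array element instead of a loop. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                     /* guard: the grid may overshoot n */
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }

        float *dx, *dy;
        cudaMalloc((void **)&dx, bytes);
        cudaMalloc((void **)&dy, bytes);
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        saxpy<<<blocks, threads>>>(n, 3.0f, dx, dy);

        cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %f (expect 5.0)\n", hy[0]);

        cudaFree(dx); cudaFree(dy);
        free(hx); free(hy);
        return 0;
    }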


I use CUDA for finance-based Monte Carlo simulations. Unless you're at a place with a huge cluster you have access to, some problems can't be solved without CUDA.


One downside to CUDA is that it's massively powerful for single-precision FP, but double-precision FP performance is less than 1/10 of single-precision performance.

When precision is needed CUDA is much less useful: say you're running 10^10 simulations, then with single-precision FP you will only have a result accurate to around 5 significant figures.
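
You can see why on the host without any GPU at all: a naive single-precision accumulation stops making progress once the running total dwarfs each addend. The toy below uses 10^8 terms rather than 10^10 just to keep it quick:

    /* Why single precision hurts long accumulations: once the float running
       total passes 2^24 (about 16.7 million), adding 1.0f no longer changes
       it.  Host-only demo. */
    #include <stdio.h>

    int main(void)
    {
        long   n     = 100000000L;   /* 1e8 terms, enough to show the effect */
        float  sum_f = 0.0f;
        double sum_d = 0.0;

        for (long i = 0; i < n; i++) {
            sum_f += 1.0f;
            sum_d += 1.0;
        }

        printf("float  sum: %.1f\n", sum_f);   /* stalls around 16777216   */
        printf("double sum: %.1f\n", sum_d);   /* 100000000.0 as expected  */
        return 0;
    }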


AFAIK some experimental versions of GHC can compile Haskell to GPUs, and you only lose a pretty small constant factor.


From the title, I was expecting something about the onboard computer of the acclaimed nerdmobile.


From what little I know of the subject, I can see this becoming the dominant platform for game servers, especially MMO games within a few years.


So what would be the primary use of something like this?


Either a map-reduce type operation, or a job which can be split into multiple processes, with all their data cobbled together into a result at the end (or periodically).

Optical character recognition could be done by, for instance, recognising each letter on a different core.


Assuming you know where each letter starts and ends. So each word maybe.



Go to Technologies -> CUDA -> Cuda Applications and you'll find a screenful of applications.


How many FPS in Quake?

(just kidding)


New Folding@home super-cruncher?


Wow, a link I posted 2 days ago (also .co.uk) got frontpage'd. :-D

I'm gonna love one of these babies for my AI work!



