Nvidia Opens CUDA Platform, Releases Compiler Source Code (nvidia.com)
302 points by Garbage on Dec 14, 2011 | 28 comments



IMO, open sourcing their GPU libraries would be a much bigger deal than only open sourcing the compiler. I would like to see CUBLAS, CUFFT, CUSPARSE, CURAND, etc all get opened up to the community.

The pain is not in compiling GPU code; rather, the pain is in writing good GPU code. The major difference between NVIDIA and AMD (and the major edge NVIDIA has over AMD) is not as much the compiler as it is the libraries.
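To make that concrete, here is a minimal sketch of the library route: a single-precision matrix multiply handed off to CUBLAS instead of a hand-written kernel (device pointers assumed already allocated, error checking omitted):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // C = alpha*A*B + beta*C for n x n matrices, column-major as CUBLAS expects.
    void gemm_example(const float *d_A, const float *d_B, float *d_C, int n) {
        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n,
                    &alpha, d_A, n,
                    d_B, n,
                    &beta, d_C, n);
        cublasDestroy(handle);
    }

All of the hard, hand-tuned work lives behind that one call.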

Of course, I'm biased, because I work at AccelerEyes and we do GPU consulting with our freely available, but not open source, ArrayFire GPU library, which has both CUDA and OpenCL versions.


> the pain is in writing good GPU code

A viable alternative is to not write the GPU code yourself. Write a code generator in Scala that spits out GPU code in C. For details see Claudio Rebbi's work, which uses Scala as a higher-level code generator for CUDA to solve the Dirac-Wilson equation on the lattice (http://wwwold.jlab.org/conferences/lattice2008/talks/poster/... ). In finance, we are actively looking at CUDA for derivative pricing problems in risk analytics. None of us wants to actually write GPU code in C, and we do have a considerable amount of risk analytics work being done in Scala, so a code generator might actually be the way to go.


As an author of that paper, I can tell you that the code generator was rather simple and mainly used to perform loop unrolling, avoid explicit indexing, and replicate bits of code that couldn't quite be encapsulated in inline functions. It's possible to go further, but this sort of metaprogramming doesn't really eliminate the need to write in CUDA C.
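For illustration, this is the flavor of rewrite the generator automates, shown on a toy kernel rather than anything from the actual code:

    // Hand-written form, with an inner loop over components:
    __global__ void axpy4(float *y, const float *x, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        for (int c = 0; c < 4; ++c)
            y[4 * i + c] += a * x[4 * i + c];
    }

    // What the generator emits: the loop unrolled and every index spelled out.
    __global__ void axpy4_unrolled(float *y, const float *x, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        y[4 * i + 0] += a * x[4 * i + 0];
        y[4 * i + 1] += a * x[4 * i + 1];
        y[4 * i + 2] += a * x[4 * i + 2];
        y[4 * i + 3] += a * x[4 * i + 3];
    }

The real generated kernels are far larger, but the transformation has the same shape.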

For what it's worth, we long ago abandoned Scala in favor of Python for the code generator, just to make it more accessible to others interested in working on the project (generally particle physicists by training): http://lattice.github.com/quda/


And who writes good GPU code generators if the libraries are poorly understood and / or closed source? Certainly, generators are the way to go for a lot of uses, but not all, and someone still needs to write the generators.


Over the last 5 years, I've seen a ton of hot air blown about wrt auto-GPU code generation. The latest hot air is about how magical directives make everything run fast.

Truth is, compilers and code generators are crappy.

If you really want to get good performance, you either have to write your own low-level GPU kernels, or use a library of functions that have already been written at a low-level.

All other hot air, while interesting, has yet to be proven at scale on more than a few limited use cases.

Another disclaimer: I work on this, http://accelereyes.com/arrayfire


There are two parts to writing good GPU code: parallelizing the algorithm and writing the kernels. Automating one part will not save time on the other.

Based on practical experience, the compilers are pretty good nowadays; the fine details of the kernel do not matter that much. The performance issues tend to revolve around the use of local memory, bank conflicts, and how much work one kernel instance does, which require hand tuning, and in these cases the compilers underperform. Thankfully, poor kernels are 'just' a constant factor in the overall time complexity of the algorithm.
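As a concrete example of the kind of hand tuning meant here, take the standard shared-memory transpose trick (toy version, assuming n is a multiple of the tile size and a 32x32 thread block): padding the tile by one column keeps the threads of a warp out of the same memory bank when they read a column.

    #define TILE 32

    __global__ void transpose(float *out, const float *in, int n) {
        // The +1 padding shifts each row by one bank, so column reads
        // by a warp hit 32 different banks instead of the same one.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;   // swap block coordinates
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
    }

Compilers don't apply that padding for you; it comes from knowing the hardware.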

At a higher level, the most important thing is how the actual algorithm is described. If the algorithm is described as a serial one, there is no automated way (and most likely never will be a general way) of parallelizing it, except by running it to check the data dependencies, at which point you already have the result; and since the dependencies can change based on the inputs, the result of one run cannot be generalized.

This could probably be proved by a method similar to the halting problem: write a program that calls the autoparallelizer on itself, and if the parallelizer says there is no data dependency between two parts, make them dependent; if it says there is one, make them independent.

So let it be clear: there is no way whatsoever to take the hard part away (thinking in parallel). Nothing will take a bunch of serial code in and spit parallel programs out.


are you confusing syntax and semantics? there's a hurdle that you need to cross with writing cuda code because it's C-like and easy to make "stupid mistakes". a code generator would help you there. but the harder part is getting the algorithm correct (and optimal, for a range of sizes of data). a generator is not so much use there (except for polymorphism, where templating helps).

or am i missing something? how do you see code generators helping you get algorithms right?


Code generators like ATLAS let you generate a thousand variations of the code and pick the fastest one.
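The same idea maps to GPUs with nothing more exotic than templates and a timer; a bare-bones sketch (made-up kernel, only three variants instead of a thousand):

    #include <cstdio>
    #include <cuda_runtime.h>

    // One "variant" per unroll factor, selected at compile time.
    template <int UNROLL>
    __global__ void scale(float *x, float a, int n) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * UNROLL;
        #pragma unroll
        for (int u = 0; u < UNROLL; ++u)
            if (i + u < n) x[i + u] *= a;
    }

    template <int UNROLL>
    float time_variant(float *d_x, int n) {
        int threads = 256;
        int blocks = (n / UNROLL + threads - 1) / threads;
        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start);
        scale<UNROLL><<<blocks, threads>>>(d_x, 2.0f, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start); cudaEventDestroy(stop);
        return ms;
    }

    int main() {
        const int n = 1 << 22;
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));
        float t1 = time_variant<1>(d_x, n);
        float t2 = time_variant<2>(d_x, n);
        float t4 = time_variant<4>(d_x, n);
        printf("unroll 1: %.3f ms  unroll 2: %.3f ms  unroll 4: %.3f ms\n", t1, t2, t4);
        cudaFree(d_x);
        return 0;
    }

The variant that wins differs from one GPU generation to the next, which is exactly why the sweep is done empirically.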


Also, OpenCL is not going away, even if someone figured out how to get CUDA code to run well on ATI GPUs. OpenCL is gaining a lot of traction with mobile GPU vendors too (e.g. ARM Mali, Imagination PowerVR, Qualcomm Adreno, etc.).


I think you meant OpenCL is not going away, and I agree.


right :) i fixed it.


The title of this post is slightly misleading. The actual article does not state that Nvidia has released the source code yet, but only that they are planning to do so in the near future. A signup form is provided so that you can be sent an e-mail when Nvidia actually does release the source code.


There have been a few comments about using specialized code generators, for example Theano [1], written in Python, and, as mentioned in another comment, QUDA. I do not have the background to understand them well, but I find them very interesting.

One question that I have is whether anyone has looked at adapting or using the IF2 backend of the Sisal programming language [2] for these. I ask because some of the optimizations that Theano does remind me of things that IF2 is supposed to be doing too. Sisal was written with the old-school vector machines and supercomputers in mind, but has a backend that depends only on the availability of pthreads. I suspect that it might be possible to add support for SSE and its ilk.

[1] http://deeplearning.net/software/theano/

[2] http://sourceforge.net/projects/sisal/


This answers the #1 objection to using CUDA instead of OpenCL: vendor lock-in.

What it doesn't answer is who's going to write the compilers and whether they will ever happen.

But it does prove NVIDIA is still a player in the many-core game and that there are still a few more rounds to go before there's a winner.


Key wording to observe here -- they said they'd release the source code, not that it would be under an open source license.

They're "opening the platform". We'll see what they actually do.


Unfortunately, it does not say what license will be used, which is probably relevant if they want to create an ecosystem around the compiler.


I agree that the exact licensing terms are somewhat relevant if you intend to depend on this software.

However, it's worth noting that the compiler in question is LLVM based. So you can construct your own compiler frontend that generates LLVM IR code that can be compiled for CUDA by their backend. It's very likely that there are some CUDA-specific LLVM intrinsics, so the frontend will not be entirely independent of CUDA compiler licensing terms but at least now you have a somewhat open interchange format to use between your frontend and the CUDA backend.


Until Mesa/Gallium implements a CUDA stack, I see no point in caring what Nvidia does or doesn't do with their source code.

And, most likely, CUDA will never be done by Mesa/Gallium unless quite a few people porting legacy CUDA get together and make it happen.

OpenCL is an actual multi-vendor-supported standard; even Nvidia is part of the Khronos OpenCL group, which slightly implies that even Nvidia has admitted defeat.


we just need documentation to understand what the generated code does then, as AFAIK the output is code for undocumented hardware.


There's a good chance the LLVM backend will emit PTX, not machine code. PTX is well documented [1]. Under such a system, the generated PTX would be JITed at runtime by the driver.
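A rough sketch of what that runtime flow looks like on the driver API side, assuming you already have the PTX text in hand (kernel name and launch configuration are made up):

    #include <cuda.h>   // CUDA driver API

    // ptx holds a NUL-terminated PTX module containing "my_kernel";
    // error checking omitted for brevity.
    void launch_from_ptx(const char *ptx, CUdeviceptr d_data, int n) {
        CUdevice dev;  CUcontext ctx;  CUmodule mod;  CUfunction fn;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        // The driver JIT-compiles the PTX to machine code for this GPU here.
        cuModuleLoadData(&mod, ptx);
        cuModuleGetFunction(&fn, mod, "my_kernel");

        void *args[] = { &d_data, &n };
        cuLaunchKernel(fn, (n + 255) / 256, 1, 1,   // grid
                           256, 1, 1,               // block
                           0, NULL, args, NULL);    // smem, stream, params, extra
        cuCtxSynchronize();
        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
    }

This is also roughly what happens today when an application ships only PTX and runs on a newer GPU than it was built for.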

Note that LLVM already has a (very experimental and not complete) PTX backend [2].

[1] http://developer.download.nvidia.com/compute/cuda/3_0/toolki...

[2] http://llvm.org/releases/3.0/docs/ReleaseNotes.html#whatsnew


I'm pretty sure this is the case, based on playing with the OpenCL side of CUDA. If the '--version' flag is passed to the OpenCL compiler (at least the one that ships with CUDA 3.0), info from an LLVM build from a year ago is dumped. The '-cl-nv-verbose' flag is also documented to pass '--verbose' to the ptxas assembler.


It is undocumented, but you can get a fairly decent idea of what is going on if you have a good understanding of such architectures in general, from the sparse documentation they do provide, and by running microbenchmarks and using tools such as decuda (https://github.com/laanwj/decuda/wiki).

Also, people working with those devices are often scientists who are eager to share what they found out (if only to say "You're doing it wrong!"). See for example Vasily Volkov's work: http://www.cs.berkeley.edu/~volkov/


It's slightly better documented these days, now that cuobjdump is bundled with the compiler tools. It allows SASS output, which is supposed to be the native machine code of the Fermis.


This sounds very exciting! I guess it's not totally related, but I hope VLC Player will get better Nvidia hardware acceleration soon...!


It's pretty much not related at all. VLC is a player UI client; it doesn't have codecs of its own. You should be wishing for better GPU acceleration in libavcodec if anything (but even that isn't implemented with CUDA).


VLC is more than a UI; they have to implement the decoders in libavcodec, and they do a lot of work to package things underneath. FFmpeg also supports VDPAU (the Nvidia Linux video acceleration API), but it would still be some work for VLC to implement it.


VLC does use VA-API on Linux, though. I guess the rationale is that people with high-end AMD and nVidia GPUs are likely to have plenty of CPU horsepower, and acceleration is mostly needed for people with those Intel IGPs that VA-API supports.

(EDIT: the real reason VA-API is used over VDPAU or XvBA is probably pragmatic and related to driver stability)


After having a look at VA-API vs VDPAU, I must say VDPAU is much nicer. VDPAU allows you to define times when frames will be shown, so vsync is handled fully in hardware; more than one transparent sub-picture can also be shown at one time.



