Since I usually have only one GPU, I'm still waiting for the day I can do this:
using CUDA
# define a kernel
@kernel function kernel_vadd(a, b, c)
    i = blockId_x() + (threadId_x()-1) * numBlocks_x()
    c[i] = a[i] + b[i]
end
# create some data
dims = (3, 4)
a = round(rand(Float32, dims) * 100)
b = round(rand(Float32, dims) * 100)
c = Array(Float32, dims)
# execute!
@cuda kernel_vadd(CuIn(a), CuIn(b), CuOut(c))
# verify
@show a+b == c
i.e. no setup, no "CUDA context" (whatever that is), and no teardown afterwards. I understand that manual memory management is almost unavoidable here, but it seems that most of it could be automated in the most common case of "I have a couple of large arrays and a few operations I want to perform."
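For that common case, something like the following array-level sketch would be ideal (to be clear, CuArray and the upload/download conversions here are purely illustrative names, not an API any existing package promises):
# hypothetical sketch: CuArray and the conversions below are
# illustrative names only, not part of the current package
a = rand(Float32, 1000)
b = rand(Float32, 1000)
da = CuArray(a)    # upload to the GPU
db = CuArray(b)
dc = da .+ db      # elementwise add, executed on the GPU
c = Array(dc)      # download the result back to the host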
Torch7 has amazing GPU support. If you want to move a tensor (array) to the GPU, you just call array:cuda(), and every operation you do on it from then on runs on the GPU.
I have seen a couple of Clojure-based GPU compilers, although they seemed to be proofs of concept, and I'm not sure how general-purpose they were. It would be nice to eventually get away from treating the GPU as a whole other computer with its own compiler, but maybe that distinction is for the best for the time being.
It is definitely possible to do that in Julia. The reason I haven't yet is purely a matter of priorities: I first focused on wrapping the basic primitives (calling a kernel, marshalling arguments, etc.) in a user-friendly way.
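Concretely, launching a vadd like the parent's currently still involves some explicit ceremony, roughly like this (names taken from the repository's examples, so the exact spelling may shift):
# assumes the a, b, c, dims and a kernel_vadd compiled with this
# package's annotations, as in the parent's snippet
dev = CuDevice(0)
ctx = CuContext(dev)

# launch len blocks of one thread each, matching the index computation
len = prod(dims)
@cuda (len, 1) kernel_vadd(CuIn(a), CuIn(b), CuOut(c))

# explicit tear-down
destroy(ctx)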
Have you tried Thrust? For a limited but very useful set of operations, it is nearly that painless, as long as you don't mind a generous helping of C++ template boilerplate.
The content of the project aside (which is very exciting but early stage), I'm really impressed with the author's overview of how others can take up the project and move it forward. I've seen too many projects die because when the authors move onto other things they just leave an incomplete git repo and no clear plan for what happens next. It'd be great of course if there were someone lined up to take the reins, but the crucial thing is that in this case someone could ascertain the project's state and possible next steps even months after maleadt is out of the picture.
Thanks for the kind words! This was exactly what I was aiming for: the code (or insights) to be reusable without too much hassle. All the more so because part of it was developed during my PhD; I wouldn't want to know how many failed or unpublishable research results are stowed away on some grad student's computer.
When working in Julia, what are the benefits of tying oneself to CUDA (and not running accelerated on on-die graphics or on AMD GPUs) -- or does NVIDIA not work reliably/well with OpenCL?
OpenCL.jl is purely the runtime part, i.e. it still requires you to write the OpenCL kernel code by hand, after which you can use the Julia wrapper to manage and launch that code.
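For example, a vector add with OpenCL.jl looks roughly like this (based on its README examples, so details may differ between versions), with the kernel itself still written in OpenCL C:
import OpenCL
const cl = OpenCL

# the kernel is still plain OpenCL C, kept as a string
const vadd_src = "
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c) {
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
"

a = rand(Float32, 50_000)
b = rand(Float32, 50_000)

device, ctx, queue = cl.create_compute_context()

# upload the inputs and allocate the output buffer
a_buf = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=a)
b_buf = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=b)
c_buf = cl.Buffer(Float32, ctx, :w, length(a))

# compile the OpenCL C source and launch the kernel
prog = cl.Program(ctx, source=vadd_src) |> cl.build!
k = cl.Kernel(prog, "vadd")
queue(k, size(a), nothing, a_buf, b_buf, c_buf)

c = cl.read(queue, c_buf)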
My project also provides compiler support for lowering Julia to CUDA assembly, so you don't need to write CUDA code yourself. On top of that, my runtime contains (proof-of-concept) higher-level wrappers, making it easier to call CUDA kernels, upload data, etc.
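In practice that means the kernel itself is ordinary Julia, roughly like this (the annotation and intrinsic names follow the current examples and are likely to evolve):
# a kernel written in Julia and compiled down to PTX, no CUDA C involved
@target ptx function kernel_vadd(a, b, c)
    i = blockIdx().x + (threadIdx().x-1) * gridDim().x
    c[i] = a[i] + b[i]
    return nothing
end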
Concerning tying yourself to the NVIDIA stack: it's still the most mature and versatile toolchain, which is why I picked it in the first place. My long-term plan was to switch over to SPIR (or some other cross-vendor stack) as soon as possible. At that point, switching user code over to the new back-end would (theoretically) not require that much effort, since the kernels are written in Julia code instead of CUDA C (except for the runtime interactions, of course).