
Here's a screenshot: https://julialang.org/assets/blog/nvvp.png. Or a recent PR where you can see NVTX ranges from Julia: https://github.com/JuliaGPU/CUDA.jl/pull/760


Thanks! Now I believe! :)


Yeah, see this section of the documentation: https://juliagpu.gitlab.io/CUDA.jl/development/profiling/. CUDA.jl also supports NVTX, wraps CUPTI, etc. The full extent of the APIs and tools is available.

Source line association when using PC sampling is currently broken due to a bug in the NVIDIA drivers (a segfault when parsing the PTX debug info emitted by LLVM), but I'm told that may be fixed in the next driver.
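
For illustration, a minimal sketch of what that looks like in user code (the `CUDA.@profile` macro and NVTX integration are as documented above; the rest is hypothetical example code):

    using CUDA

    a = CUDA.rand(1024)

    # when run under nvprof/Nsight, only this region is profiled,
    # and the NVTX range shows up as a named span on the timeline
    CUDA.@profile begin
        CUDA.NVTX.@range "broadcast" begin
            b = a .+ 1f0
        end
        synchronize()
    end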


Nice! I set a reminder to check back in a month.


Impressive! The PTX to SPIR-V compiler must have been quite a bit of work; what's the coverage of the ISA like?

With oneAPI I had hoped to get the inverse, a oneAPI implementation for NVIDIA hardware, but I don't think the CUDA driver API is low-level enough to do so (e.g. explicit vs. global contexts). And yes, I know of Codeplay's implementation of DPC++ for NVIDIA GPUs, but that doesn't implement the oneAPI Level Zero APIs, so it is not usable from other languages.


I'm from Codeplay, and I'm not sure what you mean in your comment about Level Zero. Intel have developed a back-end for DPC++ that supports Intel processors through OpenCL and SPIR-V. Codeplay has implemented a back-end that supports Nvidia processors by using PTX instructions, in the same way that native CUDA code does. PTX is the equivalent of SPIR for Nvidia processors. Maybe I am misunderstanding, so apologies if that is the case.


Yes, that's fair. I focused on CUDA.jl because it is the most mature, easiest to install, etc. but as I mentioned we're actively working on generalizing that support as much as possible, and as a result support for AMD (AMDGPU.jl) and Intel (oneAPI.jl) GPUs is rapidly catching up.


This is a complete novice, ill-informed question, so forgive it in advance: why have an AMD-specific backend at all? Couldn't you just use AMD's HIP/HIPIFY tool on the CUDA backend and get an AMD-friendly version out?

https://github.com/ROCm-Developer-Tools/HIP

I realize these sorts of tools aren't magic and whatever it spits out will need work, but it seems like a really good starting point for AMD support, with a lower overhead for growth.

After the original CUDA bits can "cross-compile", the workflow is greatly reduced, right?

Workflow:

- update CUDA code

- push through the HIPIFY tool

- fix what is broken (if you can fix it on the CUDA side)

After enough iterations, the CUDA code will grow friendly to HIPification...


> This is a complete novice, ill-informed question, so forgive it in advance: why have an AMD-specific backend at all? Couldn't you just use AMD's HIP/HIPIFY tool on the CUDA backend and get an AMD-friendly version out?

HIP and HIPify only work on C++ source code, via a Perl script. Since we start with plain Julia code, and we already have LLVM integrated into Julia's compiler, it's easiest to just change the LLVM "target" from Native to AMDGPU (or NVPTX in CUDA.jl's case) to get native machine code, while preserving Julia's semantics for the most part.
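
To make that concrete, here's a rough sketch (not code from the packages; the AMDGPU.jl intrinsic names in particular are assumptions on my part) of how one generic Julia kernel body can target either back-end. Only the indexing intrinsics and the launch macro differ, while the compiler swaps the LLVM target underneath:

    using CUDA, AMDGPU

    # generic kernel body: plain Julia, nothing vendor-specific
    function vadd!(c, a, b, i)
        @inbounds c[i] = a[i] + b[i]
        return
    end

    # NVPTX target, via CUDA.jl:
    cuda_vadd!(c, a, b) = vadd!(c, a, b, threadIdx().x)
    # @cuda threads=length(c) cuda_vadd!(c, a, b)

    # AMDGPU target, via AMDGPU.jl:
    roc_vadd!(c, a, b) = vadd!(c, a, b, workitemIdx().x)
    # @roc groupsize=length(c) roc_vadd!(c, a, b)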

Also, interfacing to ROCR (AMD's implementation of the Heterogeneous System Architecture or HSA runtime) was really easy when I first started on this, and codegen through Julia's compiler and LLVM is trivial when you have CUDAnative.jl (CUDA.jl's predecessor) to look at :)

I should also mention that not everything CUDA does maps well to AMD GPUs; CUDA's streams are generally in-order (blocking), whereas AMD's queues are non-blocking unless barriers are scheduled. Conversely, things like hostcall (calling a CPU function from the GPU) don't have an obvious equivalent in CUDA.


Thank you for taking the time! I found this quite helpful.


Something that is hinted at, but not spelled out explicitly, in our posts is that AMD actively upstreams and maintains an LLVM back-end for their GPUs, so it really is a matter of switching the binary target for the generated code, at least in theory :)


Since we use fat array objects rather than raw pointers, we know the size of the array and can perform bounds checks at run time. We then have a mechanism to throw an exception and signal it to the CPU, which displays it there. That's obviously quite expensive, so you can disable it with that annotation (the Julia debug setting also controls the granularity, and thus how expensive the exception handling is). It's fairly primitive, i.e. no full-featured exception handling (for now), but it has proven very useful already.
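
As a minimal sketch of that behavior (assuming the annotation in question is Julia's usual `@inbounds`):

    using CUDA

    function set!(a)
        i = threadIdx().x
        a[i] = i             # bounds-checked: out-of-bounds access throws,
                             # and the exception is signalled to the CPU
        # @inbounds a[i] = i # annotated: the check is elided
        return
    end

    a = CUDA.zeros(Int, 4)
    @cuda threads=8 set!(a)  # 8 threads indexing a 4-element array
    synchronize()            # the bounds error surfaces here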


How do you terminate the CUDA kernel when a bounds violation is encountered by a single thread? I don't think the CUDA API exposes a mechanism to do that safely.


You can emit `trap` or `exit` in the PTX code (although that has exposed many bugs in the PTX assembler because it does not expect that kind of often divergent control flow). But even if you'd just have the kernel return and otherwise produce invalid results, the fact that you can then report a bounds error instead of silently corrupting data and/or generating a fatal ERROR_ILLEGAL_ACCESS (requiring an application restart) is a significant usability improvement.
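
For the curious, a sketch of how such an abort can be expressed from Julia (illustrative helpers, not CUDA.jl internals):

    # LLVM's trap intrinsic, which the NVPTX back-end lowers to PTX `trap`
    trap() = ccall("llvm.trap", llvmcall, Cvoid, ())

    # or PTX `exit` directly, via inline assembly
    exit_ptx() = Base.llvmcall(
        """call void asm sideeffect "exit;", ""()
           ret void""",
        Cvoid, Tuple{})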


The Julia array abstractions make it so that most code is vendor-neutral already, and you execute on whatever GPU back-end you want by using the appropriate array type. For vendor-neutral kernel programming there's GPUArrays.jl and KernelAbstractions.jl, though neither is currently very user-friendly (both are actively used as building blocks for user-facing applications and APIs).
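
For example (a quick sketch; the point is that `f` contains nothing vendor-specific):

    using CUDA  # or AMDGPU, oneAPI, ...

    f(x) = sum(x .^ 2 .+ 1)

    x  = rand(Float32, 1024)  # Base.Array: runs on the CPU
    dx = CuArray(x)           # CUDA.jl's array type: runs on the GPU

    f(x) ≈ f(dx)              # same code, different back-ends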


Great talk! Any thoughts on Intel's oneAPI?


Thanks. First I've heard of it, in fact, so I don't have any thoughts on it. It's interesting though, and would definitely be worth adding to a list of attempts to build portable layers.


I'm looking at targeting it from Julia, and the lower-level (Level Zero) API seems rather nice, resembling the CUDA driver API but building on SPIR-V. It's also nice how the API is decoupled from the implementation, so let's hope more vendors implement it (apparently Codeplay is working on a CUDA-based implementation).


That link should probably have been https://juliacomputing.com/industries/gpus.html, but that's rather old content. A better overview is https://juliagpu.org/, and you can find a demo of Julia's CUDA kernel programming capabilities here: https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/


I've changed the link, thanks!


Our view is that to get performance out of a system (here CUDA), it's better not to start abstracting it right away. So we have CUDAnative.jl and CUDAdrv.jl for fairly low-level CUDA programming, albeit in a high-level language. However, with CuArrays.jl we implement the Julia array interface for CUDA GPUs. That means you can write array code for one platform (CPU using Base.Array) and start using hardware accelerators by just switching the array type (CUDA GPU using CuArray). Of course, real-life applications might still need to use CUDA specific functionality for one reason or another, but at least you can get most of the way without platform-specific programming.
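
As an illustration of those two levels (a sketch against the package APIs of that time, so treat the details loosely):

    using CUDAnative, CuArrays

    # high level: the Julia array interface, on the GPU by switching array types
    a = CuArray(rand(Float32, 1024))
    b = a .* 2f0

    # low level: a hand-written kernel, still plain Julia
    function scale!(x, s)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(x)
            @inbounds x[i] *= s
        end
        return
    end

    @cuda threads=256 blocks=cld(length(a), 256) scale!(a, 2f0)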


Author here, happy to answer any questions! We've been developing and maintaining this toolchain for a while now, so the relevant packages (CUDAnative.jl for kernel programming, CuArrays.jl for a GPU array abstraction) are much more mature. Our focus has recently been on implementing a common base of array operations that can be used across devices (GPU, CPU, etc.), so that users can develop against the base CPU array type, quickly benefit from a GPU by switching to CuArrays, and fall back to CUDA-specific functionality from CuArrays/CUDAnative only when they need something custom.

