Fair warning that the dominant model in GPU compilers is to represent the machine as if GPUs still looked the way they did long ago. Expect to see i32 used to represent a machine vector of ints where you might reasonably expect <32 x i32>. There are actual i32 scalar registers as well, which are cheap to branch on, and those are also represented in IR as i32, i.e. the same type as the vector. There are intrinsics that sometimes distinguish between the two.
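Concretely, here is roughly what that looks like in LLVM IR. This is a minimal sketch assuming an AMDGPU-style wave32 target; the function name is invented, but llvm.amdgcn.readfirstlane is a real AMDGPU intrinsic:

    ; Every active lane executes this, so the i32s are morally a <32 x i32>
    ; spread across the wave, one VGPR lane each. The types do not say so.
    define i32 @per_lane_add(i32 %a, i32 %b) {
      %sum = add i32 %a, %b    ; a vector ALU op, despite the scalar type
      ret i32 %sum
    }

    ; A per-wave scalar value has exactly the same IR type. Intrinsics like
    ; this one are how a value gets pinned to the scalar unit.
    declare i32 @llvm.amdgcn.readfirstlane(i32)

Nothing in the types tells you which i32 is which; that information lives in analyses elsewhere in the backend.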
This makes a rather spectacular mess of the tooling. Instead of localising the CUDA semantics in clang, we scatter them throughout the entire compiler pipeline, where they do especially nasty things to register allocation and generally obstruct non-CUDA programming models. It's remarkably difficult to persuade GPU people that this is a bad thing.
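The register allocation point is easiest to see with a branch. In the sketch below (invented function, ordinary LLVM IR), nothing says whether %flag is wave-uniform, so whether this becomes a cheap scalar branch on an SGPR or an exec-mask dance over VGPRs depends on a divergence analysis running much later, far from where the language semantics were actually known:

    define i32 @select_path(i32 %flag, i32 %a, i32 %b) {
      ; If %flag is uniform across the wave this is one cheap scalar branch;
      ; if it is divergent, the backend must predicate both sides instead.
      %cond = icmp ne i32 %flag, 0
      br i1 %cond, label %then, label %else
    then:
      ret i32 %a
    else:
      ret i32 %b
    }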
Also, the GPU programming languages use very large compiler runtimes to paper over, to a degree, the CPU-host/GPU-target assumption, which likewise dates from long ago, so expect to find a lot of complexity in multiple libraries acting somewhat like compiler-rt. Those are optional in reality, but the compiler usually emits a lot of symbols that resolve to various vendor libraries.
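For a taste of what those symbols look like: __ockl_get_local_id is a real symbol from ROCm's device-libs, though the signature here is from memory and the calling code is an invented sketch:

    ; Device code compiled by clang tends to be full of external calls like
    ; this, resolved at link time against the vendor's bitcode libraries.
    declare i64 @__ockl_get_local_id(i32)

    define i64 @my_thread_id_x() {
      %id = call i64 @__ockl_get_local_id(i32 0)  ; dimension 0, i.e. x
      ret i64 %id
    }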