
[I work on SCALE]

Mapping inline PTX to AMD machine code would indeed suck. Converting it to LLVM IR right at the start of compilation (when the initial IR is being generated) is much simpler, since it then gets "compiled forward" with the rest of the code. It's as if you wrote C++/intrinsics/whatever instead.
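A hypothetical sketch of what that means in practice (the specific instruction and function names here are illustrative, not taken from SCALE):

```cuda
// An inline-PTX read of the lane ID, as it might appear in CUDA code:
__device__ unsigned lane_id_ptx() {
    unsigned lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    return lane;
}

// After frontend translation, the effect is "as if" you had written
// ordinary CUDA instead, which then compiles forward normally:
__device__ unsigned lane_id_plain() {
    return threadIdx.x % warpSize;  // assumes 1-D thread blocks
}
```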

Note that nvcc accepts a different dialect of C++ from clang (and hence hipcc), so there is in fact more that separates CUDA from HIP (at the language level) than just find/replace. We discuss this a little in [the manual](https://docs.scale-lang.com/manual/dialects/)

Handling differences between the atomic models is, indeed, "fun". But since CUDA is a programming language with documented semantics for its memory consistency (and so is PTX) it is entirely possible to arrange for the compiler to "play by NVIDIA's rules".
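As a sketch of why that's tractable: PTX atomics spell out their scope and ordering explicitly, and CUDA C++ exposes the same documented model through libcu++, so there is a well-defined target to compile against. (This example is illustrative; it assumes CUDA 11.x+ for `cuda::atomic_ref`.)

```cuda
#include <cuda/atomic>

// An inline-PTX atomic with explicit semantics (.relaxed) and
// scope (.gpu), exactly as the PTX memory model documents it:
__device__ unsigned add_ptx(unsigned* p, unsigned v) {
    unsigned old;
    asm volatile("atom.relaxed.gpu.global.add.u32 %0, [%1], %2;"
                 : "=r"(old) : "l"(p), "r"(v) : "memory");
    return old;
}

// The CUDA-level equivalent with the same documented semantics,
// which a compiler can "play by NVIDIA's rules" against:
__device__ unsigned add_cxx(unsigned* p, unsigned v) {
    cuda::atomic_ref<unsigned, cuda::thread_scope_device> a(*p);
    return a.fetch_add(v, cuda::memory_order_relaxed);
}
```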



Huh. Inline assembly is strongly associated in my mind with writing things that can't be represented in LLVM IR, but in the specific case of PTX - you can only write things that ptxas understands, and that probably rules out wide classes of horrendous behaviour. Raw bytes being used for instructions and for data, ad hoc self modifying code and so forth.

I believe nvcc is roughly an antique clang build hacked out of all recognition. I remember it rejecting templates with 'I' as the type name and working when changing to 'T', nonsense like that. The HIP language probably corresponds pretty closely to clang's cuda implementation in terms of semantics (a lot of the control flow in clang treats them identically), but I don't believe an exact match to nvcc was considered particularly necessary for the clang -x cuda work.

The PTX-to-LLVM-IR approach is clever. I think upstream would be game for that, feel free to tag me on reviews if you want to get that divergence out of your local codebase.


I certainly would not attempt this feat with x86 `asm` blocks :D. PTX is indeed very pedestrian: it's more like IR than machine code, really. All the usual "machine-level craziness" that would otherwise make this impossible is just unrepresentable in PTX (though you do run into cases of "oopsie, AMD don't have hardware for this so we have to do something insane").


It's a beautiful answer to a deeply annoying language feature. I absolutely love it. Yes, inline asm containing PTX definitely should be burned off at the compiler front end, regardless of whether it ultimately codegens as PTX or something else.

I've spawned a thread on the LLVM Discourse asking if anyone else wants that as a feature upstream: https://discourse.llvm.org/t/fexpand-inline-ptx-as-a-feature... That doesn't feel great - you've done something clever in a proprietary compiler and I'm suggesting upstream reimplement it - so I hope that doesn't cause you any distress. AMD is relatively unlikely to greenlight me writing it so it's probably just more marketing unless other people are keen to parse asm in string literals.


nvcc is nowhere near that bad these days, it supports most C++ code directly (for example, I've written kernels that include headers like <span> or <algorithm> and they work just fine).
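A minimal sketch of that kind of kernel (assuming a C++20-capable nvcc, e.g. `-std=c++20`; the kernel name is made up for illustration):

```cuda
#include <algorithm>
#include <span>

// Standard headers used directly in device code, as described above:
__global__ void clamp_all(int* data, std::size_t n) {
    std::span<int> s(data, n);
    std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < s.size())
        s[i] = std::clamp(s[i], 0, 255);  // constexpr, callable on device
}
```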


NVCC is doing much better than before in terms of "broken C++". There was indeed a time when lots of modern C++ just didn't work.

Nowadays the issues are more subtle and nasty. Subtle differences in overload resolution. Subtle differences in lambda handling. Enough to break code in "spicy" ways when you try to port it over.
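A hedged illustration of the overload-resolution class of divergence (this example is constructed, not taken from a specific bug report): clang treats `__host__`/`__device__`-ness as a late tiebreaker in overload resolution, while nvcc's rules differ, so the same call can resolve differently or fail on one compiler.

```cuda
__host__ int f(int)    { return 1; }
__device__ int f(long) { return 2; }

__device__ int g() {
    // Which f? clang deprioritizes the "wrong-side" __host__ overload
    // and picks f(long); nvcc's resolution rules can disagree here.
    return f(0);
}
```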


What do you think the source of this is? My understanding was that Nvidia is basically adopting the clang frontend wholesale now so I'm curious where it differs.


The LLVM manual touches on some of the basics of why: https://llvm.org/docs/CompileCudaWithLLVM.html#dialect-diffe...
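One concrete item from that page, sketched: clang lets device code call plain `constexpr` (host) functions by default, whereas nvcc requires `--expt-relaxed-constexpr` for the same call.

```cuda
constexpr int square(int x) { return x * x; }

__global__ void k(int* out) {
    *out = square(7);  // accepted by clang by default;
                       // nvcc needs --expt-relaxed-constexpr
}
```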



