Some popular ML projects ship with hand-optimized CUDA kernels for performance. Unfortunately, that's the standard.
Even if you get around that, projects will sometimes pull in dependencies that include an odd CUDA-only library. I ran into this a lot with GANs (which often use some obscure loss metric with a gratuitous hard CUDA dependency) and media libs.
Of course there are plenty of solutions, and some are transparent in theory. The problem is getting people to use them. Academic ML researchers in particular seem incredibly impatient, looking for the absolute easiest way to prototype something (which usually means coding what they already know). Ease of installation across a wide range of hardware is just about their last priority.
Yeah, I write CUDA kernels for my own projects as well. There won't be a shift if someone launches something that is 10% better, but if someone launches something with twice the performance for the same price, the companies with huge GPU bills will pay to port those kernels pretty fast.
The kernels I wrote, I hand-optimized using Nsight for a specific use case, and sure, they made things something like 200% faster than pure PyTorch and took a few days. But when your company spends $10M on a single training run, you can afford quite a lot of "my days" if another HW platform would save you $5M (though you need the calendar time as well, which is another factor in a space that moves crazily fast).
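To make the fusion point concrete, here is a minimal sketch (not my actual kernel; the fused bias+ReLU example and all the names are made up for illustration): in pure PyTorch, x + bias followed by relu() launches two kernels and streams the tensor through global memory twice, while a fused kernel does it in one pass. That kind of memory-traffic reduction is exactly what Nsight's profiler tends to surface.

    // Illustrative fused bias-add + ReLU kernel. Pure PyTorch would run
    // this as two elementwise kernels, reading and writing the full
    // tensor twice; fusing the ops halves the global-memory traffic.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void fused_bias_relu(const float* __restrict__ x,
                                    const float* __restrict__ bias,
                                    float* __restrict__ out,
                                    int rows, int cols) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < rows * cols) {
            float v = x[i] + bias[i % cols];  // broadcast bias along rows
            out[i] = v > 0.0f ? v : 0.0f;     // ReLU fused into the same pass
        }
    }

    int main() {
        const int rows = 1024, cols = 1024, n = rows * cols;
        float *x, *bias, *out;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&bias, cols * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) x[i] = (float)(i % 7) - 3.0f;
        for (int j = 0; j < cols; ++j) bias[j] = 0.5f;

        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        fused_bias_relu<<<blocks, threads>>>(x, bias, out, rows, cols);
        cudaDeviceSynchronize();

        printf("out[0] = %f (expected relu(-3.0 + 0.5) = 0.0)\n", out[0]);
        cudaFree(x); cudaFree(bias); cudaFree(out);
        return 0;
    }

The real kernels were more involved, obviously, but the pattern is the same: fewer passes over memory, with Nsight's bandwidth numbers telling you whether the fusion actually paid off.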
If/when the field stabilizes a bit, I think people will start looking at that bottom line. But right now everybody just "has" to get those H100 cards before someone else gets them, I guess...