Some popular ML projects ship with hand-optimized CUDA kernels for performance. Unfortunately, that's the standard.
Even if you get around that, projects will sometimes pull in dependencies that include an odd CUDA-only library. I ran into this a lot with GANs (which often use some obscure loss metric with a gratuitous hard CUDA dependency) and media libs.
Of course there are plenty of solutions, and some are transparent in theory. The problem is getting people to use them. Academic ML researchers in particular seem incredibly impatient, looking for the absolute easiest way to prototype something (which usually means coding what they already know). Ease of installation across a wide range of hardware is just about their last priority.
Yeah, I write CUDA kernels for my own projects as well. There won't be a shift if someone launches something that is 10% better, but if someone launches something with twice the performance for the same price, the companies with huge GPU bills will pay to port those kernels pretty fast.
The kernels I wrote, I hand-optimized using Nsight for a specific use case, and sure, they made things something like 200% faster than pure PyTorch and took a few days. But when your company spends $10M on a single training run, you can afford quite a lot of "my days" if another HW platform would save you $5M (though you need the calendar time as well, which is another factor in a space that moves crazily fast).
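To make the fusion point concrete, here is a minimal sketch (not my actual kernel; the fused bias+ReLU example and all the names are made up for illustration): in pure PyTorch, x + bias followed by relu() launches two kernels and streams the tensor through global memory twice, while a fused kernel does it in one pass. That kind of memory-traffic reduction is exactly what Nsight's profiler tends to surface.

    // Illustrative fused bias-add + ReLU kernel. Pure PyTorch would run
    // this as two elementwise kernels, reading and writing the full
    // tensor twice; fusing the ops halves the global-memory traffic.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void fused_bias_relu(const float* __restrict__ x,
                                    const float* __restrict__ bias,
                                    float* __restrict__ out,
                                    int rows, int cols) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < rows * cols) {
            float v = x[i] + bias[i % cols];  // broadcast bias along rows
            out[i] = v > 0.0f ? v : 0.0f;     // ReLU fused into the same pass
        }
    }

    int main() {
        const int rows = 1024, cols = 1024, n = rows * cols;
        float *x, *bias, *out;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&bias, cols * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) x[i] = (float)(i % 7) - 3.0f;
        for (int j = 0; j < cols; ++j) bias[j] = 0.5f;

        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        fused_bias_relu<<<blocks, threads>>>(x, bias, out, rows, cols);
        cudaDeviceSynchronize();

        printf("out[0] = %f (expected relu(-3.0 + 0.5) = 0.0)\n", out[0]);
        cudaFree(x); cudaFree(bias); cudaFree(out);
        return 0;
    }

The real kernels were more involved, obviously, but the pattern is the same: fewer passes over memory, with Nsight's bandwidth numbers telling you whether the fusion actually paid off.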
If/when the field stabilizes a bit, I think people will start looking at that bottom line. But right now everybody just "has" to get those H100 cards before someone else gets them, I guess...