Hacker News

Very unfriendly.

The symbolic library (the per-edge choice of activation function) requires branching at the very core of the kernel. The GPU will have to serialize these operations warp-wise: divergent lanes within a warp execute each branch path one at a time.

To optimize, you might want to do a scan operation beforehand and dispatch to the activation functions in a warp-specialized way. This, however, makes the global memory reads/writes non-coalesced.

You could then sort the input by activation type and store it in that order. This makes the gmem IO coalesced, but requires a gather and a scatter as pre- and post-processing.
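The sort/gather/scatter idea above can be sketched on the CPU in plain Python (the `ACTIVATIONS` table and function names are made up for illustration; on a GPU each contiguous bucket would map to coalesced memory and a branch-free specialized kernel):

```python
import math
from collections import defaultdict

# Hypothetical symbolic-activation table: index = activation type id.
ACTIVATIONS = [math.sin, math.exp, lambda x: x * x]

def apply_sorted(xs, types):
    """Group inputs by activation type, apply each type over its own
    contiguous bucket (coalesced reads on a GPU), then scatter results
    back to their original positions."""
    # Gather: indices grouped by activation type.
    buckets = defaultdict(list)
    for i, t in enumerate(types):
        buckets[t].append(i)
    out = [0.0] * len(xs)
    for t, idxs in buckets.items():
        f = ACTIVATIONS[t]
        for i in idxs:           # one branch-free pass per type
            out[i] = f(xs[i])    # scatter: write back to original slot
    return out
```

The gather/scatter cost is the price paid for removing both the warp divergence and the non-coalesced IO of the dispatch-in-place variant.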




Wouldn't it be faster to calculate every function type and then just multiply them by 0s or 1s to keep the active ones?


That's pretty much how branching on GPUs already works: every lane in the warp steps through each taken path, and the lanes that didn't take it are simply masked off.
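A minimal sketch of that "compute everything, mask the result" (predication) idea in plain Python, with a made-up `ACTIVATIONS` table for illustration:

```python
import math

# Hypothetical symbolic-activation table: index = activation type id.
ACTIVATIONS = [math.sin, math.exp, lambda x: x * x]

def apply_predicated(x, type_id):
    """Evaluate every activation and keep the wanted one via a 0/1 mask.
    This mirrors what a diverged warp effectively does: all branch
    bodies are executed, and inactive lanes contribute nothing."""
    return sum(f(x) * (1.0 if t == type_id else 0.0)
               for t, f in enumerate(ACTIVATIONS))
```

The total work is the same as letting the warp diverge over all branches, which is why the explicit mask trick doesn't buy anything by itself.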


Couldn't you implement these as a texture lookup, where x is the input and the various functions are stacked in y? That should be quite fast on GPUs.
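One way to picture the texture approach, sketched in plain Python: precompute a 2D table with one row per function and linearly interpolate along x, roughly what hardware texture filtering would do for free (the `ACTIVATIONS` table, sample count, and range here are arbitrary assumptions):

```python
import math

# Build a lookup "texture": rows = activation functions, cols = samples of x.
ACTIVATIONS = [math.sin, math.exp, lambda x: x * x]
X_MIN, X_MAX, N = -4.0, 4.0, 1024
STEP = (X_MAX - X_MIN) / (N - 1)
TABLE = [[f(X_MIN + STEP * i) for i in range(N)] for f in ACTIVATIONS]

def tex_lookup(x, type_id):
    """Fetch f(x) from row `type_id` with linear interpolation along x,
    emulating a 1D filtered texture fetch per function."""
    u = (x - X_MIN) / STEP
    i = min(max(int(u), 0), N - 2)
    frac = u - i
    row = TABLE[type_id]
    return row[i] * (1.0 - frac) + row[i + 1] * frac
```

The lookup itself is branch-free and uniform across the warp, at the cost of interpolation error and a bounded input range.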





