I gave an example of why people wouldn't be good at it: pipelined asynchronous memory copies. Take a look at the documentation linked below. It's just plain difficult to do something as basic as moving data into shared memory efficiently. Others have given far more detailed responses.
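To give a sense of the bookkeeping involved, here is a rough sketch of a two-stage pipelined copy into shared memory, loosely following the multi-stage pattern in the CUDA programming guide. The kernel name `scale_staged`, the helper `run_demo`, the constant `kStages`, and the launch sizes are all made up for illustration, and it assumes a GPU and toolkit recent enough to support `<cuda/pipeline>`. Treat it as a sketch, not a tuned implementation.

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>
#include <cstdio>
#include <vector>

namespace cg = cooperative_groups;

constexpr int kStages = 2;  // hypothetical choice: double buffering

// Each block processes `batches` tiles of blockDim.x ints and doubles every value.
// Copies for later tiles are issued while earlier tiles are being consumed.
__global__ void scale_staged(int* out, const int* in, size_t batches) {
    auto block = cg::this_thread_block();
    extern __shared__ int smem[];  // kStages * blockDim.x ints, sized at launch

    __shared__ cuda::pipeline_shared_state<cuda::thread_scope::thread_scope_block, kStages> state;
    auto pipe = cuda::make_pipeline(block, &state);

    const size_t tile_elems = blockDim.x;
    const size_t grid_elems = size_t(gridDim.x) * blockDim.x;

    for (size_t compute = 0, fetch = 0; compute < batches; ++compute) {
        // Keep up to kStages copies in flight ahead of the compute step.
        for (; fetch < batches && fetch < compute + kStages; ++fetch) {
            const size_t src = fetch * grid_elems + blockIdx.x * tile_elems;
            pipe.producer_acquire();
            cuda::memcpy_async(block, smem + (fetch % kStages) * tile_elems,
                               in + src, sizeof(int) * tile_elems, pipe);
            pipe.producer_commit();
        }
        pipe.consumer_wait();  // the oldest committed tile is now in shared memory
        const int* stage = smem + (compute % kStages) * tile_elems;
        const size_t dst = compute * grid_elems + blockIdx.x * tile_elems;
        out[dst + threadIdx.x] = 2 * stage[threadIdx.x];
        pipe.consumer_release();  // free the stage for the next fetch
    }
}

// Host-side driver: returns true if every output equals twice its input.
bool run_demo() {
    const int threads = 128, blocks = 4;  // illustrative launch sizes
    const size_t batches = 8;
    const size_t n = batches * threads * blocks;

    std::vector<int> host(n);
    for (size_t i = 0; i < n; ++i) host[i] = int(i);

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    scale_staged<<<blocks, threads, kStages * threads * sizeof(int)>>>(d_out, d_in, batches);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(host.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (size_t i = 0; ok && i < n; ++i) ok = (host[i] == 2 * int(i));

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main() {
    const bool ok = run_demo();
    std::printf(ok ? "ok\n" : "mismatch\n");
    return ok ? 0 : 1;
}
```

Even this toy version has to juggle stage indices, acquire/commit/wait/release ordering, and shared memory sizing at launch, and none of that says anything yet about alignment, bank conflicts, or picking a stage count that actually hides the latency.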
You probably won't like this, but I'm also going to suggest you take a look at the HN guidelines about assuming good faith and about responding to the argument instead of calling names. My comment might have irked you, but that's not actually a basis for deciding I'm anti-intellectual, that I'm protecting my ego, and that I really just need someone to help me learn.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....