Also, write your own cuda kernel to do vector-matrix multiplication (if you use pycuda, you can focus on the kernel, and write everything else with python). Just tell chatgpt that you want to write your own implementation that multiplies a 4000-element vector by 4000x12000 matrix, and to guide you through the whole process.
For renting gpus, runpods is great - right now they have everything from lower tier gpus to h100s. You can start with a lesser gpu at the beginning.
Never tried those, so I couldn't say. I guess it would.
Even so, creating all the abstractions needed to implement even regular matrix multiplication in Spiral in a generic fashion took me two months, so I'd consider that good enough exercise.
You could do it a lot faster by specializing for specific matrix sizes, like in the Cuda examples repo by Nvidia, but then you'd miss the opportunity to do the tensor magic that I did in the playlist.
NNs for example are (mostly) a sequence of matrix multiplication operations, and GPUs are very good at those. Much better than CPUs. AI is hot at the moment, and Nvidia is producing the kind of hardware that can run large models efficiently which is why it's a 2 trillion-dollar company right now.
However, in the Spiral series, I aim to go beyond just making an ML library for running NN models and break new ground.
Newer GPUs actually support dynamic memory allocation, recursion, and the GPU threads have their own stacks, so you could in fact treat them as sequential devices and write games and simulators directly on them. I think once I finish the NL Holdem game, I'll be able to get over 100x fold improvements by running the whole program on the GPU versus the old approach of writing the sequential part on a CPU and only using the GPU to accelerate a NN model powering the computer agents.
I am not sure if this is a good answer, but this is how GPU programming would be helpful to me. It all comes down to performance.
The problem with programming them is that the program you are trying to speed up needs to be specially structured, so it utilizes the full capacity of the device.