"The fastest way to feel the magic is to run the speedrun script speedrun.sh, which trains and inferences the $100 tier of nanochat. On an 8XH100 node at $24/hr, this gives a total run time of about 4 hours."
I am clueless and don't understand this. Where is the $100 being spent? Some sort of API you have to pay to access? Some sort of virtual hardware you have to rent access to?
H100s are expensive NVIDIA GPUs, each costing about $30,000. 8XH100 means you have 8 of those wired together in a big server in a data center somewhere, so around a quarter of a million dollars' worth of hardware in a single box.
You need that much hardware because each H100 provides 80GB of GPU-accessible RAM, but to train this model you need to hold a LOT of model weights and training data in memory at once. 80*8 = 640GB.
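~$24/hour is how much it costs to rent that machine from various providers, so a run of about 4 hours comes to roughly $100. That rental bill is the "$100"; it's rented hardware time, not an API fee.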
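If it helps to see the arithmetic spelled out, here's a quick sketch that uses only the figures quoted in this thread. The variable names are mine, and the ~4 hours, $24/hr, and ~$30,000 per GPU are approximate numbers, so treat the outputs as ballpark values.

```python
# Back-of-the-envelope numbers from the thread (all approximate).
GPUS_PER_NODE = 8              # "8XH100"
PRICE_PER_H100_USD = 30_000    # rough purchase price per GPU
MEMORY_PER_H100_GB = 80        # GPU memory per H100
RENTAL_USD_PER_HOUR = 24       # quoted rental rate for the whole node
RUN_HOURS = 4                  # quoted total run time of speedrun.sh

# Value of the hardware if you bought it outright.
hardware_value_usd = GPUS_PER_NODE * PRICE_PER_H100_USD

# Combined GPU memory across the node.
total_gpu_memory_gb = GPUS_PER_NODE * MEMORY_PER_H100_GB

# What you actually pay: hourly rental times run time.
rental_cost_usd = RENTAL_USD_PER_HOUR * RUN_HOURS

print(f"Hardware value:   ${hardware_value_usd:,}")
print(f"Total GPU memory: {total_gpu_memory_gb} GB")
print(f"Rental cost:      ${rental_cost_usd}")
```

Rental rates vary by provider, so the final bill will move with whatever hourly price you actually get.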