
That's true for conventional fine-tuning, but is it the case for parameter-efficient fine-tuning and QLoRA? My understanding is that for an N-billion-parameter model, fine-tuning can be done on a GPU with slightly less than N gigabytes of VRAM.
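
Back-of-the-envelope for where that rule of thumb comes from; the byte counts below are my assumptions for a typical 4-bit-base + LoRA setup, not measurements:

  # Rough QLoRA memory estimate (assumed byte counts, not measurements):
  # 4-bit base weights (~0.5 byte/param), bf16 LoRA adapters (~0.1% of
  # params) plus their gradients, and fp32 Adam state for the adapters.

  def qlora_vram_gb(n_params: float, lora_fraction: float = 0.001) -> float:
      base = n_params * 0.5                      # frozen 4-bit base weights
      adapters = n_params * lora_fraction * 2    # trainable LoRA weights (bf16)
      grads = adapters                           # gradients only for the adapters
      adam = n_params * lora_fraction * 8        # fp32 momentum + variance
      return (base + adapters + grads + adam) / 1e9

  print(f"70B: ~{qlora_vram_gb(70e9):.1f} GB before activations")  # ~35.8 GB
  print(f" 7B: ~{qlora_vram_gb(7e9):.1f} GB before activations")   # ~3.6 GB

Activations and quantization constants push the real number higher, but it stays well under the ~2N GB you'd need for 16-bit weights alone.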

For that 70B parameter model: an A100?




2x 40/48GB GPUs would be the cheapest. But that's still a very expensive system, especially if you don't have a beefy workstation with 2x PCIe slots just lying around.


Even mATX boards tend to come with two (full-length) PCIe slots, and that's easy sub-$1k territory. Not exactly a beefy workstation.

Source: have a $200 board in my computer right now with two full-length PCIe slots.


What's more difficult is trying to cool GPUs with 24-48GB of RAM… they all seem to be passively cooled.


Good point; I think most of them are designed for a high-airflow server chassis, with airflow in a direction (parallel to the card) that a desktop case wouldn't necessarily facilitate.


Waterblocks exist for some compute-only GPUs, including the Nvidia A100. There are also a few small vendors in China that offer mounting kits for modding these compute-only GPUs to use off-the-shelf AIO watercoolers. Certainly, not many people are going to take the risk of modifying an expensive Nvidia A100, but these solutions are moderately popular among DIY home lab builders for converting older server cards to home workstation use. Decommissioned Nvidia Tesla P100s or V100s can be purchased cheaply for a few hundred dollars.


> Decommissioned Nvidia Tesla P100s or V100s can be purchased cheaply for a few hundred dollars.

Meh. If you want 16GB of VRAM for a few hundred dollars, can't you just pull a brand-new 30-series off the shelf and have ten times more computing power than those old Pascal cards? You'll even have more VRAM if you go for the 3090 (24GB). Admittedly, the 3090 is closer to $700 or so, but it should still make a P100 very sad in comparison.


Yeah, these GPUs became less appealing after the prices of 30-series GPUs dropped. The prices of SXM cards are still somewhat unbeatable, though, if you have a compatible server motherboard [1]. Nvidia P100s are being sold for as low as $100 each, and there are similar savings on Nvidia V100s. But yeah, a saving of around $100 to $200 is not really worthwhile...

Another curious contender is the decommissioned Nvidia CMP-series GPUs from miners. For example, the Nvidia CMP 170HX basically uses the same Nvidia A100 PCB with its features cut down or disabled (8 GB VRAM, halved shaders, etc.). Interestingly, though, it seems to preserve the full ~1500 GB/s memory bandwidth, making it a potentially interesting card for running memory-bound simulations.
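
To make the memory-bound point concrete, here's a minimal roofline-style sketch; the 1500 GB/s figure is from above, while the peak-FLOPS and kernel numbers are placeholder assumptions rather than CMP 170HX specs:

  # Roofline estimate: attainable throughput ~= min(peak_flops,
  # bandwidth * arithmetic_intensity). Only the bandwidth figure comes
  # from above; the rest are placeholder assumptions.

  bandwidth = 1.5e12     # bytes/s (~1500 GB/s HBM2e)
  peak_flops = 20e12     # FLOP/s, placeholder peak compute

  # Example kernel: a streaming stencil-like update doing 2 FLOPs per
  # element while moving 24 bytes of memory traffic per element.
  intensity = 2 / 24     # FLOP per byte

  attainable = min(peak_flops, bandwidth * intensity)
  bound = "memory" if bandwidth * intensity < peak_flops else "compute"
  print(f"attainable: {attainable / 1e12:.2f} TFLOP/s ({bound}-bound)")

For a kernel like that, almost all of the peak compute goes unused and the bandwidth number is the only spec that matters, which is why the uncut HBM is the interesting part of the card.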

[1] Prices are so low exactly because most people don't have one. SXM-to-PCIe adapters also exist, but they cost $100-$200, nearly as much as you'd save. It should be trivial to reverse-engineer the pinout and make a free and open-source version.


I didn't know the CMP had full bandwidth. That would be an excellent card for smallish networks (like Stable Diffusion, GANs, audio networks).

...But it doesn't seem to be cheap. Not really worth it over a 4090 for the same price.


It seems the CMP 170HX is being sold for $500 +/- $100 on flea markets in China as closed mining farms dump their remaining inventory. Not sure if the prices are real; I'm currently trying to purchase some.


I can confirm that the price is real ;-)


Is it possible to take something like a CMP 170HX and do board-level work to add more memory chips? Or are they not connected to silicon?


I don't believe it's possible. The HBM2e stacks are integrated into the same package as the GPU die, making them impossible to remove or modify non-destructively.


The Quadros/FirePros have blower coolers.


Not with full x16/x16, though I suppose you don't necessarily need that.


Of course, usually the other PCIe slots are something stupid, but there's still a second full-length one, so this could potentially fit two GPUs with the right power supply.


If one is training the full 70B parameters, then the total memory usage far exceeds what's needed to simply store the 70B parameters (think gradients and optimizer state such as momentum). This is the main reason why models are split across devices and why techniques like fully sharded data parallelism (FSDP) are used during training. At every optimizer step of such a distributed run, these multiples of 70B parameters need to go over the network (though thankfully not to all nodes).

As you suggested, LoRA could work well in a distributed setting because the trainable parameters are very few (tens of thousands of times fewer), and the traffic required for the non-trainable parameters is also small. However, training this model on a single A100 is impractical: it would require mimicking distributed training by buffering things in TB-sized CPU RAM (or slower storage) and swapping pieces of the model in and out at every step of an otherwise distributed operation. To the best of my knowledge this isn't natively supported in existing frameworks, even though one could technically write the code without too much difficulty.
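
To put numbers on it, here's a sketch using the common mixed-precision-Adam accounting (the per-parameter byte counts are assumptions following the usual ZeRO-style reckoning, not measurements of any particular framework):

  # Per-parameter state for full fine-tuning with mixed-precision Adam
  # (assumed ZeRO-style accounting): bf16 weights (2) + bf16 grads (2)
  # + fp32 master weights (4) + fp32 momentum (4) + fp32 variance (4).
  FULL_BYTES_PER_PARAM = 16

  n_params = 70e9
  full_gb = n_params * FULL_BYTES_PER_PARAM / 1e9
  print(f"full fine-tune state: ~{full_gb:.0f} GB")   # ~1120 GB before activations

  # LoRA: the frozen base stays in bf16 with no grads or optimizer state;
  # only a tiny adapter (assumed ~0.1% of params) gets the full treatment.
  adapter_params = 0.001 * n_params
  lora_gb = (n_params * 2 + adapter_params * FULL_BYTES_PER_PARAM) / 1e9
  print(f"LoRA state: ~{lora_gb:.0f} GB")             # ~141 GB before activations

That ~1.1 TB of training state is why a single 80 GB A100 can only ever hold a small shard of it, and why shuttling the rest through CPU RAM at every step would be so painful.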


I think you'd need 2x 80GB A100s for unquantised.
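
Rough arithmetic behind that (weights only; activations and KV cache come on top):

  # 70B parameters at 2 bytes each (fp16/bf16), weights only:
  weights_gb = 70e9 * 2 / 1e9
  print(f"~{weights_gb:.0f} GB of weights")  # ~140 GB, i.e. more than one 80GB A100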



