
Seeing those gigantic models, it makes me sad that even the 4090 is supposed to stay at 24GB of VRAM max. I really would like to be able to run and experiment with larger models at home.



It's also a power issue. With the 4090 it sounds like you're going to need a much, MUCH beefier PSU than you currently use... or it'll suddenly turn off as the card pulls 2-3x the power.

You'll need your own wiring to run your PC soon :-)


It is probably a stupid question, but does the power consumption processors need for inference, compared to human brains, demonstrate that something is fundamentally wrong with the AI approach, or is it more physics-related?

I am not a physicist or biologist or anything like that, so my intuition is probably completely wrong, but it seems to me that for basic inference operations (let's say adding two numbers) the power consumption of a processor and of a brain is not that different. Yet given how expensive it is for computers to run inference on any NLP model, humans should be continuously eating carbs just to talk.


Around room temperature, an ideal silicon transistor has a 60 mV/decade subthreshold swing, which (roughly speaking) means that a 10-fold increase in current requires at least a 60 mV increase in gate potential. There are some techniques (e.g. tunneling) that can allow you to get a bit below this, but it's a fairly fundamental limitation of transistors' efficiency.

[It's been quite a while since I studied this stuff, so I can't recall whether 60 mV/decade is a constant for silicon specifically or all semiconductors.]
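
As a quick sanity check, here's where the number comes from, computed with the textbook formula SS = ln(10) * kT/q (the limit follows from Boltzmann statistics, so as far as I know it isn't specific to silicon):

    import math

    # Boltzmann constant over the elementary charge, in volts per kelvin
    k_over_q = 1.380649e-23 / 1.602176634e-19   # ~8.617e-5 V/K

    T = 300  # roughly room temperature, in kelvin

    # Ideal (thermionic) subthreshold swing: SS = ln(10) * kT/q
    ss = math.log(10) * k_over_q * T
    print(f"{ss * 1e3:.1f} mV/decade")          # ~59.5 mV/decade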


> but it seems to me that for more basic inference operations (lets say add two numbers) power consumption from a processor and a brain is not that different

Sure it is - it's hard to see based on just two numbers, so let's multiply that by a billion: how much energy does it take a computer to add two billion numbers? Far less than the energy it would take a human brain to add them.
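
Rough back-of-the-envelope, where the power and throughput figures are loose assumptions rather than measurements:

    # All numbers below are rough assumptions, just for orders of magnitude.
    n = 2_000_000_000              # two billion additions

    cpu_power_w = 65               # assumed desktop CPU package power
    cpu_adds_per_sec = 1e10        # assumed ~10 billion additions per second
    cpu_joules = cpu_power_w * n / cpu_adds_per_sec

    brain_power_w = 20             # commonly cited resting brain power
    human_adds_per_sec = 1         # assumed ~1 conscious addition per second
    brain_joules = brain_power_w * n / human_adds_per_sec

    print(f"CPU:   ~{cpu_joules:.0f} J")       # ~13 J
    print(f"Brain: ~{brain_joules:.0e} J")     # ~4e+10 J (about 63 years of adding)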


The AI is much faster than the brain, and if you batch requests the cost goes down.


I bought a 1500W PSU soon after the previous crypto collapse for around $150 - one of the best purchases I've made.


The RAM is not using all that much of the power, and I think that scales more with bus width than with capacity.


Nvidia deliberately keeps their consumer/gamer cards limited in memory. If you have a use for more RAM, they want you to buy their workstation offerings, like the RTX A6000 with 48GB of GDDR6 or the A100 with 80GB.


What NVIDIA predominantly limits on their consumer cards is RAM sharing, not the RAM itself. The inability of the GPUs to share RAM with each other is the limiting factor. It is why I have RTX A5000 GPUs and not RTX 3090 GPUs.


If you don't care about inference speed being in the 1-5sec range, then that should be doable with CPU offloading, with e.g. DeepSpeed.
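
A minimal sketch of what that looks like with ZeRO stage 3 parameter offload (config keys and values are illustrative and may need adjusting for your DeepSpeed version; the model variable is assumed to be an already-loaded PyTorch module):

    # Sketch only: keep weights that don't fit in VRAM in CPU RAM via
    # DeepSpeed ZeRO stage 3 parameter offload.
    import deepspeed

    ds_config = {
        "zero_optimization": {
            "stage": 3,
            "offload_param": {"device": "cpu", "pin_memory": True},
        },
        "train_micro_batch_size_per_gpu": 1,  # required field, even for inference
        "fp16": {"enabled": True},
    }

    # model is assumed to be an already-constructed PyTorch module
    engine, *_ = deepspeed.initialize(model=model, config=ds_config)
    engine.module.eval()
    # outputs = engine.module.generate(input_ids, max_new_tokens=64)  # assumes a HF-style model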


200+ GiB of RAM still sounds like a pretty steep hardware requirement.


If you have an NVMe drive, DeepSpeed can offload there as a second tier once the RAM is full.

175 GB aggregated across RAM and NVMe is in the realm of a home deep-learning workstation.

As long as you aren't too fussy about inference speed, of course.
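
The relevant knob is the offload_param device; a sketch of the NVMe tier (the path and the exact buffer settings are placeholders, so check your DeepSpeed version's docs):

    # Sketch: same ZeRO stage 3 idea, but spill parameters to NVMe.
    ds_config = {
        "zero_optimization": {
            "stage": 3,
            "offload_param": {
                "device": "nvme",
                "nvme_path": "/mnt/nvme/ds_offload",  # placeholder path
                "pin_memory": True,
            },
        },
        "train_micro_batch_size_per_gpu": 1,
        "fp16": {"enabled": True},
    }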


Oh yeah, that $750 for 256GB of DDR4 is going to totally break the bank.


Damn, I didn't know RAM was so cheap.


It only gets expensive if you insist on sourcing it from enterprise vendors. The first 256GB I paid $2,400 for. The second 256GB I paid $1,200 for a little over a year later. And the third 256GB I paid $800 for about seven months after that. I've got a workstation with 768GB of DDR4 and I'm considering upping that to 1.5TB if prices on the 256GB sticks come down.


For the people that didn't click on the link:

>but is able to work with different configurations with ≈200GB of GPU memory in total which divide weight dimensions correctly (e.g. 16, 64, 128).


Take a look at Apple's M1 Max: a lot of fast unified memory. No idea how useful it is, though.


What's the difference between Apple's unified memory and the shared memory pool Intel and AMD integrated GPUs have had for years?

In theory you could probably already assign a powerful enough iGPU a few hundred gigabytes of memory, but just like Apple Silicon, integrated GPUs aren't exactly very powerful. The difference between the M1 iGPU and the AMD 5700G is less than 10%, and a loaded-out system should theoretically be tweakable to dedicate hundreds of gigabytes of VRAM to it.

It's just a waste of space. An RTX 3090 is 6 to 7 times faster than even the M1, and the promised performance increase of about 35% for the M2 will mean nothing when the 4090 is released this year.

I think there are better solutions for this. For example, the high throughput of PCIe 5 and resizable BAR support might be used to quickly swap banks of GPU memory in and out, at some performance cost.

One big problem with this is that GPU manufacturers have an incentive not to implement ways for consumer GPUs to compete with their datacenter products. If a 3080 with some memory tricks can approach an A800 well enough, Nvidia might let a lot of profit slip through their hands, and they can't have that.

Maybe Apple's tensor chip will be able to provide a performance boost here, but it's stuck working with macOS and the implementations all seem proprietary, so I don't think cross-platform researchers will really care about using it. You're restricted by Apple's memory limitations anyway; it's not like you can upgrade their hardware.


Apple gets significant latency and frequency benefits from placing the LPDDR5 on the SoC package itself.


Unified memory is, and always has been, a cost-cutting tactic. It's not a feature, no matter how much the manufacturers who use it try to claim it is.


Apple is selling M1s with >200GB of RAM? Have a link so I can buy one?


Wondering if Apple Silicon will bring large amounts of unified main memory with high bandwidth to the masses?

The Mac Studio maxes out at 128GB currently for around $5K, so 256GB isn't that far out and might work with the ~200GB Yandex says is required.


Perhaps on quantity. Substantially slower though, around ~3x from what I can tell… a substantial roadblock if you're training models that take weeks.


I meant for inference, not training. People just want to run the magic genies locally and post funny AI content.


ah right - gotcha


Can Apple Silicon's unified memory be an answer?



