Hacker News

The sweet spot for running local LLMs (from what I'm seeing on forums like r/LocalLLaMA) is 2 to 4 3090s, each with 24GB of VRAM. Nvidia (or AMD or Intel) would clean up if they offered a card with 3090-level performance but 64GB of VRAM. It doesn't have to be a leading-edge GPU, just a decent one with lots of VRAM. This is roughly what Digits will be (though the memory bandwidth will be lower because it uses LPDDR5X rather than GDDR) and roughly what AMD's Strix Halo is aiming for: unified-memory systems where the CPU and GPU share access to the same large pool of memory.
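To see why 2x 24GB cards are the sweet spot, a rough rule of thumb is that model weights take about params × bits/8 bytes. This sketch (my own back-of-envelope, ignoring KV cache and activation overhead) shows why a 70B model needs quantization to fit on a pair of 3090s:

```python
# Back-of-envelope VRAM estimate for model weights only.
# Assumption: memory ~= params * bits_per_weight / 8,
# ignoring KV cache, activations, and framework overhead.

def weight_gb(params_billions, bits_per_weight):
    """Approximate GB needed to hold the weights alone."""
    return params_billions * bits_per_weight / 8

# A LLaMA-class 70B model:
print(weight_gb(70, 16))  # fp16: 140 GB -- far beyond consumer cards
print(weight_gb(70, 4))   # 4-bit quant: 35 GB -- fits across 2x 24GB 3090s
```

The same arithmetic shows why a hypothetical 64GB card would be popular: it would hold a 4-bit 70B model, plus KV cache, on a single GPU.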


The issue here is that even with a lot of VRAM you may be able to load the model, but with a large context it will still be too slow: processing a 30k+ token prompt with LLaMA 70B, for example, takes minutes.
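That "minutes" figure is consistent with a common transformer rule of thumb: prefill needs roughly 2 × params × tokens FLOPs. A hedged estimate (the 35 TFLOPS FP16 figure for a 3090 and the 50% utilization are my assumptions, and real throughput varies widely with batch, kernels, and memory bandwidth):

```python
# Back-of-envelope prompt-processing (prefill) time estimate.
# Assumption: prefill compute ~= 2 * n_params * n_tokens FLOPs,
# a common rule of thumb; actual utilization varies widely.

def prefill_seconds(n_params, n_tokens, tflops, utilization=0.5):
    """Estimated seconds to process an n_tokens prompt."""
    flops_needed = 2 * n_params * n_tokens
    effective_flops = tflops * 1e12 * utilization
    return flops_needed / effective_flops

# 70B model, 30k-token prompt, two ~35 TFLOPS (FP16) 3090s at 50% utilization:
t = prefill_seconds(70e9, 30_000, 2 * 35, 0.5)
print(f"{t / 60:.1f} minutes")  # on the order of minutes
```

This is also why unified-memory boxes like Digits can be deceptive: fitting the model in memory solves capacity, but prefill is compute- and bandwidth-bound, so a big slow pool of RAM doesn't make long contexts fast.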




