
Strange. I remember trying to get this to work on a 16GB machine, and all of the comments on a GitHub issue mentioning it said it needed 32GB or more.

Edit: this was with llama.cpp though, not Ollama.



Try llamafile (https://github.com/Mozilla-Ocho/llamafile). I have Mistral 7B running this way on a 10-year-old laptop, and it only seems to use a few GB thanks to its memory-mapping approach.


It doesn't count most of the model, since it's memory-mapped; the mapped pages only show up as memory used by the disk cache.

Though if your machine can’t keep it all in memory, then speed will still fall off a cliff.
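
For illustration, here's a minimal C sketch (Linux-flavored, error handling trimmed) of the accounting effect described above; "model.gguf" is just a placeholder path, not anything llamafile requires:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.gguf", O_RDONLY);  /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* File-backed, read-only mapping: nothing is read from disk yet. */
        unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Touch one byte per page: only the touched pages fault in, and
           they land in the kernel's page cache ("buff/cache" in `free`),
           not in the process's anonymous "used" memory. */
        long pagesz = sysconf(_SC_PAGESIZE);
        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i += pagesz)
            sum += p[i];

        printf("mapped %lld bytes, checksum %lu\n", (long long)st.st_size, sum);
        pause();  /* keep the mapping alive; compare `free -h` before/after */
        return 0;
    }

Run it against any multi-GB file and watch `free -h`: "used" barely moves while "buff/cache" grows by roughly the file size, which is exactly the accounting difference described above.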


If it's lazily loading just what it needs, that seems like an efficient use of memory. In any case, this 4GB model will easily fit into the commenter's 16GB machine.
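
That lazy loading is just the kernel's demand paging, and you can watch it happen with mincore(2). A rough sketch (Linux; "model.gguf" again a placeholder, checks omitted):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.gguf", O_RDONLY);   /* placeholder path */
        struct stat st;
        fstat(fd, &st);
        unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

        long pagesz = sysconf(_SC_PAGESIZE);
        size_t npages = (st.st_size + pagesz - 1) / pagesz;
        unsigned char *vec = malloc(npages);     /* one status byte per page */

        /* Right after mmap(), almost nothing is resident; pages only
           fault in as they are actually read. */
        mincore(p, st.st_size, vec);
        size_t resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;

        printf("%zu of %zu pages resident\n", resident, npages);
        return 0;
    }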


If you're running on a GPU then it would need to be wired, and wired file-backed pages do count as process memory and have to physically fit in DRAM.
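
In C terms, the wired case looks like the sketch below: mlock(2) pins file-backed pages into physical RAM, at which point they do count against the process and must fit. (llama.cpp exposes this as its --mlock option; "model.gguf" is a placeholder.)

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.gguf", O_RDONLY);   /* placeholder path */
        struct stat st;
        fstat(fd, &st);
        unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

        /* Fault in and pin every page. Unlike a plain mmap, this fails
           (ENOMEM) if the file can't fit in physical RAM, or if it
           exceeds RLIMIT_MEMLOCK. */
        if (mlock(p, st.st_size) != 0) {
            perror("mlock");
            return 1;
        }
        printf("pinned %lld bytes in RAM\n", (long long)st.st_size);
        return 0;
    }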


Wow, that's incredible, and legit too. I was reading through issues on llama.cpp about implementing memory swapping, so I didn't think it had been done.

Thanks!


It's really just a difference in accounting. Memory used for memory-mapped files isn't shown under the "used" header but under the disk cache one. And it doesn't need to be swapped out to be discarded, so if you lack the memory, everything just slows down without an obvious cause.
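
The "discarded without swapping" part is easy to see with madvise(2); another rough Linux sketch ("model.gguf" is a placeholder):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.gguf", O_RDONLY);   /* placeholder path */
        struct stat st;
        fstat(fd, &st);
        unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

        volatile unsigned char x = p[0];          /* fault the first page in */

        /* Clean file-backed pages can be dropped with no swap I/O at all,
           since the bytes already live on disk; the kernel does the same
           thing on its own under memory pressure. */
        madvise(p, st.st_size, MADV_DONTNEED);

        x = p[0];   /* silently re-faults from the page cache or from disk */
        (void)x;
        printf("dropped and re-faulted without touching swap\n");
        return 0;
    }

That silent re-faulting is why the slowdown has no obvious cause: there's no swap activity to see, just quiet re-reads from disk.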


I've got a ~3B-parameter model, fine-tuned to be better than 13B models, running on an 8GB machine; here's how I did it: https://christiaanse.ca/posts/running_llm/


That's not memory swapping but something else, right? I ask because it looks like the new Mistral model, but slightly different.


It's a fine-tuned Mistral model based on Microsoft's Orca model.



