Pardon me if this is a dumb question, but is it possible for me to download these models onto my computer (I have a 1080ti and a [2|3]070ti) and expose some sort of API interface? That way I can write programs that call this API, which I find appealing.
EDIT: This is a 1W light bulb moment for me, thank you!
Justine Tunney (of redbean fame) is actively working on getting LLMs to run well on CPUs, where RAM is cheap. If successful this would eliminate an enormous bottleneck to running local models. If anyone can do this, she can. (And thank you to Mozilla for financially supporting her work). See https://justine.lol/matmul/ and https://github.com/mozilla-Ocho/llamafile
I think it's mostly the memory bandwidth though that makes the GPUs so fast with LLMs. My card does about 1TB/s. CPU RAM won't come near that. I'm sure a lot of optimisations can be had but I think GPUs will still be significantly ahead.
Macs are so good at it because Apple solders the memory onto the SoC package for a really wide, low-latency connection.
This is a good and valid comment. It is difficult to predict the future, but I would be curious what the best-case theoretical performance of an LLM would be on a typical x86 or ARM system with DDR4 or DDR5 RAM. My uneducated guess is that it can be very good, perhaps 50% of the speed of a specialized GPU/RAM device. In practical terms, the CPU approach is what you need for very large contexts, potentially as large as the lifetime of all interactions you have with your LLM.
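For a rough sense of the gap: token generation is largely memory-bandwidth-bound, since each generated token has to stream essentially the whole set of weights once. A back-of-envelope sketch (illustrative numbers of my own, not measurements, and ignoring KV-cache traffic and compute limits):

```python
# Back-of-envelope: decode speed when bound purely by memory bandwidth.
# Assumes every generated token streams all weights once; real systems
# add KV-cache traffic, cache effects, and compute overhead.

def tokens_per_second(params_billions: float, bits_per_weight: float,
                      bandwidth_gb_s: float) -> float:
    model_bytes = params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes

# A 12B model quantized to ~4 bits per weight (all figures approximate):
for name, bw in [("high-end GPU, ~1000 GB/s", 1000),
                 ("Apple SoC, ~400 GB/s", 400),
                 ("dual-channel DDR5, ~80 GB/s", 80)]:
    print(f"{name}: ~{tokens_per_second(12, 4, bw):.0f} tokens/s upper bound")
```

By that crude measure, ordinary DDR5 lands closer to 10% of a 1 TB/s GPU than 50%, which is why the bandwidth point above matters so much.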
There's no good reason for consumer Nvidia cards to lack SODIMM-like slots for video RAM, except to rake in big bucks and hasten planned obsolescence.
DIMM slots won't work for GPU VRAM due to the higher speeds, tighter signalling, and dense packing of memory on wide buses. Take a look at the speeds DDR5 is running at in a typical Xeon server, and compare to GDDR6. This is the problem LPCAMM2 was developed to solve for modern x86 CPUs in laptops and desktops. Seeing it applied to GPUs would be great.
I’d probably check https://ollama.com/library?q=Nemo in a couple of days. My guess is that by then ollama will have support for it. And you can then run the model locally on your machine with ollama.
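If it does land there, a minimal sketch of calling the local Ollama server from a program might look like this (assuming Ollama's default port 11434 and a `mistral-nemo` tag, which is a guess; see the comment below about naming):

```python
import requests

# Query a model served by a local Ollama instance. Assumes Ollama is
# running on its default port (11434) and the model has already been
# pulled, e.g. `ollama pull mistral-nemo` (the final tag may differ).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral-nemo",   # adjust to whatever tag gets published
        "prompt": "Summarize what Mistral NeMo is in one sentence.",
        "stream": False,           # return one JSON object, not a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```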
We're working on it, but there is a change to the tokenizer that we're still working through in our conversion scripts. Unfortunately we don't get a heads up from Mistral when they drop a model, so sometimes it takes a little bit of time to sort out the differences.
Also, I'm not sure if we'll call it mistral-nemo or nemo yet. :-D
Adding to this: if the default is too slow, look at the more heavily quantized versions of the model; they are smaller at a moderate cost in output quality. Ollama can split models between GPU and host memory, but the throughput drop-off tends to be pretty severe.
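As a rough illustration of the size difference (my own approximation using typical effective bits per weight; real GGUF files vary a bit because some tensors stay in higher precision):

```python
# Approximate memory footprint of a 12B-parameter model at different
# quantization levels. Effective bits-per-weight values are rough.

def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"12B @ {label}: ~{approx_size_gb(12, bits):.1f} GB")
```

Whether a given quant fits entirely in your VRAM is usually the deciding factor for speed.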
Ollama depends on llama.cpp as its backend, so if any changes are needed to support something new in this model's architecture or tokenizer, they will need to be added there first.
Then the model needs to be properly quantized and converted to GGUF (the model format that llama.cpp uses), tested, and uploaded to the model registry.
So there's some length to the pipeline that things need to go through, but overall the devs in both projects generally have things running pretty smoothly, and I'm regularly impressed at how quickly both projects get updated to support such things.
It's a UI that lets you load and interact with models locally. You can also wrap your model in an OpenAI-compatible API and interact with it programmatically.
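A hedged sketch of the programmatic route: once the local server exposes its OpenAI-compatible endpoint, you can point the standard OpenAI client at it (the base URL, port, and model name below are placeholders; adjust them to whatever your local tool actually serves).

```python
from openai import OpenAI

# Point the standard OpenAI client at a local OpenAI-compatible server.
# The API key is ignored by local servers but the client requires a value.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="mistral-nemo",  # whatever model name your local server exposes
    messages=[{"role": "user", "content": "Write a haiku about local LLMs."}],
)
print(reply.choices[0].message.content)
```

This is what makes the "write programs that call this API" idea from the top of the thread straightforward: your code talks to the same interface whether the model is local or hosted.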