Pardon me if this is a dumb question, but is it possible for me to download these models onto my computer (I have a 1080ti and a [2|3]070ti) and expose some sort of API interface? That way I can write programs that call this API, which I find appealing.
EDIT: This is a 1W light bulb moment for me, thank you!
Justine Tunney (of redbean fame) is actively working on getting LLMs to run well on CPUs, where RAM is cheap. If successful this would eliminate an enormous bottleneck to running local models. If anyone can do this, she can. (And thank you to Mozilla for financially supporting her work). See https://justine.lol/matmul/ and https://github.com/mozilla-Ocho/llamafile
I think it's mostly the memory bandwidth though that makes the GPUs so fast with LLMs. My card does about 1TB/s. CPU RAM won't come near that. I'm sure a lot of optimisations can be had but I think GPUs will still be significantly ahead.
Macs are so good at it because Apple solders the memory onto the SoC package for a really wide, low-latency connection.
This is a good and valid comment. It is difficult to predict the future, but I would be curious what the best-case theoretical performance of an LLM would be on a typical x86 or ARM system with DDR4 or DDR5 RAM. My uneducated guess is that it can be very good, perhaps 50% of the speed of a specialized GPU/RAM device. In practical terms, the CPU approach is what you need for very large contexts, potentially as large as the lifetime of all interactions you have with your LLM.
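For a rough sense of the gap: token generation is largely memory-bandwidth-bound, since each generated token has to stream essentially the whole set of weights once. A back-of-envelope sketch (illustrative numbers of my own, not measurements, and ignoring KV-cache traffic and compute limits):

```python
# Back-of-envelope: decode speed when bound purely by memory bandwidth.
# Assumes every generated token streams all weights once; real systems
# add KV-cache traffic, cache effects, and compute overhead.

def tokens_per_second(params_billions: float, bits_per_weight: float,
                      bandwidth_gb_s: float) -> float:
    model_bytes = params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes

# A 12B model quantized to ~4 bits per weight (all figures approximate):
for name, bw in [("high-end GPU, ~1000 GB/s", 1000),
                 ("Apple SoC, ~400 GB/s", 400),
                 ("dual-channel DDR5, ~80 GB/s", 80)]:
    print(f"{name}: ~{tokens_per_second(12, 4, bw):.0f} tokens/s upper bound")
```

By that crude measure, ordinary DDR5 lands closer to 10% of a 1 TB/s GPU than 50%, which is why the bandwidth point above matters so much.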
There's no good reason for consumer Nvidia cards to lack SODIMM-like slots for video RAM, except to rake in big bucks and hasten planned obsolescence.
DIMM slots won't work for GPU VRAM due to the higher speeds, tighter signalling, and dense packing of memory on wide buses. Take a look at the speeds DDR5 is running at in a typical Xeon server, and compare to GDDR6. This is the problem LPCAMM2 was developed to solve for modern x86 CPUs in laptops and desktops. Seeing it applied to GPUs would be great.
I’d probably check https://ollama.com/library?q=Nemo in a couple of days. My guess is that by then ollama will have support for it. And you can then run the model locally on your machine with ollama.
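If it does land there, a minimal sketch of calling the local Ollama server from a program might look like this (assuming Ollama's default port 11434 and a `mistral-nemo` tag, which is a guess; see the comment below about naming):

```python
import requests

# Query a model served by a local Ollama instance. Assumes Ollama is
# running on its default port (11434) and the model has already been
# pulled, e.g. `ollama pull mistral-nemo` (the final tag may differ).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral-nemo",   # adjust to whatever tag gets published
        "prompt": "Summarize what Mistral NeMo is in one sentence.",
        "stream": False,           # return one JSON object, not a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```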
We're working on it, but there is a change to the tokenizer that we're still working through in our conversion scripts. Unfortunately we don't get a heads up from Mistral when they drop a model, so sometimes it takes a little bit of time to sort out the differences.
Also, I'm not sure if we'll call it mistral-nemo or nemo yet. :-D
Adding to this: if the default is too slow, look at the more heavily quantized versions of the model; they are smaller at a moderate cost in output quality. Ollama can split models between GPU and host memory, but the throughput drop-off tends to be pretty severe.
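As a rough illustration of the size difference (my own approximation using typical effective bits per weight; real GGUF files vary a bit because some tensors stay in higher precision):

```python
# Approximate memory footprint of a 12B-parameter model at different
# quantization levels. Effective bits-per-weight values are rough.

def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"12B @ {label}: ~{approx_size_gb(12, bits):.1f} GB")
```

Whether a given quant fits entirely in your VRAM is usually the deciding factor for speed.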
Ollama depends on llama.cpp as its backend, so if any changes are needed to support something new in this model's architecture or tokenizer, they will need to be added there first.
Then the model needs to be properly quantized and converted to GGUF (the model format that llama.cpp uses), tested, and uploaded to the model registry.
So there's some length to the pipeline that things need to go through, but overall the devs in both projects generally have things running pretty smoothly, and I'm regularly impressed at how quickly both projects get updated to support such things.
It's a UI that lets you load and interact with models locally. You can also wrap your model in an OpenAI-compatible API and interact with it programmatically.
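A hedged sketch of the programmatic route: once the local server exposes its OpenAI-compatible endpoint, you can point the standard OpenAI client at it (the base URL, port, and model name below are placeholders; adjust them to whatever your local tool actually serves).

```python
from openai import OpenAI

# Point the standard OpenAI client at a local OpenAI-compatible server.
# The API key is ignored by local servers but the client requires a value.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="mistral-nemo",  # whatever model name your local server exposes
    messages=[{"role": "user", "content": "Write a haiku about local LLMs."}],
)
print(reply.choices[0].message.content)
```

This is what makes the "write programs that call this API" idea from the top of the thread straightforward: your code talks to the same interface whether the model is local or hosted.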