
Pardon me if this is a dumb question, but is it possible for me to download these models onto my computer (I have a 1080ti and a [2|3]070ti) and expose some sort of API interface? That way I can write programs that call this API, which I find appealing.

EDIT: This is a 1W light bulb moment for me, thank you!



Justine Tunney (of redbean fame) is actively working on getting LLMs to run well on CPUs, where RAM is cheap. If successful this would eliminate an enormous bottleneck to running local models. If anyone can do this, she can. (And thank you to Mozilla for financially supporting her work). See https://justine.lol/matmul/ and https://github.com/mozilla-Ocho/llamafile


I think it's mostly the memory bandwidth though that makes the GPUs so fast with LLMs. My card does about 1TB/s. CPU RAM won't come near that. I'm sure a lot of optimisations can be had but I think GPUs will still be significantly ahead.

Macs are so good at it because Apple solders the memory onto the SoC package for a really wide, low-latency connection.


This is a good and valid comment. It is difficult to predict the future, but I would be curious what the best-case theoretical performance of an LLM would be on a typical x86 or ARM system with DDR4 or DDR5 RAM. My uneducated guess is that it could be very good, perhaps 50% of the speed of a specialized GPU/RAM device. In practical terms, the CPU approach is required for very large contexts, potentially as large as the lifetime of all the interactions you have with your LLM.
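
A rough back-of-the-envelope way to think about it, assuming single-stream generation is memory-bandwidth-bound (each generated token streams roughly the whole set of quantized weights once); the bandwidth and model-size numbers below are just illustrative:

    # Upper-bound estimate: tokens/sec ~ memory bandwidth / bytes read per token,
    # where one generated token costs roughly one full pass over the weights.
    def tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
        return bandwidth_gb_s / model_size_gb

    model_q4_gb = 7.0  # e.g. a ~12B model quantized to ~4.5 bits per weight

    print(tokens_per_sec(1000, model_q4_gb))  # ~1 TB/s GPU           -> ~140 tok/s
    print(tokens_per_sec(400, model_q4_gb))   # Apple M-series Max    -> ~57 tok/s
    print(tokens_per_sec(90, model_q4_gb))    # dual-channel DDR5 CPU -> ~13 tok/s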


There's no good reason for consumer nvidia cards to lack SODIMM-like slots for video RAM, except to rake in big bucks and induce more hasty planned obsolescence.


DIMM slots won't work for GPU VRAM due to the higher speeds, tighter signalling, and dense packing of memory on wide buses. Take a look at the speeds DDR5 is running at in a typical Xeon server, and compare to GDDR6. This is the problem LPCAMM2 was developed to solve for modern x86 CPUs in laptops and desktops. Seeing it applied to GPUs would be great.


I love that domain name.


AFAIK, Ollama supports most of these models locally and will expose a REST API[0]

[0]: https://github.com/ollama/ollama/blob/main/docs/api.md
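
To the original question about an API: here's a minimal sketch of calling that REST API from Python, assuming Ollama is running on its default port (11434) and a model has already been pulled (the model name is just an example):

    import requests

    # Non-streaming generate request against a local Ollama server.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",                # any model you've pulled locally
            "prompt": "Why is the sky blue?",
            "stream": False,                  # return a single JSON object
        },
        timeout=300,
    )
    resp.raise_for_status()
    print(resp.json()["response"])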


I’d probably check https://ollama.com/library?q=Nemo in a couple of days. My guess is that by then ollama will have support for it. And you can then run the model locally on your machine with ollama.


We're working on it, except that there is a change to the tokenizer which we're still working through in our conversion scripts. Unfortunately we don't get a heads up from Mistral when they drop a model, so sometimes it takes a little bit of time to sort out the differences.

Also, I'm not sure if we'll call it mistral-nemo or nemo yet. :-D


Adding to this: if the default is too slow, look at the more heavily quantized versions of the model; they are smaller, at a moderate cost in output quality. Ollama can split models between GPU and host memory, but the throughput drop-off tends to be pretty severe.
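
Rough rule of thumb for how much the quantized variants shrink the footprint (the bits-per-weight figures are approximate, and you still need extra room for the KV cache and runtime buffers):

    # Approximate weight-memory footprint at different quantization levels.
    def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for name, bpw in [("fp16", 16.0), ("q8_0", 8.5), ("q5_K_M", 5.7), ("q4_K_M", 4.8)]:
        print(f"12B @ {name:6s} ~ {model_size_gb(12, bpw):5.1f} GB")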


Why would it take a couple days? Is it not a matter of uploading the model to their registry, or are there more steps involved than that?


Ollama depends on llama.cpp as its backend, so if there are any changes that need to be made to support anything new in this model architecture or tokenizer, then it will need to be added there first.

Then the model needs to be properly quantized and formatted for GGUF (the model format that llama.cpp uses), tested, and uploaded to the model registry.

So there's some length to the pipeline that things need to go through, but overall the devs in both projects generally have things running pretty smoothly, and I'm regularly impressed at how quickly both projects get updated to support such things.
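
For anyone curious what that pipeline looks like in practice, here's a rough sketch driven from Python. The paths and model names are placeholders, and the script/binary names (convert_hf_to_gguf.py, llama-quantize) match recent llama.cpp checkouts but have been renamed in the past, so check your local tree:

    import subprocess

    HF_MODEL_DIR = "models/Mistral-Nemo-Instruct"    # downloaded Hugging Face weights
    F16_GGUF = "models/mistral-nemo-f16.gguf"
    Q4_GGUF = "models/mistral-nemo-Q4_K_M.gguf"

    # 1. Convert the Hugging Face checkpoint to a full-precision GGUF file.
    subprocess.run(
        ["python", "convert_hf_to_gguf.py", HF_MODEL_DIR, "--outfile", F16_GGUF],
        check=True,
    )

    # 2. Quantize it down to something that fits in consumer VRAM.
    subprocess.run(
        ["./llama-quantize", F16_GGUF, Q4_GGUF, "Q4_K_M"],
        check=True,
    )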


Issue to track Mistral NeMo support in llama.cpp: https://github.com/ggerganov/llama.cpp/issues/8577


> I'm regularly impressed at how quickly both projects get updated to support such things.

Same! Big kudos to all involved


You will need enough VRAM; a 1080 Ti is not going to work very well. Maybe get a 3090 with 24GB of VRAM.

I think it should also run well on a 36GB MacBook Pro, or probably a 24GB MacBook Air.


First thing I did when I saw the headline was to look for it on Ollama, but it hasn't landed there yet: https://ollama.com/library?sort=newest&q=NeMo


We're working on it!


I'd love to read about what it means to add a model on your end. Do you have a blog post or a TL;DR list somewhere?


Yes.

If you're on a Mac, check out LM Studio.

It's a UI that lets you load and interact with models locally. You can also wrap your model in an OpenAI-compatible API and interact with it programmatically.
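
A minimal sketch of the programmatic side, assuming LM Studio's local server is running on its default port (1234) and a model is already loaded; the model name string is a placeholder:

    from openai import OpenAI

    # LM Studio exposes an OpenAI-compatible endpoint; the API key can be any string.
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    completion = client.chat.completions.create(
        model="local-model",  # routed to whatever model is currently loaded
        messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
    )
    print(completion.choices[0].message.content)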


llama.cpp supports multi-GPU inference across the local network: https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacp...

It also exposes an OpenAI-compatible server, or you can use its Python bindings.
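
A sketch using the llama-cpp-python bindings (pip install llama-cpp-python); the GGUF path is a placeholder, and the same package can also serve an OpenAI-compatible endpoint via python -m llama_cpp.server:

    from llama_cpp import Llama

    llm = Llama(
        model_path="models/mistral-nemo-Q4_K_M.gguf",  # placeholder path
        n_ctx=8192,        # context window
        n_gpu_layers=-1,   # offload as many layers as fit onto the GPU
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Hello! Which model are you?"}],
        max_tokens=128,
    )
    print(out["choices"][0]["message"]["content"])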


Try LM Studio or Ollama. Load up the model, and there you go.


llama.cpp and Ollama both have APIs for most models.



