
I’d probably check https://ollama.com/library?q=Nemo in a couple of days. My guess is that by then ollama will have support for it. And you can then run the model locally on your machine with ollama.
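Once a tag shows up in the library, you can also talk to the local Ollama server over its REST API instead of the CLI. A minimal sketch, assuming the model ends up tagged `mistral-nemo` (the final name is still undecided, see below) and that Ollama is serving on its default port 11434:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,    # e.g. "mistral-nemo" -- the tag name is an assumption
        "prompt": prompt,
        "stream": False,   # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request("mistral-nemo", "Why is the sky blue?")
# urllib.request.urlopen(req) would send it once the server is up and the
# model has been pulled.
```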


We're working on it, but there's a change to the tokenizer that we're still working through in our conversion scripts. Unfortunately we don't get a heads-up from Mistral when they drop a model, so sometimes it takes a little while to sort out the differences.

Also, I'm not sure if we'll call it mistral-nemo or nemo yet. :-D


Adding to this: if the default is too slow, look at the more heavily quantized versions of the model; they are smaller, at a moderate cost in output quality. Ollama can split models between GPU and host memory, but the throughput drop-off tends to be pretty severe.
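You can ballpark whether a given quantization fits in VRAM from the parameter count and the bits per weight. A back-of-the-envelope sketch (the 12B parameter count matches Mistral NeMo; the effective bits per weight for the GGUF quant types and the ~20% overhead factor for KV cache and runtime buffers are rough assumptions):

```python
def approx_model_gib(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Very rough memory estimate: weights * bits / 8, plus an assumed
    ~20% overhead for KV cache, activations, and runtime buffers."""
    bytes_for_weights = params_billion * 1e9 * bits_per_weight / 8
    return bytes_for_weights * overhead / 2**30

# Compare common quantization levels for a 12B-parameter model.
for name, bits in [("f16", 16.0), ("q8_0", 8.5), ("q4_0", 4.5)]:
    print(f"{name}: ~{approx_model_gib(12, bits):.1f} GiB")
```

The q4 estimate landing well under a 12 GiB card while f16 does not is why the heavier quants are the usual first thing to try.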


Why would it take a couple days? Is it not a matter of uploading the model to their registry, or are there more steps involved than that?


Ollama depends on llama.cpp as its backend, so any changes needed to support something new in this model's architecture or tokenizer have to land there first.

Then the model needs to be properly quantized and formatted for GGUF (the model format that llama.cpp uses), tested, and uploaded to the model registry.
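For a sense of what "formatted for GGUF" means at the byte level: a GGUF file opens with a small fixed header (magic, format version, tensor count, metadata key/value count), followed by the metadata and tensor data. A simplified sketch that writes and re-reads just that header (real files carry much more, and the exact version number may differ from the 3 assumed here):

```python
import struct

GGUF_MAGIC = b"GGUF"

def write_gguf_header(version: int, n_tensors: int, n_kv: int) -> bytes:
    # magic (4 bytes), version (uint32), tensor count (uint64),
    # metadata key/value count (uint64) -- all little-endian
    return GGUF_MAGIC + struct.pack("<I Q Q", version, n_tensors, n_kv)

def read_gguf_header(blob: bytes):
    if blob[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return struct.unpack_from("<I Q Q", blob, 4)

header = write_gguf_header(version=3, n_tensors=291, n_kv=24)
print(read_gguf_header(header))  # (3, 291, 24)
```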

So there's some length to the pipeline that things need to go through, but overall the devs in both projects generally have things running pretty smoothly, and I'm regularly impressed at how quickly both projects get updated to support such things.


Issue to track Mistral NeMo support in llama.cpp: https://github.com/ggerganov/llama.cpp/issues/8577


> I'm regularly impressed at how quickly both projects get updated to support such things.

Same! Big kudos to all involved



