It really is good. Surprisingly, it seems to answer instruct-like prompts well! I’ve been using it with Ollama (https://github.com/jmorganca/ollama) with prompts like:
ollama run phind-codellama "write c code to reverse a linked list"
To run this on an M1 Mac or similar machine, you'll need around 32GB of memory for the 4-bit quantized version: it's a 34B-parameter model, so even quantized the weights alone come to roughly 20GB (34B parameters at about 4.5 bits per weight), and you want some headroom beyond that.
Yes, you'll need the latest version (0.0.16) to run the 34B model. It should run great on that machine!
The download (both as a Mac app and a standalone binary) is available here: https://github.com/jmorganca/ollama/releases/tag/v0.0.16. And I'll work on getting the brew formula updated as well! Sorry to see you hit an error!
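Once you're on 0.0.16, pulling and running it from the terminal is just this (same model tag as in the command above; the initial pull fetches the full ~20GB of weights, so give it a while):

ollama pull phind-codellama
ollama run phind-codellama "write c code to reverse a linked list"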
Is there any chance you can add the phind-codellama-34b-v2 model? Or if there's already a way to run that with ollama, can you tell me how or point me to some docs? Thanks a ton!
This feature was surprisingly hard to find, but you can wrap multi-line input in """. So start off with """, then type anything you want across as many lines as you like (Enter just adds a newline), then close with """ and it'll run the whole prompt. (""" is three double-quote characters in a row; the formatting from HN is making it look funny.)
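For example, an interactive session looks roughly like this (illustrative only; the exact prompt characters may render a bit differently in your version):

ollama run phind-codellama
>>> """
write c code to reverse a linked list,
then explain its time complexity
"""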
This project is sweet, and I fortunately have an M1 Mac laptop with 64GB of RAM to play with it.
But I'd also like to be able to run these models on my Linux desktop with two GPUs (a 2080 Ti and a 3080 Ti) and a Threadripper. How difficult would it be to get some of these set up on that machine?
I personally use llama.cpp as the driver since I run CPU-only, but another backend may be better suited for GPU use. Beyond that, it's as simple as downloading the model and placing it in the right directory.
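If you'd rather drive llama.cpp directly on that box, the rough shape is something like this (commands as of mid-2023; the model filename, thread count, and layer count are placeholders, not exact values):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make    # plain CPU-only build
./main -m ./models/phind-codellama-34b.q4_0.bin -t 16 -n 256 -p "write c code to reverse a linked list"

For the 2080 Ti / 3080 Ti, build with cuBLAS instead (make LLAMA_CUBLAS=1) and pass -ngl <n> to offload that many layers onto the GPUs.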
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects
Command '. "/Users/pmarreck/Downloads/oobabooga_macos/installer_files/conda/etc/profile.d/conda.sh" && conda activate "/Users/pmarreck/Downloads/oobabooga_macos/installer_files/env" && python -m pip install -r requirements.txt --upgrade' failed with exit status code '1'. Exiting...
My Threadripper has 64 cores/128 threads, and I'm wondering whether any of these models can take advantage of that CPU concurrency to at least mitigate some of the performance loss from not using a GPU.
In my experience llama.cpp doesn't take as much advantage of parallelism as it could. I tested this on an HPC cluster: increasing the thread count certainly increased CPU usage, but it didn't meaningfully improve tok/s past 6-8 cores. Same behavior with whisper.cpp. :( I wonder if there's another backend that scales better.
I'm guessing the problem is that you're constrained by memory bandwidth rather than compute, and that this is inherent to the algorithm, not an artifact of any one implementation. Generating each token streams essentially all of the weights through memory, so once a few cores can saturate the memory bus, extra threads stop helping.
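If anyone wants to sanity-check that on their own hardware, a crude way to sweep thread counts with llama.cpp's ./main (flags as of mid-2023; the model path is a placeholder) and compare the reported eval speed:

for t in 2 4 8 16 32 64; do
  echo "threads: $t"
  ./main -m ./models/model.q4_0.bin -p "hello" -n 64 -t "$t" 2>&1 | grep "eval time"
done

Past the point where the active cores saturate memory bandwidth, the tokens-per-second figure in that line stops improving, which matches the 6-8 core plateau described above.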
The number of times I've had to do that in real production code amounts to zero.
The number of times I've had to piece together code from poorly documented external services, conflicting product requirements, and complex, weird environments has been ... well ... multiple times a day for the past 20+ years.
"instruct-like prompts" is what you give a very junior engineer out of college, and then you have to carefully review their code.