Does anyone have a video or written article that would get one up to speed with a bit of the history/progression and current products that are out there for one to try locally?
This is coming from someone who understands the general concepts of how LLMs work but has only used the general publicly available tools like ChatGPT, Claude, etc.
I want to see if I have any hardware I can stress and run something locally, but don’t know where to start or even what the available options are.
If I understand correctly what you are looking for, Ollama (https://ollama.com/) might be a solution. I have no affiliation, but I lazily use this solution when I want to run a quick model locally.
To each their own, how are you using these extra features? I personally am not looking to spend a bunch on API credits and don't have the hardware to run models larger than 7-8b parameters. I use local LLMs almost exclusively for formatting notes and as a reading assistant/summarizer and therefore don't need these features.
Here's a summary of what's happened the past couple of years and what tools are out there.
After ChatGPT was released, there was a lot of hype in the space but open source was far behind. Iirc the best open foundation LLM that existed was GPT-2 but it was two generations behind.
A while later Meta released LLaMA[1], a well-trained base foundation model, which brought an explosion to open source. It was soon implemented in the Hugging Face Transformers library[2] and the weights were spread across the Hugging Face website for anyone to use.
At first, it was difficult to run locally. Few developers had the system or the money to run it. It required too much RAM, and iirc Meta's original implementation didn't support running on the CPU, but developers soon came up with methods to make it smaller via quantization. The biggest project for this was Llama.cpp[3], which is probably still the biggest open source project today for running LLMs locally. Hugging Face Transformers also added quantization support through bitsandbytes[4].
Over the next months there was rapid development in open source. Quantization techniques improved which meant LLaMA was able to run with less and less RAM with greater and greater accuracy on more and more systems. Tools for finetuning LLaMA came out, followed by hundreds of finetunes on instruction-following, RLHF, and chat datasets, which drastically increased accuracy even further. During this time, Stanford's Alpaca, Lmsys's Vicuna, Microsoft's Wizard, 01ai's Yi, Mistral, and a few others made their way onto the open LLM scene with some very good LLaMA finetunes.
A new inference engine (software for running LLMs like Llama.cpp, Transformers, etc) called vLLM[5] came out which was capable of running LLMs in a more efficient way than was previously possible in open source. Soon it would even get good AMD support, making it possible for those with AMD GPUs to run open LLMs locally and with relative efficiency.
Then Meta released Llama 2[6]. Llama 2 was by far the best open LLM of its time. It was released with RLHF instruction finetunes for chat and with human evaluation data that put its open LLM leadership beyond doubt. Existing tools like Llama.cpp and Hugging Face Transformers quickly added support and users had access to the best LLM open source had to offer.
At this point in time, despite all the advancements, it was still difficult to run LLMs. Llama.cpp and Transformers were great engines for running LLMs but the setup process was difficult and required a lot of time. You had to find the best LLM, quantize it in the best way for your computer (or figure out how to identify and download one from Hugging Face), set up whatever engine you wanted, figure out how to use your quantized LLM with the engine, fix any bugs you made along the way, and finally figure out how to prompt your specific LLM in a chat-like format.
However, tools started coming out to make this process significantly easier. The first one of these that I remember was GPT4All[7]. GPT4All was a wrapper around Llama.cpp that made it easy to install, made it easy to select the LLM you want (pre-quantized options available from a download manager), and provided a chat UI that made LLMs easy to use. This significantly reduced the barrier to entry for those who were interested in using LLMs.
The second project that I remember was Ollama[8]. Also a wrapper around Llama.cpp, Ollama gave most of what GPT4All had to offer but in an even simpler way. Today, I believe Ollama is bigger than GPT4All although I think it's missing some of the higher-level features of GPT4All.
Another important tool that came out during this time is called Exllama[9]. Exllama is an inference engine with a focus on modern consumer Nvidia GPUs and advanced quantization support based on GPTQ. It is probably the best inference engine for squeezing performance out of consumer Nvidia GPUs.
Months later, Nvidia came out with another new inference engine called TensorRT-LLM[10]. TensorRT-LLM is capable of running most LLMs and does so with extreme efficiency. It is the most efficient open source inferencing engine that exists for Nvidia GPUs. However, it also has the most difficult setup process of any inference engine and is made primarily for production use cases and Nvidia AI GPUs so don't expect it to work on your personal computer.
With the rumors of GPT-4 being a Mixture of Experts LLM, research breakthroughs in MoE, and some small MoE LLMs coming out, interest in MoE LLMs was at an all-time high. The company Mistral had proven itself in the past with very impressive LLaMA finetunes, and it capitalized on this interest by releasing Mixtral 8x7b[11], the most accurate LLM for its size that the local LLM community had seen to date. Eventually MoE support was added to all inference engines and it was a very popular mid-to-large sized LLM.
Cohere released their own LLM as well called Command R+[12] built specifically for RAG-related tasks with a context length of 128k. It's quite large and doesn't have notable performance on many metrics, but it has some interesting RAG features no other LLM has.
More recently, Llama 3[13] was released which, similar to previous Llama releases, blew every other open LLM out of the water. The smallest version of Llama 3 (Llama 3 8b) has the greatest accuracy for its size of any open LLM and the largest version of Llama 3 released so far (Llama 3 70b) beats every other open LLM on almost every metric.
Less than a month ago, Google released Gemma 2[14]. The largest version performs very well under human evaluation despite being less than half the size of Llama 3 70b, but performs only decently on automated benchmarks.
If you're looking for a tool to get started running LLMs locally, I'd go with either Ollama or GPT4All. They make the process about as painless as possible. I believe GPT4All has more features like using your local documents for RAG, but you can also use something like Open WebUI[15] with Ollama to get the same functionality.
If you want to get into the weeds a bit and extract some more performance out of your machine, I'd go with using Llama.cpp, Exllama, or vLLM depending upon your system. If you have a normal, consumer Nvidia GPU, I'd go with Exllama. If you have an AMD GPU that supports ROCm 5.7 or 6.0, I'd go with vLLM. For anything else, including just running it on your CPU or M-series Mac, I'd go with Llama.cpp. TensorRT-LLM only makes sense if you have an AI Nvidia GPU like the A100, V100, A10, H100, etc.
I think Stable Diffusion was first to release a SOTA model (August 2022) that worked locally, not in language but image generation, but it set the tone for Meta. LLaMA only came in February 2023.
> The company Mistral had proven itself in the past with very impressive LLaMA finetunes
Mistral is not a finetune of LLaMA, it is a model trained from scratch. Also, Mistral was most of the time better than LLaMA during this period.
> Quantization techniques improved which meant LLaMA was able to run with less and less RAM with greater and greater accuracy
Quantization does not improve accuracy, except if you trade off precision for longer context maybe, but not on similar prompts. It is like JPEG compression, the original is always better for a specific image, but for the same byte size you get more resolution from JPEG than say... a PNG.
> I think Stable Diffusion was first to release a SOTA model (August 2022) that worked locally, not in language but image generation, but it set the tone for Meta. LLaMA only came in February 2023.
Sure, I was only covering LLMs though. If I wanted to cover image generation models and tools as well, the comment would be double its size.
> Mistral is not a finetune of LLaMA, it is a model trained from scratch. Also, Mistral was most of the time better than LLaMA during this period.
Oh, that's right. Iirc it was just the Llama 2 architecture that was used with sliding window attention.
> Quantization does not improve accuracy, except if you trade off precision for longer context maybe, but not on similar prompts. It is like JPEG compression, the original is always better for a specific image, but for the same byte size you get more resolution from JPEG than say... a PNG.
I'm well aware of how quantization works. I meant that quantization methods were increasingly able to retain accuracy, for example methods that quantize less important weights more heavily, which improves accuracy for the same LLM size.
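To make the size/accuracy trade-off in this exchange concrete, here is a toy sketch of plain block-wise round-to-nearest quantization. It is not what Llama.cpp, GPTQ, or bitsandbytes actually do (those are considerably more sophisticated, and this sketch treats every weight as equally important); the function names, the block size of 32, and the synthetic Gaussian "weights" are all assumptions made for illustration.

```python
import random

def quantize_blockwise(weights, bits, block_size=32):
    """Round-trip weights through symmetric per-block integer quantization."""
    qmax = 2 ** (bits - 1) - 1  # largest integer level, e.g. 7 for 4-bit
    restored = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # One scale per block; fall back to 1.0 if the block is all zeros.
        scale = max(abs(w) for w in block) / qmax or 1.0
        restored.extend(round(w / scale) * scale for w in block)
    return restored

def mean_abs_error(original, restored):
    """Average absolute difference between original and restored weights."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

random.seed(0)
# Synthetic stand-in for a weight tensor (hypothetical, not from a real model).
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
for bits in (8, 4, 3):
    err = mean_abs_error(weights, quantize_blockwise(weights, bits))
    print(f"{bits}-bit: mean abs error = {err:.6f}")
```

The error grows as the bit width shrinks, which is the JPEG-like loss described above. Better quantization methods aim to shrink that error at a given bit width rather than to beat the original weights.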
All the main IDE-integrated ones seem very much on par (Copilot, Sourcegraph Cody, Continue.dev), with cursor.sh liked by some as it has a code-assistant-first UI.
I've personally gone back to the browser with Claude 3.5 Sonnet (and the projects + artifacts feature), as it is one of the most industrious ones, and I really like the UX of artifacts + it integrates new code well into existing code you paste into it.
In the end I think it also often comes down to what languages/frameworks you are using and how well the LLM/product handles them, so I'd still recommend testing a few. E.g. some of the main frameworks I'm working with on a daily basis went through big refactors/interface changes 1-2 years ago, and I stopped using ChatGPT because it had a strong tendency to produce code based on the old interfaces/paradigms.
Aider[0] is also quite interesting, especially when it comes to more significant refactorings in the codebase, and it has gotten quite good at that with the last few bigger model releases, but it takes some time to get used to and doesn't have good IDE integration.
I've been following the state of things, but I'm not sure which ones are the best. There's Meta's CodeLlama[1], Mistral's Codestral[2], DeepSeek AI's DeepSeek-Coder-V2-Instruct[3], CodeGemma[4], Alibaba's CodeQwen[5], and Microsoft's WizardCoder[6].
I'm pretty sure CodeLlama is out of date now. I've heard DeepSeek LLMs are good and DeepSeek-Coder-V2-Instruct was released recently. With the good reputation and its massive size (236b) I'd guess it is the best coding LLM, but if it's not being trained efficiently, maybe Codestral and Codestral Mamba come close.
I don't think the best coding LLMs are close to GitHub Copilot but I could be wrong since I'm just relaying information that I've heard secondhand.
>Most the 7b instruct models are very bad outside very simple queries.
I can't agree with "very bad". Maybe your standards are set by the best, largest models, but have a little perspective: a modern 7b model is a friggin magical piece of software. Fully in the realm of sci-fi until basically last Tuesday. It can reliably summarize documents, bash a 30 minute rambling voice note into a terse proposal, and give you social counseling at least on par with r/Relationship_Advice. It might not always get facts exactly right but it is smart in a way that computers have never been before. And for all this capability, you can get it running on a computer a decade old, maybe even a Raspberry Pi or a smartphone.
To answer the parent: Download a "gguf" file (blob of weights) of a popular model like Mistral from Hugging Face. Git clone and compile llama.cpp. Run ./main -m path/to/gguf -p "prompt"
Even better, install ollama and then do "ollama run llama3", it works like docker, pulls the model locally and starts a chat session right there in the terminal. No need to compile. Or just run the docker image "ollama/ollama".
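Once Ollama is running, you can also talk to it from a script instead of the terminal chat. The following is a minimal sketch that assumes the Ollama server is listening on its default local port (11434) and that the "llama3" model has already been pulled; the helper names `build_request` and `generate` are made up for this example.

```python
import json
import urllib.request

# Assumption: a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="llama3"):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, model="llama3"):
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]

# Example call, commented out because it requires the server to be up:
# print(generate("Summarize what quantization does in one sentence."))
```

Only the standard library is used here so there is nothing extra to install. With `"stream": False` the server returns one JSON object instead of a stream of partial responses.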
I'm looking to run something on a 24gb GPU for the purpose of running wild with agentic use of LLMs. Is there anything worth trying that would fit in that amount of VRAM? Or are all the open-source PC-sized LLMs laughable still?
You can run the llama 70b based models faster than 10 tkn/s on 24gb vram. I've found that the quality of this class of LLMs is heavily swayed by your configuration and system prompting and results may vary. This Reddit post seems to have some input on the topic:
You would probably get the same tokens per second with llama 3 70b if you just unplugged the 24gb GPU. For something that actually fits in 24gb of VRAM, I recommend gemma 2 27b up to q6. I use q4 and it works quite well for my needs.
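A rough back-of-the-envelope check of why a 70b model spills out of 24gb while Gemma 2 27b at q4 or q6 does not. This is only a sketch: it counts the weights alone, treats "q4" and "q6" as exactly 4 and 6 bits per weight (real quantized files carry some extra overhead for scales), and ignores the KV cache and activations, so actual usage will be higher. The function name is made up for this example.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

VRAM_GB = 24
candidates = [
    ("Llama 3 70b @ 4-bit", 70, 4),
    ("Gemma 2 27b @ 6-bit", 27, 6),
    ("Gemma 2 27b @ 4-bit", 27, 4),
]
for name, params, bits in candidates:
    size = weights_gb(params, bits)
    verdict = "fits" if size < VRAM_GB else "does not fit"
    print(f"{name}: ~{size:.1f} GB of weights, {verdict} in {VRAM_GB} GB")
```

Under these assumptions a 70b model at 4 bits needs about 35 GB for weights alone, so part of it has to live in system RAM, which is why the GPU stops helping much.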
Sadly, I've yet to find anyone doing "daily ML-hobbyist news" content creation, summarizing the types of articles that appear on these subreddits. (Which is a surprise to me, as it's really easy to find e.g. "daily homelab news" content creators. Please, someone, start a "daily ML-hobbyist news" blog/channel! Given that the target audience would essentially be "people who will get an itch to buy a better GPU soon", the CPM you'd earn on ad impressions would be really high...)
---
That being said, just to get you started, here's a few things to know at present about "what you can run locally":
1. Most models (of the architectures people care about today) will probably fit on a GPU that has something like 1.5x as much VRAM as the size of the model's weights. So e.g. a "7B" (7 billion parameter) model, assuming roughly one byte per parameter (about 7GB of weights), will fit on a GPU that has 12GB of VRAM. (You can potentially squeeze even tighter if you have a machine with integrated graphics + dedicated GPU, and you're using the integrated graphics as graphics, leaving the GPU's VRAM free to only hold the model.)
2. There are models that come in all sorts of sizes. Many open-source ML models are huge (70B, 120B, 144B — things you'd need datacenter-class GPUs to run), but then versions of these same models get released which have been heavily cut down (pruned and/or quantized), to force them to fit into smaller VRAM sizes. There are 5B, 3B, 1B, even 0.5B models (although the last two are usually special-purpose models.)
3. Surprisingly, depending on your use-case, smaller models (or small quants of larger models) can "mostly" work perfectly well! They just have more edge-cases where something will send them off the rails spiralling into nonsense — so they're less reliable than their larger cousins. You might have to give them more prompting, and try regenerating their output from the same prompt several times, to get good results.
4. Apple Silicon Macs have a GPU and NPU (the Neural Engine) that read from/write to the same unified memory that the CPU does. While this makes these devices slower for inference than "real" GPUs with dedicated VRAM, it means that if you happen to own a Mac with 16GB of RAM, then you own something that can run 7B models. AS Macs are, oddly enough, the "cheapest" things you can buy in terms of model-capacity-per-dollar. (Unlike a "real" GPU, they won't be especially quick and won't have any capacity for concurrent model inference, so you'd never use one as a server backing an Inference-as-a-Service business. But for home use? No real downsides.)
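The 1.5x rule of thumb from point 1 can be sketched as a quick check. The bytes-per-parameter default below is an assumption (1.0 corresponds to 8-bit weights, 2.0 to fp16, 0.5 to 4-bit), and the function names are made up for this example; treat the result as a rough guide rather than a guarantee.

```python
def required_vram_gb(params_billion, bytes_per_param=1.0, headroom=1.5):
    """Rule-of-thumb VRAM need: weight size times a headroom factor."""
    return params_billion * bytes_per_param * headroom

def fits_in_vram(params_billion, vram_gb, bytes_per_param=1.0, headroom=1.5):
    """True if the rule-of-thumb requirement is within the available VRAM."""
    return required_vram_gb(params_billion, bytes_per_param, headroom) <= vram_gb

# A 7B model at one byte per parameter on a 12GB card:
print(required_vram_gb(7), fits_in_vram(7, 12))
# The same model at fp16 (two bytes per parameter):
print(required_vram_gb(7, bytes_per_param=2.0), fits_in_vram(7, 12, bytes_per_param=2.0))
```

On a unified-memory Mac the same check applies loosely, with the caveat that the OS and other apps share that memory.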