Does anyone have a video or written article that would get one up to speed with ...

Kydlaw · on July 16, 2024

If I understand correctly what you are looking for, Ollama (https://ollama.com/) might be a solution. I have no affiliation, but I lazily use it when I want to run a quick model locally.

TechDebtDevin · on July 16, 2024

Better yet, install Open WebUI and Ollama at the same time via Docker. Most people will want a familiar GUI rather than the terminal.

https://github.com/open-webui/open-webui

This will install Ollama and Open WebUI:

For GPU support run:

docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama

For CPU-only support run:

docker run -d -p 3000:8080 -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama

Der_Einzige · on July 16, 2024

Why do people recommend this instead of the much better oobabooga text-gen-webui?

https://github.com/oobabooga/text-generation-webui

It's like you hate settings, features, and access to many backends!

TechDebtDevin · on July 16, 2024

To each their own. How are you using these extra features? I personally am not looking to spend a bunch on API credits and don't have the hardware to run models larger than 7-8b parameters. I use local LLMs almost exclusively for formatting notes and as a reading assistant/summarizer, and therefore don't need these features.

currycurry16 · on July 16, 2024

Find good models here: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...

Check hardware requirements here: https://rahulschand.github.io/gpu_poor/

nabakin · on July 16, 2024

Here's a summary of what's happened the past couple of years and what tools are out there.

After ChatGPT released, there was a lot of hype in the space but open source was far behind. Iirc the best open foundation LLM that existed was GPT-2 but it was two generations behind.

Awhile later Meta released LLaMA[1], a well trained base foundation model, which brought an explosion to open source. It was soon implemented in the Hugging Face Transformers library[2] and the weights were spread across the Hugging Face website for anyone to use.

At first, it was difficult to run locally. Few developers had the hardware or money to run it. It required too much RAM and, iirc, Meta's original implementation didn't support running on the CPU, but developers soon came up with methods to make it smaller via quantization. The biggest project for this was Llama.cpp[3], which is probably still the biggest open source project today for running LLMs locally. Hugging Face Transformers also added quantization support through bitsandbytes[4].

Over the next months there was rapid development in open source. Quantization techniques improved which meant LLaMA was able to run with less and less RAM with greater and greater accuracy on more and more systems. Tools came out that were capable of finetuning LLaMA and there were hundreds of LLaMA finetunes that came out which finetuned LLaMA on instruction following, RLHF, and chat datasets which drastically increased accuracy even further. During this time, Stanford's Alpaca, Lmsys's Vicuna, Microsoft's Wizard, 01ai's Yi, Mistral, and a few others made their way onto the open LLM scene with some very good LLaMA finetunes.

A new inference engine (software for running LLMs like Llama.cpp, Transformers, etc) called vLLM[5] came out which was capable of running LLMs in a more efficient way than was previously possible in open source. Soon it would even get good AMD support, making it possible for those with AMD GPUs to run open LLMs locally and with relative efficiency.

Then Meta released Llama 2[6]. Llama 2 was by far the best open LLM of its time. It was released with RLHF instruction finetunes for chat and with human evaluation data that put its open LLM leadership beyond doubt. Existing tools like Llama.cpp and Hugging Face Transformers quickly added support, and users had access to the best LLM open source had to offer.

At this point in time, despite all the advancements, it was still difficult to run LLMs. Llama.cpp and Transformers were great engines for running LLMs but the setup process was difficult and required a lot of time. You had to find the best LLM, quantize it in the best way for your computer (or figure out how to identify and download one from Hugging Face), set up whatever engine you wanted, figure out how to use your quantized LLM with the engine, fix any bugs you made along the way, and finally figure out how to prompt your specific LLM in a chat-like format.

However, tools started coming out to make this process significantly easier. The first one of these that I remember was GPT4All[7]. GPT4All was a wrapper around Llama.cpp which made it easy to install and easy to select the LLM that you want (pre-quantized options for easy download from a download manager), and it came with a chat UI which made LLMs easy to use. This significantly reduced the barrier to entry for those who were interested in using LLMs.

The second project that I remember was Ollama[8]. Also a wrapper around Llama.cpp, Ollama gave most of what GPT4All had to offer but in an even simpler way. Today, I believe Ollama is bigger than GPT4All although I think it's missing some of the higher-level features of GPT4All.

Another important tool that came out during this time is called Exllama[9]. Exllama is an inference engine with a focus on modern consumer Nvidia GPUs and advanced quantization support based on GPTQ. It is probably the best inference engine for squeezing performance out of consumer Nvidia GPUs.

Months later, Nvidia came out with another new inference engine called TensorRT-LLM[10]. TensorRT-LLM is capable of running most LLMs and does so with extreme efficiency. It is the most efficient open source inferencing engine that exists for Nvidia GPUs. However, it also has the most difficult setup process of any inference engine and is made primarily for production use cases and Nvidia AI GPUs so don't expect it to work on your personal computer.

With the rumors of GPT-4 being a Mixture of Experts LLM, research breakthroughs in MoE, and some small MoE LLMs coming out, interest in MoE LLMs was at an all-time high. The company Mistral, which had proven itself in the past with very impressive LLaMA finetunes, capitalized on this interest by releasing Mixtral 8x7b[11], the most accurate LLM for its size that the local LLM community had seen to date. Eventually MoE support was added to all inference engines and it was a very popular mid-to-large sized LLM.

Cohere released their own LLM as well called Command R+[12] built specifically for RAG-related tasks with a context length of 128k. It's quite large and doesn't have notable performance on many metrics, but it has some interesting RAG features no other LLM has.

More recently, Llama 3[13] was released which, similar to previous Llama releases, blew every other open LLM out of the water. The smallest version of Llama 3 (Llama 3 8b) has the greatest accuracy for its size of any open LLM, and the largest version of Llama 3 released so far (Llama 3 70b) beats every other open LLM on almost every metric.

Less than a month ago, Google released Gemma 2[14], the largest of which performs very well under human evaluation despite being less than half the size of Llama 3 70b, but performs only decently on automated benchmarks.

If you're looking for a tool to get started running LLMs locally, I'd go with either Ollama or GPT4All. They make the process about as painless as possible. I believe GPT4All has more features like using your local documents for RAG, but you can also use something like Open WebUI[15] with Ollama to get the same functionality.

If you want to get into the weeds a bit and extract some more performance out of your machine, I'd go with using Llama.cpp, Exllama, or vLLM depending upon your system. If you have a normal, consumer Nvidia GPU, I'd go with Exllama. If you have an AMD GPU that supports ROCm 5.7 or 6.0, I'd go with vLLM. For anything else, including just running it on your CPU or M-series Mac, I'd go with Llama.cpp. TensorRT-LLM only makes sense if you have an AI Nvidia GPU like the A100, V100, A10, H100, etc.

[1] https://ai.meta.com/blog/large-language-model-llama-meta-ai/

[2] https://github.com/huggingface/transformers

[3] https://github.com/ggerganov/llama.cpp

[4] https://github.com/bitsandbytes-foundation/bitsandbytes

[5] https://github.com/vllm-project/vllm

[6] https://ai.meta.com/blog/llama-2/

[7] https://www.nomic.ai/gpt4all

[8] http://ollama.ai/

[9] https://github.com/turboderp/exllamav2

[10] https://github.com/NVIDIA/TensorRT-LLM

[11] https://mistral.ai/news/mixtral-of-experts/

[12] https://cohere.com/blog/command-r-plus-microsoft-azure

[13] https://ai.meta.com/blog/meta-llama-3/

[14] https://blog.google/technology/developers/google-gemma-2/

[15] https://github.com/open-webui/open-webui

visarga · on July 17, 2024

Overall a good write-up, but I have a few quibbles:

> Awhile later Meta released LLaMA[1],

I think Stable Diffusion was first to release a SOTA model (August 2022) that worked locally, not in language but image generation, but it set the tone for Meta. LLaMA only came in February 2023.

> The company Mistral had proven itself in the past with very impressive LLaMA finetunes

Mistral is not a finetune of LLaMA, it is a model trained from scratch. Also, Mistral was most of the time better than LLaMA during this period.

> Quantization techniques improved which meant LLaMA was able to run with less and less RAM with greater and greater accuracy

Quantization does not improve accuracy, except if you trade off precision for longer context maybe, but not on similar prompts. It is like JPEG compression, the original is always better for a specific image, but for the same byte size you get more resolution from JPEG than say... a PNG.

nabakin · on July 17, 2024

> I think Stable Diffusion was first to release a SOTA model (August 2022) that worked locally, not in language but image generation, but it set the tone for Meta. LLaMA only came in February 2023.

Sure, I was only covering LLMs though. If I wanted to cover image generation models and tools as well, the comment would be double its size.

> Mistral is not a finetune of LLaMA, it is a model trained from scratch. Also, Mistral was most of the time better than LLaMA during this period.

Oh, that's right. Iirc it was just the Llama 2 architecture that was used with sliding window attention.

> Quantization does not improve accuracy, except if you trade off precision for longer context maybe, but not on similar prompts. It is like JPEG compression, the original is always better for a specific image, but for the same byte size you get more resolution from JPEG than say... a PNG.

I'm well aware of how quantization works. I meant quantization methods were increasingly able to retain accuracy. Such as methods which quantize less important weights more heavily, improving accuracy for the same LLM size.
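To make the accuracy point concrete, here is a minimal sketch of plain symmetric round-to-nearest quantization in Python. It is not the scheme used by Llama.cpp, GPTQ, or bitsandbytes (those use per-block scales and smarter weight selection); the function names and the toy weight list are made up for illustration. It only shows that the reconstruction error grows as the bit width shrinks, which is the loss that better quantization methods try to limit.

```python
def quantize(weights, bits):
    """Symmetric round-to-nearest quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest else 1.0   # avoid dividing by zero for all-zero input
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


# Toy "weights" (hypothetical values, not taken from any real model).
weights = [0.82, -0.41, 0.05, -0.93, 0.27, 0.66, -0.12, 0.38]

for bits in (8, 4, 2):
    print(f"{bits}-bit: mean abs error = {mean_abs_error(weights, bits):.4f}")
```

The original weights are always the most accurate; fewer bits buy a smaller model at the cost of a larger error, as in the JPEG analogy above.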

psychoslave · on July 16, 2024

This is one of the most useful and informative comments I have ever come across on HN. Thank you very much.

iAmAPencilYo · on July 16, 2024

Thank you! Very helpful as a newbie coming in.

holoduke · on July 16, 2024

Great info. Do you also know the state of the code assistants? Any thoughts on copilot versus others?

hobofan · on July 16, 2024

All the main IDE-integrated ones seem very much on par (Copilot, Sourcegraph Cody, Continue.dev), with cursor.sh liked by some as it has a code-assistant-first UI.

I've personally gone back to the browser with Claude 3.5 Sonnet (and the projects + artifacts feature), as it is one of the most industrious ones, and I really like the UX of artifacts + it integrates new code well into existing code you paste into it.

In the end I think it also often comes down to what languages/frameworks you are using and how well the LLM/product handles it, so I'd still recommend to test around. E.g. some of the main frameworks I'm working with on a daily basis went through big refactors/interface changes 1-2 years ago, and I stopped using ChatGPT because it had a strong tendency to produce code based on the old interfaces/paradigms.

Aider[0] is also quite interesting, especially when it comes to more significant refactorings in the codebase, and it has gotten quite good at that with the last few bigger model releases, but it takes some time to get used to and doesn't have good IDE integration.

[0]: https://github.com/paul-gauthier/aider

nabakin · on July 16, 2024

I've been following the state of things, but I'm not sure which ones are the best. There's Meta's CodeLlama[1], Mistral's Codestral[2], DeepSeek AI's DeepSeek-Coder-V2-Instruct[3], CodeGemma[4], Alibaba's CodeQwen[5], and Microsoft's WizardCoder[6].

I'm pretty sure CodeLlama is out of date now. I've heard DeepSeek LLMs are good and DeepSeek-Coder-V2-Instruct was released recently. With the good reputation and its massive size (236b) I'd guess it is the best coding LLM, but if it's not being trained efficiently, maybe Codestral and Codestral Mamba come close.

I don't think the best coding LLMs are close to GitHub Copilot but I could be wrong since I'm just relaying information that I've heard secondhand.

[1] https://ai.meta.com/blog/code-llama-large-language-model-cod...

[2] https://mistral.ai/news/codestral/

[3] https://github.com/deepseek-ai/DeepSeek-Coder-V2

[4] https://developers.googleblog.com/en/gemma-family-expands-wi...

[5] https://qwenlm.github.io/blog/codeqwen1.5/

[6] https://github.com/nlpxucan/WizardLM

attentive · on July 17, 2024

try THUDM/codegeex4-all-9b

ygouzerh · on July 17, 2024

Wow very useful comment, thank you very much for all the work to write it!

TechDebtDevin · on July 16, 2024

Most of the 7b instruct models are very bad outside very simple queries.

You can run a 7b on most modern hardware. How fast it runs will vary.

To run 30-70b models you're getting in the realm of needing 24gb or more of vRAM.

dTal · on July 16, 2024

>Most of the 7b instruct models are very bad outside very simple queries.

I can't agree with "very bad". Maybe your standards are set by the best, largest models, but have a little perspective: a modern 7b model is a friggin magical piece of software. Fully in the realm of sci-fi until basically last Tuesday. It can reliably summarize documents, bash a 30 minute rambling voice note into a terse proposal, and give you social counseling at least on par with r/Relationship_Advice. It might not always get facts exactly right but it is smart in a way that computers have never been before. And for all this capability, you can get it running on a computer a decade old, maybe even a Raspberry Pi or a smartphone.

To answer the parent: Download a "gguf" file (blob of weights) of a popular model like Mistral from Hugging Face. Git clone and compile llama.cpp. Run ./main -m path/to/gguf -p "prompt"

visarga · on July 17, 2024

Even better, install ollama and then do "ollama run llama3". It works like docker: it pulls the model locally and starts a chat session right there in the terminal. No need to compile. Or just run the docker image "ollama/ollama".
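If you would rather script against the model than chat in the terminal, a running Ollama instance also exposes a local HTTP API. The sketch below assumes Ollama is listening on its default port (11434) and that "llama3" has already been pulled; the helper names build_request and generate are mine, not part of Ollama.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumption: unchanged defaults).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama3", url=OLLAMA_URL):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3"):
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]


# Usage, once the server is up:
#   print(generate("Summarize this note in one sentence: ..."))
```

With "stream" set to false the server replies with a single JSON object, which keeps the example short; the default streaming mode returns one JSON object per line instead.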

Agentus · on July 16, 2024

I'm looking to run something on a 24gb GPU for the purpose of running wild with agentic use of LLMs. Is there anything worth trying that would fit on that amount of vRAM? Or are all the open-source PC-sized LLMs laughable still?

TechDebtDevin · on July 16, 2024

You can run the llama 70b based models faster than 10 tkn/s on 24gb vram. I've found that the quality of this class of LLMs is heavily swayed by your configuration and system prompting and results may vary. This Reddit post seems to have some input on the topic:

https://www.reddit.com/r/LocalLLaMA/comments/1cj4det/llama_3...

I haven't used any agent frameworks other than messing around with langchain a bit, so I can't speak to how that would affect things.

Zambyte · on July 17, 2024

You would probably get the same tokens per second with llama 3 70b if you just unplugged the 24gb GPU. For something that actually fits in 24gb of VRAM, I recommend gemma 2 27b up to q6. I use q4 and it works quite well for my needs.

sva_ · on July 16, 2024

If you mean LLM in general, maybe try llamafile first

https://github.com/Mozilla-Ocho/llamafile

derefr · on July 16, 2024

For running LLMs, I think most people just dive into https://www.reddit.com/r/LocalLLaMA/ and start reading.

Not sure what the equivalent is for image generation; it's either https://www.reddit.com/r/StableDiffusion/ or one of the related subreddits it links to.

Sadly, I've yet to find anyone doing "daily ML-hobbyist news" content creation, summarizing the types of articles that appear on these subreddits. (Which is a surprise to me, as it's really easy to find e.g. "daily homelab news" content creators. Please, someone, start a "daily ML-hobbyist news" blog/channel! Given that the target audience would essentially be "people who will get an itch to buy a better GPU soon", the CPM you'd earn on ad impressions would be really high...)

---

That being said, just to get you started, here's a few things to know at present about "what you can run locally":

1. Most models (of the architectures people care about today) will probably fit on a GPU which has something like 1.5x the VRAM of the model's parameter-weights size. So e.g. a "7B" (7 billion parameter-weights) model, will fit on a GPU that has 12GB of VRAM. (You can potentially squeeze even tighter if you have a machine with integrated graphics + dedicated GPU, and you're using the integrated graphics as graphics, leaving the GPU's VRAM free to only hold the model.)

2. There are models that come in all sorts of sizes. Many open-source ML models are huge (70B, 120B, 144B — things you'd need datacenter-class GPUs to run), but then versions of these same models get released which have been heavily cut down (pruned and/or quantized), to force them to fit into smaller VRAM sizes. There are 5B, 3B, 1B, even 0.5B models (although the last two are usually special-purpose models.)

3. Surprisingly, depending on your use-case, smaller models (or small quants of larger models) can "mostly" work perfectly well! They just have more edge-cases where something will send them off the rails spiralling into nonsense — so they're less reliable than their larger cousins. You might have to give them more prompting, and try regenerating their output from the same prompt several times, to get good results.

4. Apple Silicon Macs have a GPU and Neural Engine that read from/write to the same unified memory that the CPU does. While this makes these devices slower for inference than "real" GPUs with dedicated VRAM, it means that if you happen to own a Mac with 16GB of RAM, then you own something that can run 7B models. AS Macs are, oddly enough, the "cheapest" things you can buy in terms of model-capacity-per-dollar. (Unlike a "real" GPU, they won't be especially quick and won't have any capacity for concurrent model inference, so you'd never use one as a server backing an Inference-as-a-Service business. But for home use? No real downsides.)
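As a rough way to apply the 1.5x rule from point 1, here is a small Python sketch. The bits-per-weight figures are my assumption, not something stated above: the 7B-on-12GB example works out if the weights are stored at about 8 bits each, and "q4" is approximated as 4 bits per weight (real quantized formats use slightly more). Treat the result as a ballpark, not a guarantee.

```python
def weights_gb(params_billion, bits_per_weight):
    """Size of the weights alone, in GB (1 billion params at 8 bits = 1 GB)."""
    return params_billion * bits_per_weight / 8


def required_vram_gb(params_billion, bits_per_weight, overhead=1.5):
    """Weights size times the rule-of-thumb headroom factor."""
    return weights_gb(params_billion, bits_per_weight) * overhead


def fits(params_billion, bits_per_weight, vram_gb, overhead=1.5):
    """True if the estimate fits in the given amount of VRAM."""
    return required_vram_gb(params_billion, bits_per_weight, overhead) <= vram_gb


# (label, billions of parameters, assumed bits per weight, GPU VRAM in GB)
examples = [
    ("7B at 8-bit on 12GB", 7, 8, 12),
    ("7B at 16-bit on 12GB", 7, 16, 12),
    ("27B at 4-bit on 24GB", 27, 4, 24),
    ("70B at 4-bit on 24GB", 70, 4, 24),
]

for label, params, bits, vram in examples:
    need = required_vram_gb(params, bits)
    verdict = "fits" if fits(params, bits, vram) else "does not fit"
    print(f"{label}: needs about {need:.1f} GB, {verdict}")
```

The same arithmetic applies to unified memory on a Mac, minus whatever the OS and other applications are already using.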

_kidlike · on July 16, 2024

not sure about the history/progression part, but there's ollama which makes it possible to run models locally. The UX of ollama is similar to docker.