Ask HN: How to get started with local language models?
114 points by sandwichukulele 10 months ago | 15 comments
I remember using Talk to Transformer in 2019 and making little Markov chains for silly text generation. I've lurked in the /lmg/ threads on /g/, installed the Umbrel LlamaGPT on a Raspberry Pi, and run WebGPT (GPT-2) locally in the browser with WebGPU. But I don't know how anything works or how to do anything besides following installation instructions on GitHub. HuggingFace still confuses me; there's so much stuff out there to read through, and I have been lost since the release of the llama models. I heard about Mistral. I tried the Mozilla Llamafile for Mixtral-8x7B-Instruct since it was meant to be easy, but apparently Windows won't run an EXE bigger than 4GB, so I have to do something about the weights, and I freeze there. I know where to look and how to learn about any other technology, but I'm completely lost on how to learn about local models when everything is moving so fast.

I'm missing something fundamental. How can I understand these technologies?




I wrote a series of blog posts precisely for someone in your situation. I hope you find them useful.

Easiest way to run a local LLM these days is Ollama. You don't need PyTorch or even Python installed.

https://mobiarch.wordpress.com/2024/02/19/run-rag-locally-us...
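
For what it's worth, once Ollama is running you can also hit its local HTTP API from any language. A minimal Python sketch (assuming the server is on its default port and you've already pulled a model, e.g. with "ollama pull mistral"; the model name is just an example):

  import json, urllib.request

  # minimal sketch: POST a prompt to the local Ollama server and print the reply
  payload = {"model": "mistral", "prompt": "Why is the sky blue?", "stream": False}
  req = urllib.request.Request(
      "http://localhost:11434/api/generate",
      data=json.dumps(payload).encode(),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      print(json.loads(resp.read())["response"])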

Hugging Face can be confusing, but in the end it is a very well designed framework.

https://mobiarch.wordpress.com/2024/03/02/start-using-mistra...


I'm an outsider to the ML space and recently started watching videos from Andrej Karpathy.

His videos really helped me build an intuition of how LLMs work. What I like is that he builds very simple versions of things that are easier to wrap your head around.

https://www.youtube.com/@AndrejKarpathy/videos


>HuggingFace still confuses me; there's so much stuff out there to read through, and I have been lost since the release of the llama models.

Hugging Face is basically 3 things: 1) a repo for models, 2) the Transformers library (basically some classes on top of core PyTorch that define transformer architectures, plus code to auto-download models from Hugging Face by name), and 3) the Accelerate library, which is basically multi-device training and inference.
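
To make that concrete, the Transformers side looks roughly like this (a minimal sketch; the repo id is just an example, and an unquantized 7B model is a multi-GB download):

  from transformers import AutoModelForCausalLM, AutoTokenizer

  repo = "mistralai/Mistral-7B-Instruct-v0.2"    # example repo id on the Hub
  tokenizer = AutoTokenizer.from_pretrained(repo)                          # downloads and caches from the Hub
  model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto")   # load in the checkpoint's native dtype

  inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt")
  outputs = model.generate(**inputs, max_new_tokens=50)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))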

The first thing to understand about LLMs is quantization. Most original models are uploaded in fp16 format, with different parameter counts. Higher parameter count = better performance. If you were to fine-tune the model on your own data set, you'd have to keep the model in fp16, because training gradients need higher resolution. However, fp16 also takes a shitload of RAM to store.

Inference is pretty much just picking the statistically most likely next token, which can be done without that much resolution. As such, these models are usually quantized to lower bit widths. There are 3 different quantization methods: GPTQ (GPU first), GGUF (CPU first, born from the llama.cpp project, but supports offloading layers to the GPU), and AWQ (a newer method, supposedly faster than GPTQ). It's generally accepted that 4-bit quantization is accurate enough for most things, but there is still value in higher-bit quantization.
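
To put rough numbers on that (back-of-the-envelope, ignoring context/KV-cache overhead):

  params = 7e9                       # a 7B-parameter model
  print(params * 2.0 / 1e9, "GB")    # fp16: 2 bytes per weight    -> ~14 GB
  print(params * 0.5 / 1e9, "GB")    # 4-bit: 0.5 bytes per weight -> ~3.5 GB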

Read this: https://archive.ph/2023.11.21-144133/https://towardsdatascie...

If you just want to run LLMs locally, use Ollama, and use the CLI to download the models (IIRC most models that Ollama downloads through the CLI are GGUF 4-bit quantized). If you are using the CPU and want decent inference speed, use the smallest parameter count. Otherwise use the highest parameter count that will fit in your VRAM (or RAM if you are on a Mac, since Ollama supports Apple Silicon).

If you wanna do a little bit more tinkering (like running larger models on a resource-limited laptop), you need to become familiar with the Accelerate library. Hugging Face has most of the models already quantized by the user TheBloke, so you can just use the example code on each model's Hugging Face page to load it, then use the Accelerate functions to split it up across devices.
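
Something like this (a rough sketch, not a recipe: the repo name and memory caps are placeholders, and GPTQ checkpoints additionally need the optimum and auto-gptq packages installed):

  from transformers import AutoModelForCausalLM, AutoTokenizer

  repo = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"   # example pre-quantized repo
  tokenizer = AutoTokenizer.from_pretrained(repo)
  model = AutoModelForCausalLM.from_pretrained(
      repo,
      device_map="auto",                        # let Accelerate decide which layers go on GPU vs CPU
      max_memory={0: "6GiB", "cpu": "30GiB"},   # per-device caps; adjust to your machine
  )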


You can use webui https://github.com/oobabooga/text-generation-webui

Once you get a version up and running, make a copy before you update it; several times updates have broken my working version and caused headaches.

A decent explanation of parameters, short of reading arXiv papers: https://github.com/oobabooga/text-generation-webui/wiki/03-%...

An AI news website: https://www.emergentmind.com/

Reddit's LocalLLaMA, and how to prompt an LLM: https://old.reddit.com/r/LocalLLaMA/comments/1atyxqz/better_...

Since you mention silly text generation, there is also SillyTavern, which runs on top of other LLM software such as webui. https://docs.sillytavern.app/


If you just want to run Mistral on Windows, you could try my port: https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral...

The setup is relatively easy: install the .NET runtime, download the 4.5 GB model file over BitTorrent, unpack a small ZIP file, and run the EXE.


It might help you to dive into Stable Diffusion and ComfyUI. LoRAs, finetunes, embeddings, etc. are easier to understand from a practical standpoint when you can visually compare their output. That will give you intuition for how similar layers work in LLMs.



You might want to start playing with Ollama or LM Studio. The model's weights aren't normally inside your .exe; typically you can address 128TB of virtual memory on Windows.

https://ollama.com/

https://lmstudio.ai/


If you are on Windows, Ollama and OpenWebUI are now as easy to get started with as they have been on macOS and Linux. I have created a short video about Ollama on Windows and how to install and use it with OpenWebUI: https://m.youtube.com/watch?v=z8xi44O3hvY


If you want to go deep into LLMs: https://github.com/keyvank/femtoGPT


How much RAM do you have? How much GPU VRAM do you have? What GPU do you have? What kind of performance are you hoping for, in tokens per second?


https://www.reddit.com/r/localllama is your go-to place if you want a community of like-minded people interested in exactly this.

TL;DR: there are many ways to go about it.

Quick start?

Clone the llama.cpp repo, or download the .exe or main Linux binary from the "Releases" section on GitHub (on the right). If you care about security, do this in a virtual machine (unless you plan to only use unquantised safetensors).

Example syntax: ./llama.cpp/main -i -ins --color -c 0 --split-mode layer --keep -1 --top-k 40 --top-p 0.9 --min-p 0.02 --temp 2.0 --repeat-penalty 1.1 -n -1 --multiline-input -ngl 3 -m mixtral-8x7b-instruct-v0.1.Q8_0.gguf

In this example, I'm running Mixtral at quantisation Q8, with 3 layers offloaded to the GPU, for about 45GB RAM usage and 7GB VRAM (GPU) usage. To make sense of quants, this is the general rule: you pick the largest quant you can run with your RAM.

If you go look for TheBloke models, they all have a handy model card stating how much RAM each quantisation uses.

I tend to use GGUF versions, which run on CPU but can have some layers offloaded on GPU.

I definitely recommend reading the https://github.com/ggerganov/llama.cpp documentation.



Have you tried llama.cpp?


Sounds like you’re on the right track. Persistence is key.



