Ask HN: How to get started with local language models?
114 points by sandwichukulele 10 months ago | 15 comments
I remember using Talk to Transformer in 2019 and making little Markov chains for silly text generation. I've lurked in the /lmg/ threads on /g/, installed the Umbrel LlamaGPT on a Raspberry Pi, and run WebGPT (GPT-2) locally in the browser with WebGPU. But I don't know how anything works or how to do anything besides following installation instructions on GitHub. HuggingFace still confuses me; there's so much stuff out there to read through, and I have been lost since the release of the llama models. I heard about Mistral. I tried the Mozilla Llamafile for Mixtral-8x7B-Instruct since it was meant to be easy, but apparently Windows won't run an EXE bigger than 4GB, so I have to do something about the weights, and I freeze there. I know where to look and how to learn about any other technology, but I'm completely lost on how to learn about local models when everything is moving so fast.

I'm missing something fundamental. How can I understand these technologies?




I wrote a series of blog posts precisely for someone in your situation. I hope you find them useful.

Easiest way to run a local LLM these days is Ollama. You don't need PyTorch or even Python installed.

https://mobiarch.wordpress.com/2024/02/19/run-rag-locally-us...
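
For what it's worth, once Ollama is running you can also hit its local HTTP API from any language. A minimal Python sketch (assuming the server is on its default port and you've already pulled a model, e.g. with "ollama pull mistral"; the model name is just an example):

  import json, urllib.request

  # minimal sketch: POST a prompt to the local Ollama server and print the reply
  payload = {"model": "mistral", "prompt": "Why is the sky blue?", "stream": False}
  req = urllib.request.Request(
      "http://localhost:11434/api/generate",
      data=json.dumps(payload).encode(),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      print(json.loads(resp.read())["response"])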

Hugging Face can be confusing, but in the end it is a very well designed framework.

https://mobiarch.wordpress.com/2024/03/02/start-using-mistra...


I'm an outsider to the ML space and recently started watching videos from Andrej Karpathy.

His videos really helped me build an intuition of how LLMs work. What I like is that he builds very simple versions of things that are easier to wrap your head around.

https://www.youtube.com/@AndrejKarpathy/videos


>HuggingFace still confuses me; there's so much stuff out there to read through, and I have been lost since the release of the llama models.

Hugging Face is basically 3 things: 1) a repo for models, 2) the Transformers library (basically some classes on top of core PyTorch that define transformer architectures, plus code to auto-download models from Hugging Face by name), and 3) the Accelerate library, which is basically multi-device training and inference.
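
To make that concrete, the Transformers side looks roughly like this (a minimal sketch; the repo id is just an example, and an unquantized 7B model is a multi-GB download):

  from transformers import AutoModelForCausalLM, AutoTokenizer

  repo = "mistralai/Mistral-7B-Instruct-v0.2"    # example repo id on the Hub
  tokenizer = AutoTokenizer.from_pretrained(repo)                          # downloads and caches from the Hub
  model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto")   # load in the checkpoint's native dtype

  inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt")
  outputs = model.generate(**inputs, max_new_tokens=50)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))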

The first thing to understand about LLMs is quantization. Most original models are uploaded in fp16 format, with different parameter counts. Higher parameter count = better performance. If you were to fine-tune the model on your own data set, you'd have to keep the model in fp16, because training gradients need higher resolution. However, fp16 also takes a shitload of RAM to store.

Inference is pretty much just picking the statistically most likely next token, which can be done without that much resolution. As such, these models are usually quantized to lower bit widths. There are 3 different quantization methods: GPTQ (GPU first), GGUF (CPU first, born from the llama.cpp project, but supports offloading layers to the GPU), and AWQ (a newer method, supposedly faster than GPTQ). It's generally accepted that 4-bit quantization is accurate enough for most things, but there is still value in higher-bit quantization.
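
To put rough numbers on that (back-of-the-envelope, ignoring context/KV-cache overhead):

  params = 7e9                       # a 7B-parameter model
  print(params * 2.0 / 1e9, "GB")    # fp16: 2 bytes per weight    -> ~14 GB
  print(params * 0.5 / 1e9, "GB")    # 4-bit: 0.5 bytes per weight -> ~3.5 GB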

Read this: https://archive.ph/2023.11.21-144133/https://towardsdatascie...

If you just want to run LLMs locally, use Ollama, and use the CLI to download the models (IIRC most models that Ollama downloads through the CLI are GGUF 4-bit quantized). If you are using the CPU and want decent inference speed, use the smallest parameter count. Otherwise use the highest parameter count that will fit in your VRAM (or RAM if you are on a Mac, since Ollama supports Apple Silicon).

If you wanna do a little bit more tinkering (like running larger models on a resource-limited laptop), you need to become familiar with the Accelerate library. Hugging Face has most of the models already quantized by the user TheBloke, so you can just use the example code on each model's Hugging Face page to load it, then use the Accelerate functions to split it up across devices.
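
Something like this (a rough sketch, not a recipe: the repo name and memory caps are placeholders, and GPTQ checkpoints additionally need the optimum and auto-gptq packages installed):

  from transformers import AutoModelForCausalLM, AutoTokenizer

  repo = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"   # example pre-quantized repo
  tokenizer = AutoTokenizer.from_pretrained(repo)
  model = AutoModelForCausalLM.from_pretrained(
      repo,
      device_map="auto",                        # let Accelerate decide which layers go on GPU vs CPU
      max_memory={0: "6GiB", "cpu": "30GiB"},   # per-device caps; adjust to your machine
  )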


You can use webui https://github.com/oobabooga/text-generation-webui

Once you get a version up and running, make a copy before you update it; several times updates have broken my working version and caused headaches.

A decent explanation of parameters, short of reading arXiv papers: https://github.com/oobabooga/text-generation-webui/wiki/03-%...

An AI news website: https://www.emergentmind.com/

Reddit's LocalLLaMA, and how to prompt an LLM: https://old.reddit.com/r/LocalLLaMA/comments/1atyxqz/better_...

Since you mention silly text generation, there is also SillyTavern, which runs on top of other LLM software such as webui. https://docs.sillytavern.app/


If you just want to run Mistral on Windows, you could try my port: https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral...

The setup is relatively easy: install the .NET runtime, download the 4.5 GB model file over BitTorrent, unpack a small ZIP file, and run the EXE.


It might help you to dive into Stable Diffusion and ComfyUI. LoRAs, finetunes, embeddings, etc. are easier to understand from a practical standpoint when you can visually compare their output. That will give you intuition for how similar layers work in LLMs.



You might want to start playing with Ollama or LM Studio. The model's weights aren't normally inside your .exe; typically you can address 128TB of virtual memory on Windows.

https://ollama.com/

https://lmstudio.ai/


If you are on Windows, Ollama and OpenWebUI are now as easy to get started with as they have been on macOS and Linux. I have created a short video about Ollama on Windows and how to install and use it with OpenWebUI: https://m.youtube.com/watch?v=z8xi44O3hvY


If you want to go deep into LLMs: https://github.com/keyvank/femtoGPT


How much RAM do you have? How much GPU VRAM do you have? What GPU do you have? What kind of performance are you hoping for, in tokens per second?


https://www.reddit.com/r/localllama is your go-to place if you want a community of like-minded people interested in exactly this.

TL;DR: there are many ways to go about it.

Quick start?

Clone the llama.cpp repo, or download the .exe or main Linux binary from the "Releases" section on GitHub (on the right). If you care about security, do this in a virtual machine (unless you plan to only use unquantised safetensors).

Example syntax: ./llama.cpp/main -i -ins --color -c 0 --split-mode layer --keep -1 --top-k 40 --top-p 0.9 --min-p 0.02 --temp 2.0 --repeat-penalty 1.1 -n -1 --multiline-input -ngl 3 -m mixtral-8x7b-instruct-v0.1.Q8_0.gguf

In this example, I'm running Mixtral at quantisation Q8, with 3 layers offloaded to the GPU, for about 45GB RAM usage and 7GB VRAM (GPU) usage. To make sense of quants, this is the general rule: you pick the largest quant you can run with your RAM.

If you go look for TheBloke models, they all have a handy model card stating how much RAM each quantisation uses.

I tend to use GGUF versions, which run on CPU but can have some layers offloaded on GPU.

I definitely recommend reading the https://github.com/ggerganov/llama.cpp documentation.



Have you tried llama.cpp?


Sounds like you’re on the right track. Persistence is key.



