>HuggingFace still confuses me; there's so much stuff out there to read through, and I have been lost since the release of the llama models.
Hugging Face is basically three things: 1) a repository for models, 2) the Transformers library (basically some classes on top of core PyTorch that define the transformer architectures, plus code to automatically download models from the Hugging Face Hub by name), and 3) the Accelerate library, which is basically multi-device training and inference.
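For point 2, loading a model by name with Transformers looks roughly like this (the model id is just an example; swap in whatever repo you actually want):

    # Minimal sketch of the Transformers "download by name" workflow.
    # Assumes `pip install transformers torch`; the model id is just an example.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mistralai/Mistral-7B-v0.1"  # any Hub repo id works; this one is ~14 GB of fp16 weights

    tokenizer = AutoTokenizer.from_pretrained(model_id)      # pulls tokenizer files from the Hub
    model = AutoModelForCausalLM.from_pretrained(model_id)   # pulls and caches the weights

    inputs = tokenizer("Hello, my name is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))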
The first thing to understand about LLMs is quantization. Most original models are uploaded in fp16 format, at various parameter counts, and higher parameter count generally means better performance. If you want to fine tune a model on your own data set, you have to keep it in fp16, because training gradients need the higher resolution. However, fp16 also takes a shitload of RAM to store.
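Back-of-the-envelope math for the weights alone (ignoring activations and KV cache, so real usage is higher):

    # Rough weight memory: parameters * bytes per parameter.
    def weight_gb(params_billions, bits_per_weight):
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9

    for params in (7, 13, 70):
        print(f"{params}B model: fp16 ~{weight_gb(params, 16):.1f} GB, "
              f"4-bit ~{weight_gb(params, 4):.1f} GB")
    # 7B:  fp16 ~14 GB  vs 4-bit ~3.5 GB
    # 13B: fp16 ~26 GB  vs 4-bit ~6.5 GB
    # 70B: fp16 ~140 GB vs 4-bit ~35 GB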
Inference is pretty much just picking the statistically most likely next token, which doesn't need that kind of resolution. As such, these models are usually quantized to lower bit widths. There are three common quantization formats: GPTQ (GPU first), GGUF (CPU first, born out of the llama.cpp project, but it supports offloading layers to the GPU), and AWQ (a newer method, supposedly faster than GPTQ). It's generally accepted that 4 bit quantization is accurate enough for most things, but there is still some value in higher bit quantization.
Read this: https://archive.ph/2023.11.21-144133/https://towardsdatascie...
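For example, running one of those 4-bit GGUF files through the llama-cpp-python bindings looks roughly like this; the file path and layer count are placeholders, and n_gpu_layers=0 keeps it entirely on the CPU:

    # Sketch of running a 4-bit GGUF model with llama-cpp-python (pip install llama-cpp-python).
    # The model path is a placeholder for whatever GGUF file you downloaded.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local file
        n_ctx=2048,        # context window
        n_gpu_layers=20,   # offload some layers to the GPU; 0 = pure CPU
    )

    out = llm("Q: What is quantization? A:", max_tokens=64)
    print(out["choices"][0]["text"])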
If you just want to run LLMs locally, use Ollama, and use its CLI to download models (IIRC most models Ollama downloads through the CLI are 4 bit quantized GGUF). If you are running on the CPU and want decent inference speed, use the smallest parameter model. Otherwise use the highest parameter count that will fit in your VRAM (or RAM if you are on a Mac, since Ollama supports Apple Silicon).
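The CLI side is just 'ollama pull <model>' followed by 'ollama run <model>'. If you'd rather drive it from Python, there's also an official ollama package; rough sketch, assuming the Ollama server is already running and with the model name as just an example:

    # Sketch using the ollama Python client (pip install ollama).
    # Assumes the Ollama server is running locally; the model name is just an example.
    import ollama

    ollama.pull("llama3")  # same as `ollama pull llama3` on the CLI

    response = ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
    )
    print(response["message"]["content"])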
If you want to do a little more tinkering (like running larger models on a resource-limited laptop), you need to get familiar with the Accelerate library. Most models on Hugging Face have already been quantized by the user TheBloke, so you can just use the example code on each model's Hugging Face page to load it, then use Accelerate to split it across your devices.
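The splitting usually happens through the device_map argument to from_pretrained, which uses Accelerate under the hood. Rough sketch; the repo name and memory limits are placeholders you'd adjust for your hardware, and GPTQ checkpoints need optimum and auto-gptq installed:

    # Sketch of splitting a pre-quantized model across GPU and CPU RAM via Accelerate
    # (pip install transformers accelerate optimum auto-gptq).
    # The repo name and memory limits are placeholders for your own setup.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"  # example pre-quantized repo

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",                       # let Accelerate place layers automatically
        max_memory={0: "6GiB", "cpu": "16GiB"},  # cap GPU 0 usage, spill the rest to CPU RAM
    )

    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))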