Continue with LocalAI: An alternative to GitHub's Copilot that runs locally (reddit.com)
116 points by tsyklon on Aug 28, 2023 | 33 comments



You don't need so many layers of stuff (or API keys, signups, or other nonsense).

Llama.cpp (to serve the model) + the Continue VS Code extension are enough.

The rough list of steps to do so is:

  Part A: Install llama.cpp and get it to serve the model:
  --------------------------------------------------------
  1. Install the llama.cpp repo and run make.
  2. Download the relevant model (e.g. wizardcoder-python-34b-v1.0.Q4_K_S.gguf).
  3. Run the llama.cpp server (e.g., ./server -t 8 -m models/wizardcoder-python-34b-v1.0.Q4_K_S.gguf -c 16384 --mlock).
  4. Run the OpenAI-like API server [also included in llama.cpp] (e.g., python ./examples/server/api_like_OAI.py).

  Part B: Install Continue and connect it to llama.cpp's OpenAI-like API:
  -----------------------------------------------------------------------
  5. Install the Continue extension in VS Code.
  6. In the Continue extension's sidebar, click through the tutorial and then type /config to access the configuration.
  7. In the Continue configuration, add "from continuedev.src.continuedev.libs.llm.ggml import GGML" at the top of the file.
  8. In the Continue configuration, replace lines 57 to 62 (or around) with:

    models=Models(
        default=GGML(
            max_context_length=16384,
            server_url="http://localhost:8081"
        )
    ),

  9. Restart VS Code, and enjoy!

You can access your local coding LLM through the Continue sidebar now.
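
Before touching Continue at all, you can sanity-check the llama.cpp server from Part A with a plain HTTP request (assuming it's on its default port 8080):

  curl -s http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "def fibonacci(n):", "n_predict": 64}'

If that returns a completion, the model is being served correctly and anything left to debug is on the Continue side.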


One of the most annoying things about learning AI/ML for me right now is how much of this stuff is hidden behind people's companies and projects with too many emojis.

Like, I can't find simple, straightforward solutions or content that isn't tied back to a company.


I'm a complete beginner regarding this stuff, so if I may ask: how would I go about downloading the relevant model (e.g. wizardcoder-python-34b-v1.0.Q4_K_S.gguf)? I checked on Hugging Face, but all I got was a bunch of .bin files...

Thanks.


Do a search on the HuggingFace models page, e.g.:

https://huggingface.co/models?sort=trending&search=wizardcod...
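
If you'd rather script the download than click through the web UI, recent versions of huggingface_hub include a CLI download command. The repo name below is my guess at TheBloke's GGUF conversion, so double-check it against the search results:

  pip install -U huggingface_hub
  huggingface-cli download TheBloke/WizardCoder-Python-34B-V1.0-GGUF \
    wizardcoder-python-34b-v1.0.Q4_K_S.gguf --local-dir models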


Thanks, I managed to convert what I had downloaded with the convert.py script in llama.cpp.
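
For anyone else going that route, here's a rough sketch of the conversion; the paths and the f16 intermediate file are assumptions, so adjust them to wherever you unpacked the original weights:

  # convert the original HF weights to an f16 GGUF
  python3 convert.py models/WizardCoder-Python-34B-V1.0/ --outtype f16 \
    --outfile models/wizardcoder-python-34b-f16.gguf

  # quantize it down to Q4_K_S
  ./quantize models/wizardcoder-python-34b-f16.gguf \
    models/wizardcoder-python-34b-v1.0.Q4_K_S.gguf Q4_K_S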


Google the filename + "torrent download"


Thanks, it works nicely and is easy to set up.

Is it possible to use a GPU for this? With a Ryzen 9 7900X and 32 GB of RAM it takes 15-30 seconds to generate a response. I have a 6900 XT, which might be better suited for this.


Yes. In the llama.cpp server command, specify the number of layers you'd like offloaded to your GPU via the -ngl parameter, e.g.:

  ./server -t 8 -m models/wizardcoder-python-34b-v1.0.Q4_K_S.gguf -c 16384 --mlock -ngl 60
(You might need to play around with the number of layers.)

[Edit: make sure to compile llama.cpp with GPU support first, e.g., "make clean && LLAMA_CUBLAS=1 make -j"]
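
Since the question mentioned a 6900 XT: the CUBLAS build above is NVIDIA-only. For an AMD card the option I'd try is a CLBlast build (a ROCm/hipBLAS build may also work if your llama.cpp checkout is recent enough); roughly:

  make clean && LLAMA_CLBLAST=1 make -j

Then pass -ngl to ./server as above.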


Is there a way to make it work with ooba+exllama? (It's much faster than llama.cpp.)


You should be able to turn on the API in ooba:

https://github.com/oobabooga/text-generation-webui#api
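
A minimal sketch, assuming a reasonably recent text-generation-webui checkout:

  python server.py --api

That exposes the webui's own API; whether Continue can talk to it out of the box is another question.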


But that API isn't OpenAI compatible AFAIK


Thx. Where can I send flowers to?


To any person you're in a position to be kind to.


Wonder if you can pair this with https://github.com/getumbrel/llama-gpt


The funny thing about the commercial model for code-helping AI is that programmers are unusually capable of running their own AI and are also unusually concerned with digital privacy, so as soon as open-source alternatives are good enough, this market seems likely to evaporate.

But I don't know if there's enough good public data for open source models to get there.


Great observation!


Continue has a great guide on using the new Code Llama model launched by Facebook last week: https://continue.dev/docs/walkthroughs/codellama

Continue also works with various backends and fine-tuned versions of Code Llama. E.g., for a local experience with GPU acceleration on macOS, Continue can be used with Ollama (https://github.com/jmorganca/ollama):

  ollama pull wizardcoder:34b-python

  from continuedev.src.continuedev.libs.llm.ollama import Ollama

  config = ContinueConfig(
    models=Models(
      default=Ollama(model="wizardcoder:34b-python")
    )
  )
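
Once the model is pulled, a quick smoke test from the terminal (before wiring it into Continue) can be as simple as:

  ollama run wizardcoder:34b-python "Write a function that reverses a string in Python."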


Ollama only works on Mac. Here is a portable option:

https://github.com/xnul/code-llama-for-vscode


People have been compiling Ollama to run on Linux. The reason it isn't packaged for Linux yet is the difficulty of packaging it with GPU support - at the very least NVIDIA support.

Almost there!


Another option for this might be Code Llama, which runs locally on Ollama and looks interesting: https://about.fb.com/news/2023/08/code-llama-ai-for-coding/


Can anyone share how the computer's performance is impacted by running the model locally? And what your specs are?


M1 with 32 GB RAM. I can just about fit the 4-bit quantized 34B Code Llama model (and its fine-tunes, e.g. WizardCoder) into memory. It's somewhat slow, but good enough for my purposes.

Edit: when I bought my Macbook in 2021, I was like "Ok, I'll just take the base model and add another 16 GB of RAM. That should future proof it for at least another half-decade." Famous last words.


This is why my rule for laptops with non-upgradable memory has been to max out the RAM at purchase -- and it has been since 2012/2013, or whenever that trend really started.

(written from a 64GB M1 Pro Max)


I wish they offered that much in the Air.


34B Q4 will use around 20GB of memory.

If it's running slow, make sure Metal is actually being used[0]. If it happens not to be enabled, turning it on can give you as much as a 50-100% boost in tokens/s.

I'm averaging 7 to 8 tokens/s on an M1 Max 10 core (24 GPU cores).

[0] If using llama-cpp-python (or text-generation-webui, ollama, etc.), try:

`pip uninstall llama-cpp-python && CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python`


Thank you. I had to reduce the context length to get this to work without crashing (from 16k to 8k)—and I'm seeing the ~100% speed up you mentioned.

However, when I run the LLM, OS X becomes sluggish. I assume this is because the GPU is utilized to the point where hardware-based rendering slows down due to insufficient resources.

I wonder if there's a way to avoid that slowdown?


I haven't noticed any slowdowns. Maybe check that threads/n_threads is set correctly for your machine (total cores minus 2: 10 cores = 8, 8 cores = 6).

n_gpu_layers should also be set to anything other than 0 (the default). I don't think the exact number matters for Metal, but I use 128.
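
For example, if you're going through text-generation-webui's llama.cpp loader, those settings map to command-line flags (flag names as of recent versions; check --help on your install):

  python server.py --threads 8 --n-gpu-layers 128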


Are there any LLMs that run on regular (AMD/Intel) CPUs? Or does everything require at least an M1 or a decent GPU?


You can absolutely run LLMs without a GPU, but you need to set expectations for performance. Some projects to look into are:

  * llama.cpp - https://github.com/ggerganov/llama.cpp
  * KoboldCpp - https://github.com/LostRuins/koboldcpp
  * GPT4All - https://gpt4all.io/index.html

llama.cpp will run LLMs that have been converted to the GGUF format. If you have enough RAM, you can even run the big 70-billion-parameter models. If you have a CUDA GPU, you can also offload part of the model onto the GPU and have the CPU do the rest, which gives you a partial performance benefit.

The issue is that the big models run too slowly on a CPU to feel interactive. Without a GPU, you'll get much more reasonable performance running a smaller 7-billion-parameter model instead. The responses won't be as good as those from the larger models, but they may still be good enough to be worthwhile.
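
As a concrete example, a CPU-only run of a 7B coding model with llama.cpp looks something like this (the model filename is just an illustration; use whatever GGUF you downloaded):

  ./main -m models/codellama-7b-instruct.Q4_K_M.gguf -t 8 -n 256 \
    -p "Write a Python function that checks whether a string is a palindrome."

On a reasonably modern desktop CPU, a quantized 7B model is usually responsive enough for this kind of use.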

Also, development in this space is still moving extremely rapidly, especially for specialized models like the ones tuned for coding.


They do run, just slowly. Still better than nothing if you want to run something larger than would fit in your VRAM, though. The llama.cpp project is the most popular runtime, but I think all the major ones have a flag like "--cpu".



A demo would be nice.


How does it actually compare to GitHub Copilot?



