OK, this is the basis for actual self-hosted production use of these models (if you don't care about licensing...). I've said in previous HN comments that we've been one Dockerfile built on an Nvidia base image away from this for a while now (I just never got around to it myself).
I love the .cpp, Apple Silicon, etc. projects, but IMO for the time being Nvidia is still king when it comes to multi-user production use of these models with competitive response time, parameter count/size, etc.
Of course, as others have pointed out, the quality of these models still leaves a lot to be desired, but this is a good start for the inevitable genuinely open models, fine-tuned variants, etc. that are being released on what seems like a daily basis at this point.
I'm walking through it now (fun weekend project!), but my dual RTX 4090 dev workstation will almost certainly scream with these (even though the VRAM isn't "great"). Over time, with better and better models (with compatible licenses), OpenAI's lead will get smaller and smaller.
I'm hitting ChatGPT speeds or faster on my 3090. I have the image running behind a reverse SSH tunnel to an EC2 instance that ferries requests from the web. It only took about four hours of an afternoon, and based on the trending Databricks article on HN we're probably only days away from a commercially licensed model.
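The tunnel itself is nothing fancy, just a standard reverse port-forward, roughly like the line below (hostname and ports are made up for illustration, and the EC2 side needs GatewayPorts enabled if you want the port exposed publicly):

    ssh -N -R 0.0.0.0:8080:localhost:7860 ubuntu@my-ec2-host

i.e. requests hitting port 8080 on the EC2 instance get forwarded back to the local web UI listening on port 7860.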
Bit of a tangent, but have you tried Cloudflare Tunnels for what you're doing? It's literally a one-liner to install cloudflared, and boom, the service is on the internet with Cloudflare in front. I've even used it in cases where my host was behind multiple layers of NAT; it just works. If you're concerned with speed and performance, I guarantee it will blow away your current approach (while giving you all of the other Cloudflare stuff). Of course, if you hate CF (fair enough), disregard :).
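To make it concrete, the quick-tunnel flavor is just (the port is whatever your local service listens on; 7860 here is only an example):

    cloudflared tunnel --url http://localhost:7860

and it prints a public trycloudflare.com URL. Named tunnels on your own domain take a couple more commands but are still trivial.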
I use this for an optimized hosted Whisper implementation I've been working on. It hits 120x realtime with large-v2 on a 4090 and uses WebRTC to stream the audio in realtime, with data channels for the ASR responses. Hopefully a "Show HN" soon once I get some legal stuff out of the way :). I mention it because AFAIK it's many multiples faster than the OpenAI-hosted Whisper (especially for "realtime" speech).
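To be clear, the snippet below isn't my actual pipeline (that's the part waiting on the Show HN); it's just the obvious faster-whisper baseline I'd suggest if you want to benchmark a 4090 yourself. The model name and audio path are placeholders.

    # Rough baseline for GPU Whisper throughput, assuming the faster-whisper package.
    from faster_whisper import WhisperModel

    # large-v2 in FP16 on a single CUDA GPU; int8_float16 also works if VRAM is tight.
    model = WhisperModel("large-v2", device="cuda", compute_type="float16")

    # transcribe() returns a generator of segments plus detected-language info.
    segments, info = model.transcribe("sample.wav", beam_size=5)
    for seg in segments:
        print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")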
I expect we'll see these kinds of innovations and more come to self-hosted approaches generally, and the open source community will pull a 1990s/early-2000s web hosting, Microsoft vs Linux/LAMP situation on OpenAI, where open source wins in the end. The fact that MS is so heavily invested in OpenAI is just history repeating itself.
Yep, saw the Databricks article! I don't try to make specific time predictions but you're probably not far off :).
This is neat and all, but Alpaca and LoRa are both things I already use and had already read about on HN; now their names have been bulldozed by LLM tech and things will never be the same.
>I am a 25-year-old woman from the United States. I have a bachelor's degree in computer science and am currently pursuing a master's degree in data science. I am passionate about technology and am always looking for new ways to use it to make the world a better place. Outside of work, I enjoy spending time with my family and friends, reading, and traveling.
Well, I was starting to get tired of the "as an AI language model" disclaimer. Out of curiosity, is this model meant to be a 25-year-old personal assistant?
This says "We provide an Instruct model of similar quality to text-davinci-003", but two paragraphs later says the output is comparable to Stanford's Alpaca. Those seem like very different claims.
"We performed a blind pairwise comparison between text-davinci-003 and Alpaca 7B, and we found that these two models have very similar performance: Alpaca wins 90 versus 89 comparisons against text-davinci-003."
The demo on Hugging Face with the pre-trained model doesn't seem that good.
Although it's better than Bard (btw, Bard sucks compared to ChatGPT and can't even do translations, which I would have expected out of the box from Google).
It's worth noting this is the 7B model (non-quantized). You can get this running on pretty much any GPU with 8GB of VRAM or more. You can run the 13B model, but that would take two GPUs or quantizing the FP16 weights down to 8-bit (I haven't tried it myself). A single ChatGPT session is rumored to require 8x A100s.
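If anyone wants to try the 8-bit route, this is roughly what the alpaca-lora README does; the Hugging Face repo IDs below are the ones floating around at the moment and may move, so treat them as placeholders:

    # Sketch: load LLaMA 7B in 8-bit and apply the Alpaca LoRA adapter on top.
    # Assumes transformers, peft, and bitsandbytes are installed.
    import torch
    from transformers import LlamaForCausalLM, LlamaTokenizer
    from peft import PeftModel

    base = "decapoda-research/llama-7b-hf"  # placeholder HF repo id
    tokenizer = LlamaTokenizer.from_pretrained(base)
    model = LlamaForCausalLM.from_pretrained(
        base,
        load_in_8bit=True,          # roughly 7-8 GB of VRAM instead of ~14 GB in FP16
        torch_dtype=torch.float16,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")  # LoRA weights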
I agree with OP. Only the non-quantized models have given me good results too. I've only used the 7B and 13B; I don't have enough compute to run the 65B.
Sorry, this is moving too fast for me. So if I understand correctly, LoRA kind of does what Alpaca does but using different data.
So what is Alpaca-LoRA? I know you get Alpaca by fine-tuning LLaMA on Stanford's Alpaca 52k instruction-following data. So if I'm guessing right, do you get Alpaca-LoRA by retraining Alpaca using LoRA's data?
I think your first statement is incorrect.
LoRA seems to be a fine-tuning method, not a different dataset: it freezes the base weights of a model like LLaMA and trains small low-rank adapter matrices on top.
This drastically reduces the number of trainable parameters (and the size of the adapter checkpoints), and therefore also the compute cost.
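In code terms, with Hugging Face's peft it looks roughly like this; the hyperparameters are just the common Alpaca-LoRA-style defaults, not anything official, and base_model is assumed to be an already-loaded LLaMA model:

    # Sketch: wrap an already-loaded causal LM with LoRA adapters via peft.
    from peft import LoraConfig, get_peft_model

    lora_config = LoraConfig(
        r=8,                                  # rank of the low-rank update matrices
        lora_alpha=16,                        # scaling factor for the updates
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base_model, lora_config)  # base weights stay frozen
    model.print_trainable_parameters()  # only a few million params train vs ~7B frozen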
Anyone else getting error messages when trying to submit instructions to the model on Hugging Face? It just says "Error", so I don't know if it's a "too many users" problem or something else.
edit: never mind, I was able to get a response after a few more tries, plus a ~20-second processing time.