OK, this is the basis for actual self-hosted production use of these models (if you don't care about licensing...). I've said in previous HN comments that we've been one Dockerfile built on an Nvidia base image away from this for a while now (I just never got around to it myself).
I love the .cpp, Apple Silicon, etc. projects, but IMO for the time being Nvidia is still king when it comes to multi-user production use of these models with competitive response time, parameter count/size, etc.
Of course, as others have pointed out, the quality of these models still leaves a lot to be desired, but this is a good start for the inevitable genuinely open models, fine-tuned variants, etc. that are being released on what seems like a daily basis at this point.
I'm walking through it now (fun weekend project!), but my dual RTX 4090 dev workstation will almost certainly scream with these (even though the VRAM isn't "great"). Over time, with better and better models (with compatible licenses), OpenAI's lead will get smaller and smaller.
I'm hitting ChatGPT speeds or faster on my 3090. I have the image running behind a reverse SSH tunnel to an EC2 instance that ferries requests from the web. It only took about four hours of an afternoon, and based on the trending Databricks article on HN we're probably only days away from a commercially licensed model.
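The tunnel itself is nothing fancy, just a standard reverse port-forward, roughly like the line below (hostname and ports are made up for illustration, and the EC2 side needs GatewayPorts enabled if you want the port exposed publicly):

    ssh -N -R 0.0.0.0:8080:localhost:7860 ubuntu@my-ec2-host

i.e. requests hitting port 8080 on the EC2 instance get forwarded back to the local web UI listening on port 7860.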
Bit of a tangent, but have you tried Cloudflare Tunnels for what you're doing? It's literally a one-liner to install cloudflared, and boom, the service is on the internet with Cloudflare in front. I've even used it in cases where my host was behind multiple layers of NAT; it just works. If you're concerned with speed and performance, I guarantee it will blow away your current approach (while giving you all of the other Cloudflare stuff). Of course, if you hate CF (fair enough), disregard :).
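To make it concrete, the quick-tunnel flavor is just (the port is whatever your local service listens on; 7860 here is only an example):

    cloudflared tunnel --url http://localhost:7860

and it prints a public trycloudflare.com URL. Named tunnels on your own domain take a couple more commands but are still trivial.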
I use this for an optimized hosted Whisper implementation I've been working on. It hits 120x realtime with large-v2 on a 4090 and uses WebRTC to stream the audio in realtime, with data channels for the ASR responses. Hopefully a "Show HN" soon once I get some legal stuff out of the way :). I mention it because AFAIK it's many multiples faster than the OpenAI-hosted Whisper (especially for "realtime" speech).
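To be clear, the snippet below isn't my actual pipeline (that's the part waiting on the Show HN); it's just the obvious faster-whisper baseline I'd suggest if you want to benchmark a 4090 yourself. The model name and audio path are placeholders.

    # Rough baseline for GPU Whisper throughput, assuming the faster-whisper package.
    from faster_whisper import WhisperModel

    # large-v2 in FP16 on a single CUDA GPU; int8_float16 also works if VRAM is tight.
    model = WhisperModel("large-v2", device="cuda", compute_type="float16")

    # transcribe() returns a generator of segments plus detected-language info.
    segments, info = model.transcribe("sample.wav", beam_size=5)
    for seg in segments:
        print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")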
I expect we'll see these kinds of innovations and more come to self-hosted approaches generally, and the open source community will pull a 1990s/early-2000s web hosting, Microsoft vs Linux/LAMP situation on OpenAI, where open source wins in the end. The fact that MS is so heavily invested in OpenAI is just history repeating itself.
Yep, saw the Databricks article! I don't try to make specific time predictions but you're probably not far off :).
This is neat and all, but Alpaca and LoRa are both things I already use and had already read about on HN; now their names have been bulldozed by LLM tech and things will never be the same.
>I am a 25-year-old woman from the United States. I have a bachelor's degree in computer science and am currently pursuing a master's degree in data science. I am passionate about technology and am always looking for new ways to use it to make the world a better place. Outside of work, I enjoy spending time with my family and friends, reading, and traveling.
Well, I was starting to get tired of the "as an AI language model" disclaimer. Out of curiosity, is this model meant to be a 25-year-old personal assistant?
This says "We provide an Instruct model of similar quality to text-davinci-003", but two paragraphs later says the output is comparable to Stanford's Alpaca. Those seem like very different claims.
"We performed a blind pairwise comparison between text-davinci-003 and Alpaca 7B, and we found that these two models have very similar performance: Alpaca wins 90 versus 89 comparisons against text-davinci-003."
The demo on Hugging Face with the pre-trained model doesn't seem that good.
Although it's better than Bard (btw, Bard sucks compared to ChatGPT and can't even do translations, which I would have expected out of the box from Google).
It's worth noting this is the 7B model (non-quantized). You can get this running on pretty much any GPU with 8GB of VRAM or more. You can run the 13B model, but that would take two GPUs or quantizing the FP16 weights down to 8-bit (I haven't tried it myself). A single ChatGPT session is rumored to require 8x A100s.
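If anyone wants to try the 8-bit route, this is roughly what the alpaca-lora README does; the Hugging Face repo IDs below are the ones floating around at the moment and may move, so treat them as placeholders:

    # Sketch: load LLaMA 7B in 8-bit and apply the Alpaca LoRA adapter on top.
    # Assumes transformers, peft, and bitsandbytes are installed.
    import torch
    from transformers import LlamaForCausalLM, LlamaTokenizer
    from peft import PeftModel

    base = "decapoda-research/llama-7b-hf"  # placeholder HF repo id
    tokenizer = LlamaTokenizer.from_pretrained(base)
    model = LlamaForCausalLM.from_pretrained(
        base,
        load_in_8bit=True,          # roughly 7-8 GB of VRAM instead of ~14 GB in FP16
        torch_dtype=torch.float16,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")  # LoRA weights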
I agree with OP. Only the non-quantized models have given me good results too. I've only used the 7B and 13B; I don't have enough compute to run the 65B.
Sorry, this is moving too fast for me. So if I understand correctly, LoRA kind of does what Alpaca does but using different data.
So what is Alpaca-LoRA? I know you get Alpaca by fine-tuning LLaMA on Stanford's Alpaca 52k instruction-following data. So if I'm guessing right, do you get Alpaca-LoRA by retraining Alpaca using LoRA's data?
I think your first statement is incorrect.
LoRA seems to be a fine-tuning method, not a different dataset: it freezes the base weights of a model like LLaMA and trains small low-rank adapter matrices on top.
This drastically reduces the number of trainable parameters (and the size of the adapter checkpoints), and therefore also the compute cost.
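In code terms, with Hugging Face's peft it looks roughly like this; the hyperparameters are just the common Alpaca-LoRA-style defaults, not anything official, and base_model is assumed to be an already-loaded LLaMA model:

    # Sketch: wrap an already-loaded causal LM with LoRA adapters via peft.
    from peft import LoraConfig, get_peft_model

    lora_config = LoraConfig(
        r=8,                                  # rank of the low-rank update matrices
        lora_alpha=16,                        # scaling factor for the updates
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base_model, lora_config)  # base weights stay frozen
    model.print_trainable_parameters()  # only a few million params train vs ~7B frozen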
Anyone else getting error messages when trying to submit instructions to the model on Hugging Face? It just says "Error", so I don't know if it's a "too many users" problem or something else.
edit: never mind, I was able to get a response after a few more tries, plus a ~20-second processing time.