Hacker News

I'm really excited about this project and I think it could be really disruptive. It is organized by LAION, the same folks who curated the dataset used to train Stable Diffusion.

My understanding of the plan is to take an existing large language model, pretrained with self-supervised learning on a very large corpus, and fine-tune it using reinforcement learning from human feedback (RLHF), the same method used for ChatGPT. Once the dataset they are creating is available, though, better methods could be developed rapidly, since it will democratize the ability to do basic research in this space. I'm curious how much more limited the systems they are planning to build will be compared to ChatGPT, since they plan to make models with far fewer parameters so they can be deployed on much more modest hardware.
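For anyone unfamiliar with RLHF, the gist of the final optimization step can be shown with a deliberately toy sketch: a softmax "policy" over four canned responses, nudged toward the response humans rated best. All the numbers here are invented, and real RLHF fits a learned reward model and uses PPO on an actual LM, but the direction of the update is the same idea:

```python
import numpy as np

# Toy "policy": a softmax over 4 canned responses to a single prompt.
logits = np.zeros(4)
# Invented human-feedback scores; response 2 was rated best.
human_reward = np.array([0.1, 0.2, 1.0, 0.0])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    # Exact policy gradient of expected reward: dE[r]/dlogit_a = p_a * (r_a - E[r]).
    logits += lr * probs * (human_reward - probs @ human_reward)

print(softmax(logits).argmax())  # -> 2, the top-rated response
```

The point is just that human preference scores, not next-token likelihood, drive the update.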

As an AI researcher in academia, it is frustrating to be blocked from doing a lot of research in this space due to computational constraints and a lack of the required data. I'm teaching a class this semester on self-supervised and generative AI methods, and it will be fun to let students play around with this in the future.

Here is a video about the Open Assistant effort: https://www.youtube.com/watch?v=64Izfm24FKA



> it is frustrating to be blocked from doing a lot of research in this space due to computational

Do we need a SETI@home-like project to distribute the training computation across many volunteers so we can all benefit from the trained model?


Long story short, training requires intensive device-to-device communication. Distributed training is possible in theory but so inefficient that it's not worth it. Here is a new paper that looks to be the most promising approach yet: https://arxiv.org/abs/2301.11913
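To see why the communication is the bottleneck, here's a back-of-the-envelope sketch. The model size matches the 1.5B-parameter models mentioned elsewhere in the thread; the link speed is just an assumed volunteer home connection:

```python
# Why naive data-parallel training over the internet hurts: every step,
# each worker must exchange a full set of gradients.
# Assumed numbers: 1.5B parameters, fp16 gradients, a 100 Mbit/s connection.
params = 1_500_000_000
grad_bytes = params * 2                 # fp16 = 2 bytes per gradient
payload_gb = grad_bytes / 1e9           # 3.0 GB of gradients per step
link_gbit_s = 0.1                       # 100 Mbit/s
sync_seconds = payload_gb * 8 / link_gbit_s
print(payload_gb, sync_seconds / 60)    # 3.0 GB, ~4 minutes per step
```

Minutes of network time per training step, versus fractions of a second inside a data center.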


It doesn’t, actually. The model weights can be periodically averaged with each other. No need for synchronous gradient broadcasts.

Why people aren’t doing this has always been a mystery to me.

Relevant: https://battle.shawwn.com/swarm-training-v01a.pdf
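As I understand the swarm-training idea, it boils down to something like this toy numpy sketch. The worker count, averaging interval, and the fake gradient step are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_step(weights, lr=0.01):
    # Stand-in for a real gradient step on a worker's local data shard.
    return weights - lr * rng.normal(size=weights.shape)

n_workers, dim, sync_every = 4, 8, 10
workers = [np.zeros(dim) for _ in range(n_workers)]

for step in range(1, 101):
    # Each worker trains independently -- no per-step gradient broadcast.
    workers = [local_step(w) for w in workers]
    if step % sync_every == 0:
        # Periodically average the weights themselves instead of gradients.
        avg = np.mean(workers, axis=0)
        workers = [avg.copy() for _ in range(n_workers)]

# After a sync point, every worker holds identical weights.
assert all(np.allclose(w, workers[0]) for w in workers)
```

Communication drops by a factor of `sync_every`, at the cost of the workers' weights drifting apart between syncs.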


You linked a paper with no results and no conclusion. Perhaps you meant to link a different paper?


I never finished it.


so it is unproven? what is the value of it?


It’s how we trained roughly 40 GPT 1.5B models. The technique works; it’s up to you to try it out.


The abstract mentions fine-tuning, not full pre-training?


Yeah, sorry for not being precise. We used the technique to fine-tune around 40 GPT 1.5B models, including the chess one.

It was very apparent that the technique was working well. The loss curve suddenly started dropping dramatically the first day we got it working.


I think the landscape has plenty to explore, and too few explorers able to wrap their wetware around all of it?


Wouldn't other signal propagation approaches, like Forward-Forward, make this easier?


Would have to be federated learning to work I think


That's brilliant, I would love to spare compute cycles and network on my devices for this if there's an open source LLM on the other side that I can use in my own projects, or commercially.

Doesn't feel like there's much competition for ChatGPT at this point otherwise, which can't be good.


On the generative image side of the equation, you can do the same thing with Stable Diffusion[1], thanks to a handy open source distributed computing project called Stable Horde[2].

LAION has started using Stable Horde for aesthetics training to back feed into and improve their datasets for future models[3].

I think one can foresee the same thing eventually happening with LLMs.

Full disclosure: I made ArtBot, which is referenced in both the PC World article and the LAION blog post.

[1] https://www.pcworld.com/article/1431633/meet-stable-horde-th...

[2] https://stablehorde.net/

[3] https://laion.ai/blog/laion-stable-horde/


> Doesn't feel like there's much competition for ChatGPT at this point otherwise, which can't be good.

Facebook open sourced their LLM, called OPT [1]. There's not much else, and OPT isn't exactly easy to run (requires like 8 GPUs).

I'm not an expert, so I don't know why some models, like the image generators we've seen, are able to fit on phones, while LLMs require $500k worth of GPUs to run. Hopefully this is the first step to changing that.

[1] https://ai.facebook.com/blog/democratizing-access-to-large-s...
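A rough illustration of the memory gap, counting fp16 weights only (parameter counts are approximate, and this ignores activations and other overhead):

```python
# Back-of-the-envelope: GB of GPU memory needed just to hold fp16 weights.
def fp16_weight_gb(params: int) -> float:
    return params * 2 / 1e9  # 2 bytes per parameter

sd_gb = fp16_weight_gb(1_000_000_000)     # Stable Diffusion, ~1B params
opt_gb = fp16_weight_gb(175_000_000_000)  # OPT-175B
print(sd_gb, opt_gb)   # 2.0 GB vs 350.0 GB
print(opt_gb / 80)     # ~4.4 A100-80GB cards for the weights alone;
                       # activations and overhead push the real count higher
```

So a ~1B-parameter diffusion model squeezes onto a phone, while a 175B-parameter LLM can't even fit its weights on one datacenter GPU.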



I've seen Petals mentioned several times before and I don't think it's the same thing. Correct me if I'm wrong, but it seems Petals is for running distributed inference and fine-tuning of an existing model. What the above poster and I really want to see is distributed training of a new model across a network.

Much like I was able to choose to donate CPU cycles to a wide variety of BOINC-based projects, I want to be able to donate GPU cycles to anyone with a crazy idea for a new ML model - text, image, finance, audio, etc.


I read about something a few weeks ago which does just this! Does anyone know what it's called?


you are probably thinking of https://arxiv.org/abs/2207.03481

for inference, there is https://github.com/bigscience-workshop/petals

however, both are only in the research phase. start tinkering!


Hell it could even be the proof of work for a usable crypto-currency. "Prove that you lowered the error rate compared to SOTA and earn 50 ponzicoins!"


The labelled data seems more of a blocker than anything else. As far as I'm aware, the actual NNs running the models are relatively simple; it's the human labor involved in gathering, cleaning, and labeling data for training that is the most resource-intensive part.


The data is valuable yes, but training a model still requires millions of dollars worth of compute. That's a perfect cost to distribute among volunteers if it could be done.


Yeah man, and you get access to the model as payment for donating cycles


Hyperion


Another idea is to dedicate CPU cycles to something else that is easier to distribute, and then use the proceeds to buy massive amounts of GPU time for academic use.

Crypto is an example.


This creates indirection costs and counterparty risks that don't appear in the original solution.


There is also an indirection cost in taking something that is optimized to run on GPUs within a data center and distributing it to individual PCs.


this would be very wasteful


So is trying to distribute training across nodes compared to what can be done inside a data center.


Yannic and the community he has built are such an educational force of good. His YouTube videos explaining papers have helped me and so many others as well. Thank you Yannic for all that you do!


> force of good

I think he cares more about freedom than "good". Many people were not happy about his "GPT-4chan" project.

(I'm not judging.)


I don't think those people legitimately cared about the welfare of 4chan users who were experimented on. They just perceived the project to be bad optics that might threaten the AI gravy train.


> It is organized by LAION, the same folks who curated the dataset used to train Stable Diffusion.

I'm guessing, like Stable Diffusion, it won't be under an open source licence then? (The Stable Diffusion licence discriminates against fields of endeavour.)


You are confusing LAION with Stability.ai. They share some researchers, but the former is a completely transparent and open effort which you are free to join and criticize this very moment. The latter is a VC-backed effort which does indeed have some of the issues you mention.

Good guess though...


The LICENSE file in the linked repo says it's under the Apache license.


Does this mean that contributions of data, labelling, etc. remain open?

I'm hesitant to spend a single second on these things unless they are truly open.


Yes. The intent is definitely to have the data be as open as possible. And Apache v2.0 is currently where it will stay. This project prefers the simplicity of Apache v2.0 and does not care for the RAIL licenses.


>> As an AI researcher in academia, it is frustrating to be blocked from doing a lot of research in this space due to computational constraints and a lack of the required data.

Computational constraints aside, the data used to train GPT-3 was mainly Common Crawl, which is made freely available by a non-profit org:

https://commoncrawl.org/big-picture/frequently-asked-questio...

>> What is Common Crawl?

>> Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.

So you just need to find the compute. If you have a class of ~30, it should only take about $150 to $450 million.

Or, you could switch your research and teaching to less compute- and data-intensive approaches? Just because OpenAI and DeepMind et al are championing extremely expensive approaches that only they can realistically use, that's no reason for everyone else to run behind them willy-nilly.


It's sad that, upon observing the success of downstream products such as SD, the creators have chosen to hoard the dataset and become the single producers of the downstream products as well.


> reinforcement learning from human feedback, which is the same method used in ChatGPT

Is this confirmed? I thought it was not so.


I don't see the relevance of 50k prompt-response pairs. With exponential combinations of words, this is on the level of what AIML did thirty years ago. Isn't ChatGPT trained on (b/)millions of Stack Overflow and forum responses?


Unfortunately that guy is too distracting for me to watch - he's like a bad 90s Terminator knock-off, always in your face waving his hands :(


While Yannic is also German, he is actually much better than 90s Terminator:

* he doesn’t want to steal your motorcycle

* he doesn’t care for your leather jacket either

* he is not trying to kill yo mama


Hate to be that guy, but Arnold Schwarzenegger is Austrian.



