I'm a bit worried the LLaMA leak will make the labs much more cautious about who they distribute models to for future projects, closing things down even more.
I've had tons of fun implementing LLaMA, learning and playing around with variations like Vicuna. I learned a lot and probably wouldn't have got so interested in this space if the leak didn't happen.
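For anyone who wants to try the same thing: once you have the weights converted to the Hugging Face format, getting text out of a LLaMA-family model is only a few lines. Below is a rough sketch using the `transformers` LLaMA classes; the local path is just a placeholder for wherever your converted checkpoint lives, so treat the details as assumptions rather than a recipe.

    # Minimal sketch: load a locally converted LLaMA/Vicuna-style checkpoint
    # and generate a short completion. The path below is a placeholder.
    from transformers import LlamaForCausalLM, LlamaTokenizer

    model_path = "./llama-7b-hf"  # hypothetical local directory with converted weights
    tokenizer = LlamaTokenizer.from_pretrained(model_path)
    model = LlamaForCausalLM.from_pretrained(model_path)

    inputs = tokenizer("The LLaMA leak meant that", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=30)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))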
On the other side of the coin, they've drawn a huge amount of attention away from OpenAI, and they have open-source optimisations appearing for every platform they could ever consider running it on, at no extra expense.
That's a good point. They knew they couldn't compete with ChatGPT (even if performance was comparable, GPT has a massive edge in marketing) so they did the next best thing. This gives Meta a massive boost both to visibility and to open source contributions that ironically no other business can legally use.
As I mentioned in my comment, a leak means that no other company (your competition) can use it, and you get to integrate all the improvements made by other people on it back into your closed source product.
They clearly expected the leak; they distributed it very widely to researchers. The important thing is the licence, not the access: you are not allowed to use it for commercial purposes.
You are certainly partly right, but it's also about liability. Those models might output copyrighted information, which Facebook doesn't want to get sued over. So they restrict the model to research use. If someone uses it to replicate copyrighted work, they are not responsible.
OpenAI faces the same liability concerns, though. I think IP concerns are low on the list given the past success of playing fast and loose with the emergent capabilities of new tech platforms.
For example, WhatsApp’s greyhat use of the smartphone address book.
The US government also has a stake in unbridled growth and seems, in general, to give a pass to businesses exploring new terrain.
Have reasonable suspicion, sue you, and then use discovery to find any evidence at all that your models began with LLaMA. Oh, you don't have substantial evidence for how you went from 0 to a 65B-parameter LLM base model? How curious.
Same way anti-piracy worked in the 90s: cash payouts to whistleblowers. Yes, those whistleblowers are guaranteed to be fired employees with an axe to grind.
LLaMA uses books3, which is a source of pirated books, to train the model.
So either it is very hypocritical of them to apply the DMCA while the model itself is illegal, or they are trying to somewhat limit the spread because they know it is illegal.
Anyway, since the training code and data sources are open source, you 'could' have trained it yourself. But even then, you are still at risk over the pirated books part.
"And as long as they’re going to steal it, we want them to steal ours. They’ll get sort of addicted, and then we’ll somehow figure out how to collect sometime in the next decade".
If the copyright office determines model weights are uncopyrightable (huge if), then one might imagine any institutional leak would benefit everyone else in the space.
You might see hackers, employees, or contractors leaking models more frequently.
And since models are distilled functionality (no microservices and databases to deploy), they're much easier to run than a constellation of cloud infrastructure.
Even if the weights are copyrighted, running one more epoch of fine-tuning will result in different weights. At a certain point, they'd have to copyright the shapes of the weight vectors.
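To make that concrete, here's a toy PyTorch sketch (a tiny linear layer standing in for an LLM, nothing LLaMA-specific, purely illustrative): a single additional optimisation step already leaves different numbers in every weight tensor, while the shapes and architecture stay identical.

    # Toy illustration: one extra training step changes the weights,
    # but not their shapes. A 16x16 linear layer stands in for a full model.
    import torch

    torch.manual_seed(0)
    model = torch.nn.Linear(16, 16)
    before = model.weight.detach().clone()

    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    x, y = torch.randn(4, 16), torch.randn(4, 16)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()

    after = model.weight.detach()
    print(torch.equal(before, after))   # False: every value has shifted
    print(before.shape == after.shape)  # True: only the "shape" is unchanged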
It is uncertain; as with coding, you need clean-room methods to prove that new code is not contaminated with a patented implementation, and the same might apply here, so anything based on an existing model could also be copyrighted.
If you see generating model weights in the same way as generating an executable binary from source code, then sure.
But AFAIK that is just the first step, to get the initial weights; later you need much more work to fine-tune them to get useful results from the model.
I think this step could be seen as contaminating the weights with copyrighted content.
Something like how Chrome is copyrighted but Chromium is not.
I'm not a lawyer, so I'm not that well informed about how the official definitions apply here, but what I'm trying to say is that I wouldn't be surprised if this went either way.
With so much money and so many competing interests involved, it'll take decades for this to wind its way through the courts, and by then there's a good chance we'll have strong AI and all such concerns will be moot.
Shouldn't that be the default position? The training methods are certainly patentable, but the actual input to the algorithm is usually public domain, and outputs of algorithms are not generally copyrightable as new works (think of to_lowercase(Harry Potter), which is not a copyrightable work), so the model weights would be a derivative work of public domain materials, and hence also forced into the public domain from a copyright perspective.
They are generally trade secrets now, which is what actually protects them. Leaks of trade secrets are serious business regardless of the IP status of the work otherwise.
For what it's worth, I've been working on a startup that involves training some models, and this is likely how we're going to be treating the legal stuff (and being very careful about how customers can interact with the models as a consequence). I assume people who have different incentives will take a different view, though.
Yes, the person to whom you are responding appears to be mixing up "publicly available" (made available to general public) with "public domain" (not protected by copyright).
IANAL but, I think, as far as US law goes, they have the right conclusion for the wrong reasons. Unsupervised training is an automated process, and the US Copyright Office has said [0] that the product of automated processes can't be copyrighted. While that statement was focused on the output of running an AI model, not the output of its training process (the parameters), I can't see how – for a model produced by unsupervised training – the conclusion would be any different.
This is probably not the case in many non-US jurisdictions, such as the EU, UK, Australia, etc – all of which have far weaker standards for copyrightability than the US does. It may not apply for supervised training – the supervision may be sufficient human input for copyrightability even in the US. It may not apply for AI models trained from copyrighted datasets, where the copyright owner of the dataset is claiming ownership of the model – that is not the case for OpenAI/Google/Meta/etc, who are all using training datasets predominantly copyrighted by third parties, but maybe Getty Images will build their own Stable Diffusion-style AI based on their image library, and that might give them a way of copyrighting their model which OpenAI/Google/Meta/etc lack.
It is always possible that US Congress will amend the law to make AI parameters copyrightable, or introduce some sui generis non-copyright legal protection for them, like the semiconductor mask work rights which were legislated in response to court rulings that semiconductor masks could not be copyrighted. I think the odds are reasonably high they will in fact do that sooner or later, but nobody knows for certain how things will pan out.
> the product of automated processes can't be copyrighted.
That output could still be covered by copyright: in the case where the input is covered by copyright, the product/output may be considered a derivative work, in which case the output is still covered by the same copyright as the input. Your argument just explains why the output will not gain any additional copyright coverage.
The copyright office already determined that AI artifacts are not covered by copyright protections. Any model created through unsupervised learning is this kind of artifact. At the same time, they determined that creations that mix AI artifacts with human creation are covered by copyright protection.
I mean it's a good power tool, cuts fast with little effort.
But what's it gonna do in the hands of your parents or kids? When it gets things wrong, it could have a far worse impact if it's integrated into government, health care, finance, etc.