
I'm a bit worried the LLaMA leak will make the labs much more cautious about who they distribute models to for future projects, closing down things even more.

I've had tons of fun implementing LLaMA, learning and playing around with variations like Vicuna. I learned a lot and probably wouldn't have got so interested in this space if the leak didn't happen.



On the other side of the coin, they've diverted a huge amount of attention away from OpenAI, and they now have open source optimisations appearing for every platform they could ever want to run it on, at no extra expense.

If it was a deliberate leak, it was a good idea.


That's a good point. They knew they couldn't compete with ChatGPT (even if performance was comparable, GPT has a massive edge in marketing) so they did the next best thing. This gives Meta a massive boost both to visibility and to open source contributions that ironically no other business can legally use.


If it was deliberate then why "leak" it instead of open sourcing it?


You avoid taking flak from the Responsible AI people that way


Ding ding ding. "Leaks" are sometimes a strategy play.


As I mentioned in my comment, a leak means that no other company (your competition) can use it, and you get to integrate all the improvements made by other people on it back into your closed source product.


They clearly expected the leak; they distributed it very widely to researchers. The important thing is the licence, not the access: you are not allowed to use it for commercial purposes.


How could Meta ever find out your private business is using their model without a whistleblower? It's practically impossible.


This is an old playbook from Facebook, where the company creates rules that they know they cannot detect violations of.

This gives the company plausible deniability while still allowing ~unrestricted growth.

Persistent storage (in violation of the TOS) and illicit use of Facebook users’ personal data were available to app developers for a long time.

It encouraged development of viral applications while throwing off massive value to those willing to break the published rules.

This resulted in outsized and unexpected repercussions though, including the Cambridge Analytica scandal.

People should be as wary of this development as they are enthused by it. The power is immense and the potential for abuse far from understood.


You are certainly partly right, but it's also about liability. Those models might output copyrighted information, which Facebook doesn't want to get sued over. So they restrict the model to research use. If someone uses it to reproduce copyrighted work, Facebook isn't responsible.


OpenAI faces the same liability concerns, though. I think IP concerns are low on the list, given the past success of playing fast and loose with the emergent capabilities of new tech platforms.

For example, WhatsApp’s grey-hat use of the smartphone address book.

The US government also has a stake in unbridled growth and seems, in general, to give a pass to businesses exploring new terrain.


I think you can make that argument for all behind-the-scenes commercial copyright infringement, surely?


Have reasonable suspicion, sue you, and then use discovery to find any evidence at all that your models began with LLaMA. Oh, you don't have substantial evidence for how you went from 0 to a 65B-parameter LLM base model? How curious.


Fell off the back of a truck!


Recovered it from a boating accident.


Yes, that's how software piracy has always worked.


You can just ask if there is no output filtering


The future is going to be hilarious. Just ask the model who made it!


Does the model know, or will it just hallucinate an answer?


Probably both.


Same way anti-piracy worked in the 90s: cash payouts to whistleblowers. Yes, those whistleblowers are guaranteed to be fired employees with an axe to grind.


LLaMA was trained on Books3, which is a source of pirated books.

So either it is very hypocritical of them to invoke the DMCA while the model itself is illegal, or they are trying to curb its spread because they know it is illegal.

Anyway, since the training code and data sources are open source, you 'could' have trained it yourself. But even then, you are still exposed on the pirated-books part.


An alternative interpretation is that the LLaMA leak was an effort to shake or curtail ChatGPT's viral dominance at the time.


"And as long as they’re going to steal it, we want them to steal ours. They’ll get sort of addicted, and then we’ll somehow figure out how to collect sometime in the next decade".

That was, ironically, Bill Gates:

https://www.latimes.com/archives/la-xpm-2006-apr-09-fi-micro...



If the copyright office determines model weights are uncopyrightable (huge if), then one might imagine any institutional leak would benefit everyone else in the space.

You might see hackers, employees, or contractors leaking models more frequently.

And since models are distilled functionality (no microservices and databases to deploy), they're much easier to run than a constellation of cloud infrastructure.
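
To make that concrete, here's a minimal sketch in plain PyTorch (a toy layer stands in for a real architecture; this is not the actual LLaMA loading code). The whole deployable artifact is one weights file plus a few lines to load it:

    # Sketch only: a toy layer stands in for a real LLM architecture.
    import torch

    model = torch.nn.Linear(4096, 4096)
    torch.save(model.state_dict(), "weights.pth")   # the entire deployable artifact

    restored = torch.nn.Linear(4096, 4096)
    restored.load_state_dict(torch.load("weights.pth", map_location="cpu"))
    restored.eval()  # ready for inference; no databases or microservices to stand up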


Even if the weights are copyrighted, running one more epoch of fine-tuning will result in different weights. At a certain point, they'd have to copyright the shapes of the weight vectors.
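
As a purely illustrative sketch (toy PyTorch model and a dummy objective, nothing to do with the real LLaMA code): even a single extra optimizer step leaves you with weights that no longer match byte-for-byte.

    # Sketch only: one extra gradient step changes every weight tensor.
    import hashlib
    import torch

    model = torch.nn.Linear(16, 16)
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    def digest(m):
        h = hashlib.sha256()
        for p in m.parameters():
            h.update(p.detach().numpy().tobytes())
        return h.hexdigest()

    before = digest(model)
    model(torch.randn(4, 16)).pow(2).mean().backward()  # dummy loss
    opt.step()
    print(before == digest(model))  # False: "the same model", different weights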


That is uncertain. As with coding, where you need clean-room methods to prove that new code is not contaminated by a patented implementation, the same might apply here, so anything based on an existing model could also be covered.


Clean room implementation is not a defense against patents, it is a defense against copyright infringement.


To a new model trained on it, the original model isn't code, it's training data; just like the Books3 dataset of pirated books that Facebook used to train LLaMA.

The training code is Apache 2.0 licensed, so it can be copied and modified freely, including for commercial purposes. https://github.com/facebookresearch/llama


If you see generating model weights the same way as generating an executable binary from source code, then sure.

But AFAIK that is just the first step to get initial weights; afterwards you need much more work to fine-tune them before the model gives useful results.

I think this step could be seen as contaminating weights with copyrighted content.

Something like how Chrome is copyrighted but Chromium is not.

I'm not a lawyer, so I'm not that well informed about how the official definitions apply here, but what I'm trying to say is that I wouldn't be surprised if this went either way.


With so much money and so many competing interests involved, it'll take decades for this to wind its way through the courts, and by then there's a good chance we'll have strong AI and all such concerns will be moot.


Shouldn't that be the default position? The training methods are certainly patentable, but the actual input to the algorithm is usually public domain, and outputs of algorithms are not generally copyrightable as new works (think of to_lowercase(Harry Potter), which is not a copyrightable work), so the model weights would be a derivative work of public domain materials, and hence also forced into the public domain from a copyright perspective.
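
To spell out how mechanical that transform is (trivial Python, purely illustrative):

    # A purely mechanical transform: no creative choices sit between
    # input and output, which is the crux of the argument above.
    def to_lowercase(text: str) -> str:
        return text.lower()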

They are generally trade secrets now, which is what actually protects them. Leaks of trade secrets are serious business regardless of the IP status of the work otherwise.


I like your legal interpretation, but it's way too early to tell if it is one that accurately represents the reality of the situation.

We won't know until this hits the courts.


For what it's worth, I've been working on a startup that involves training some models, and this is likely how we're going to be treating the legal stuff (and being very careful about how customers can interact with the models as a consequence). I assume people who have different incentives will take a different view, though.


> the model weights would be a derivative work of public domain materials, and hence also forced into the public domain from a copyright perspective.

I don’t think “Public domain” means what you think it means.


Yes, the person to whom you are responding appears to be mixing up "publicly available" (made available to general public) with "public domain" (not protected by copyright).

IANAL but, I think, as far as US law goes, they have the right conclusion for the wrong reasons. Unsupervised training is an automated process, and the US Copyright Office has said [0] that the product of automated processes can't be copyrighted. While that statement was focused on the output of running an AI model, not the output of its training process (the parameters), I can't see how – for a model produced by unsupervised training – the conclusion would be any different.

This is probably not the case in many non-US jurisdictions, such as the EU, UK, Australia, etc – all of which have far weaker standards for copyrightability than the US does. It may not apply for supervised training – the supervision may be sufficient human input for copyrightability even in the US. It may not apply for AI models trained from copyrighted datasets, where the copyright owner of the dataset is claiming ownership of the model – that is not the case for OpenAI/Google/Meta/etc, who are all using training datasets predominantly copyrighted by third parties, but maybe Getty Images will build their own Stable Diffusion-style AI based on their image library, and that might give them a way of copyrighting their model which OpenAI/Google/Meta/etc lack.

It is always possible that US Congress will amend the law to make AI parameters copyrightable, or introduce some sui generis non-copyright legal protection for them, like the semiconductor mask work rights which were legislated in response to court rulings that semiconductor masks could not be copyrighted. I think the odds are reasonably high they will in fact do that sooner or later, but nobody knows for certain how things will pan out.

[0] https://www.federalregister.gov/documents/2023/03/16/2023-05...


> the product of automated processes can't be copyrighted.

That output could still be covered by copyright: in the case where the input is covered by copyright, the product/output may be considered a derivative work, in which case the output is still covered by the same copyright as the input. Your argument just explains why the output will not gain any additional copyright coverage.


The EU and most of the world require human authorship too. The UK instead maintains the view that the model's operator gets the copyright.


The copyright office already determined that AI artifacts are not covered by copyright protections. Any model created through unsupervised learning is this kind of artifact. At the same time, they determined that creations that mix AI artifacts with human creation are covered by copyright protection.


Devil's Advocate: The EU comes down hard on any AI company that doesn't work with researchers and institutions in future.


Outright banning due to fear seems far more likely.


I mean, it's a good power tool: it cuts fast with little effort.

But what's it gonna do in the hands of your parents or kids? When it gets things wrong, it could have a far worse impact if it's integrated into government, health care, finance, etc.



