Maybe by bureaucrats in the OSI. Meta, through the Llama models, has done more for open source LLMs than just about anyone else, and the community recognises that.
Other industries have seen themselves "poisoned" by vendor-specific definitions of networking protocols, programming paradigms and languages, and so on. The concern is real. You can't just say "Well, I like this product so I approve of the vendor's actions."
The whole crux is to get an objective definition of what "open AI" is and how it needs to behave. Whether others agree or disagree with the vendor, that doesn't mean the vendor is permitted to turn its unilateral de facto declaration into a de jure consensus decision.
That’s a nice sentence you’ve written there, but what exactly is it supposed to mean?
Is it supposed to give the false impression that OpenAI and Meta’s actions have been equivalent with respect to releasing their models? And if it isn’t an attempt to mislead in such a way, then what exactly does it mean?
OpenAI is absolutely a misleading name: they have lax policies on human inputs (stealing everyone's IP and not even letting you truly delete your account), but then turn around and claim that "developing models that compete" (vague as fuck) using OpenAI chatbot outputs is "illegal, harmful, or abusive" (and if it is, they only act like it when it benefits them).
I guess they're trying to call out OpenAI more than group them with Meta.
You don't want AI data to be open? You don't want massive libraries of open knowledge spanning all cultures accessible finally to anyone in the world in any language?
Yes, some data is open and the creators of it want it to be shared openly.
Other people want their copyrights upheld and for there to be no intellectual property appropriation (theft) of their creations.
There is also the issue of aggregation. For example, Wikipedia is an open and free aggregation of knowledge. But it also cites, quotes, and links out to sources which may be copyrighted works.
Note that Wikipedia has very formal guidance as to how to avoid and handle instances of plagiarism.
AI needs to face the fact that it is rife with intellectual property theft. It needs standards for infringed copyright removal. It needs standards for the defense of trademarks. It needs privacy standards to prevent the unauthorized use of personal data, images, recordings and other works from people around the globe.
If AI is permitted to simply glut itself on every artistic creation and every human on the planet, sans limitations, it is basically the equivalent of a crime spree.
The gist of the article, which is roundly negative towards Meta and Mark Zuckerberg personally (simply because he's an easy target and they want to score some cheap points), is that they are actively causing harm by not releasing the data, training methods, and everything else that goes into the creation and use of the models.
So the choice is: give up everything that gives you any advantage, and hence any incentive to develop anything in the first place, or you're the bad guy.
I’m not sure how they think the world works. But I agree that we should be clear about our standards. So I’d suggest that those with these expectations adopt a new brand other than “open source”. Maybe “puritan approved”.
"Model available" or "open weights" are right there. Nothing puritanical about it. The standards for calling something open source are relatively simple: you have to share the source to begin to call it open source. They have not done so, and asking them to use a different name when they're not sharing the source really doesn't seem like too much to ask.
Except it's Mark and Meta, which are household names, and I'm some Internet rando, so my voice is smaller, hence the accusations of bullying. If someone decides for you that your new name is shithead, and everyone goes along with it because that person's bigger than you, that's bullying.
I observe a weird pattern in your replies: you twist the intentions of OPs into things they didn't intend.
In this particular case, Meta's models are not "open source", although Zuck calls them that. And because of that, people have also started to believe they are open source.
The poster in this case used a metaphor where calling Meta a bully to the open source community was justified by likening the situation to someone being forcibly renamed “shithead”, presumably by a bully on the schoolyard. I’m asking if he really thinks that’s an appropriate comparison. The answer of course is no, and if the answer is no, then why did they choose it? To cast the situation in a dishonestly negative light. So who’s twisting what? Maybe you need to get better at reading intentions.
I disagree with the parent's choice of metaphor because it was imprecise; the situation called for a metaphor about a lie, not an insult. Nevertheless, the intent is easy to understand: the powerful twist the meanings of words and the weak can do nothing about it.
Please call it Open Weights and not Open Source. Open Source AI will be nothing like Open Weight AI.
Open Source AI will be able to reference and share its underlying truth data. Imagine an Open Source AI trained on all GPL code or a public library: it could guide you straight to the relevant passage in the source of any code or book it knows.
I agree with the other poster who said something to the effect that the model is open source and the released weights are, well, open weight. But the distinction is so trivial that I think it highlights the stupidity of this whole thing.
The distinction in this case is decidedly not trivial. Open weights are great, we can fine tune them and mold them to our needs. But we don't have the training code, training data, etc, to be able to reproduce or tweak things at a more fundamental level.
"But not everyone is going to spend the money or time to train their own models from scratch!" I hear you say. And there's some truth to that. But if we had truly open source LLMs, then AuroraGPT development would be fast tracked and we would have a fully government funded scientific frontier model instead of only fine-tuning models.
Actually, in the case of models, training data is functionally ~= code, as the final item cannot be reproduced without it. And, as the open source definition states in its criteria 2: "Intermediate forms such as the output of a preprocessor or translator are not allowed."[0]. The intermediate output in this case being the weights.
I don’t think open source authors have a responsibility to also open source their binary. And if they do happen to provide a binary they aren’t also required to open source their compiler. Usually just the source code. I personally feel if they opened source only the code and not any model weights, it would still be fair to call it open source because “open source” refers to the code itself, not artifacts produced by the code, or possible inputs to the code.
It's a matter of intention. The intention of open source is to allow anyone to sufficiently reproduce some given artifact, at no/reasonable charge (as stated in the definition). Yes, the source code for a binary must be made available if it is to follow the OSD. No, the compiler doesn't need to be open sourced, unless it's an integral part of the program (a compiler would also have to be made available if there's none generally available for the source language).
This intention necessarily translates to models. In order to reproduce a program, the source code and a relevant compiler is what is required. In order to reproduce a model, the architecture (source code), training data and initial parameters are required.
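To make the model case concrete, here's a toy sketch in PyTorch (purely illustrative, nothing Llama-specific): given the architecture code, the exact training data, and the initial parameters (fixed here by a seed), the weights fall out deterministically, and withholding any one ingredient breaks reproduction.

    import torch

    torch.manual_seed(42)  # fixes the initial parameters (and, here, the synthetic data)

    class ToyModel(torch.nn.Module):  # the architecture: this part is the "source code"
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(8, 1)

        def forward(self, x):
            return self.linear(x)

    model = ToyModel()
    # stand-in for the curated training data; for a real model, withholding
    # this dataset makes the released weights impossible to rebuild
    data = [(torch.randn(8), torch.randn(1)) for _ in range(100)]

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for x, y in data:
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    # same seed + same data + same code => the same weights on every re-run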
Trivial? "The model is open source and the released weights are, well, open weight." I'm not sure why you don't want open source model data, or why you think the distinction is trivial. This sentence makes no distinction or point.
Open source as it stands is insufficient to capture all parts of the pipeline.
With code, you just compile and run it. Models require carefully curated training data (lots of it), training code, final weights, production inference code, etc.
The danger is the world gets drunk on Meta's models, but then Meta pulls the plug. Meta is doing the world a lot of good with Llama, but we'll be left in a precarious position should they stop being so generous. The next generation of LLMs might make Llama completely pointless.
Meta thinks a great deal about the strategy in this space. They license other weights and models as CC-BY-NC and other non-free licenses because they know that they have the SOTA models across several categories and that there's no other competition in the marketplace. They want to retain their advantage when it benefits them.
So what happens when the LLM space shifts and Meta no longer needs to appear to be open source? What if the value switches over to diffusion models and away from LLMs, which is an area where Meta shines?
Llama is a gift we can't replicate or properly repair and extend. And we need to be careful.
The point is cultivating dependency under the color of "open source" is not generally wise.
Users of the "open source" LLM do not have all the software freedoms normally associated with "open source."
They cannot rebuild it, for one thing. Important pieces are missing the "open" part.
Someone says, "here, take this awesome math library, it is open." Then users find out there is a brittle binary blob needed for the whole thing to work.
Nobody will want to build on or with it, because that blob works until it doesn't, and even when it does work, people are not sure what exactly it does.
Makes the whole thing a lot more like free beer.
In the scheme of things right now, particularly given both the pace of change and the upcoming corruption from feeding models their own output, anyone's latest model may just not stay relevant for long.
I don't care for the "bureaucrats in the OSI" nor the definition of open source, but your argument is complete nonsense.
The fact that Llama is not open source does not make it less useful or less appreciated. Nor should anything be considered open source just because it is useful or appreciated.
Nice that the Economist picks this up. In addition to the article, which is 100% correct, Meta sucks the air out of the room with PyTorch, which stifles other true open source efforts. Instagram has way too much influence on Python, where a couple of companies have wrestled away control over the org from dozens of independent developers and push their pet projects of questionable quality.
So the open source community in this metaphor are the people screaming "That's not really nudity!" at the people having fun? Don't quit your day job, journo.
I'm a little confused what the opening paragraph has to do with the rest of the article.
The OSI's definition is still new. In fact, it's still in draft form, so this debate is a bit premature. I suspect that companies will begin releasing the code used to train their models once it is finalized (they will never disclose the training data, for legal reasons, and forcing them to is a losing battle).
"Which raises the tantalising question: will Zuck ever have the pluck to bare it all?"
This last sentence gave me a headache. He does not need to publish a bunch of possibly proprietary data owned by Meta, so no, he won't bare it all. Data isn't free.
What are the consequences as far as releasing the weights but not the training data? From what I understand, Llama and other open-weight models can be freely used and modified. What can people not do presently, but could do with an open-data model?
You can’t reproduce it from scratch. That’s about all. The fact that this is effectively impossible anyway without tens of millions in funding to pay for compute apparently doesn’t factor into it.
I don’t think this is an issue about open source, or pragmatism or utility or even ideology very much. There’s one big tech company that makes open source models available. Instead of rewarding this behaviour, some people think that attacking them for it instead is the best response.
I think this is a primarily social phenomenon which really just boils down to “meta bad” with various post hoc justifications tacked on.
I find this obsession with distinguishing between open weight and open source foolish and counterproductive.
The model architecture and infrastructure are open source. That is what matters.
The fact that you get really good weights that result from millions of dollars of GPU time on extremely expensive-to-procure proprietary datasets is amazing, but even that shouldn't be a requirement to call this open source. That is literally just an output of the open source model trained on non-open source inputs.
I find it absurd that if I create a model architecture, publish my source code, and slap an open source license on it, I can call that open source…but the moment I publish some weights that are the result of running the program on some proprietary dataset, all of a sudden I can’t call it open source anymore.
> I find it absurd that if I create a model architecture, publish my source code, and slap an open source license on it, I can call that open source…but the moment I publish some weights that are the result of running the program on some proprietary dataset, all of a sudden I can’t call it open source anymore.
Then you don't understand open source. You would be distributing something that could not be reproduced because its source was not provided. The source to the product would not be open. It's that simple. The same principle has always applied to images, audio and video. There's no reason for you to be granted a free pass just because it's a new medium.
1. I publish the source code to a program that inputs a list of numbers and outputs the sum into a text file. License is open source. Result according to you: this is an open source program.
2. Now, using that open source program, I also publish an output result text file after feeding it a long list of input numbers generated from a proprietary dataset. I even decide to publish this result and give it an open source license. Result according to you: this is NO LONGER AN OPEN SOURCE PROGRAM!!?!!
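Spelled out in code, the two steps look like this (a trivial illustrative script; sum.py and the file names are made up):

    # sum.py -- step 1: the open source program
    import sys

    def total(numbers):
        return sum(numbers)

    if __name__ == "__main__":
        nums = [float(line) for line in sys.stdin if line.strip()]
        with open("result.txt", "w") as f:  # step 2's artifact: an output, not source
            f.write(str(total(nums)))

Step 2 is just running python sum.py < proprietary_numbers.txt. Nothing about sum.py's source changes; the only new artifact is result.txt, an output derived from inputs nobody else has.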
How does that make any fucking sense?
You have the model and can use it any way you want. The model can be trained any way you want. You can plug in random weights and random text input if you want to. That is an open source model. The act of additionally publishing a bunch of weights that you can choose to use if you want to should not make that model closed source.
Programs aren't the only things that can be open-source. LLMs are made of more than just programs, and some of those non-program parts are not open source. If any part of something is not open source, the whole of that thing is not open source, by definition. Therefore any LLM that has any parts that are not open source is not open source, even if some parts of it are open source.
That simply isn't true. LLM weights are an output of a model training process. They are also an input of a model inference process. Providing weights for people to use does not change the source code in any way, shape, or form.
While a model does require weights in order to function, there is nothing about the model that requires you to use any weights provided by anybody, regardless of how they were trained. The model is open source. You can train a Llama 3.1 model from scratch on your proprietary collection of alien tentacle erotica, and you can do so precisely because the model is open source. You can claim that the weights themselves are not open source, but that says absolutely nothing about the model being open source. The weights distributed are simply not required.
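That isn't hypothetical. Assuming the Hugging Face transformers library, a freshly random-initialized Llama-architecture model is a few lines (a sketch; the config sizes here are arbitrary):

    # A randomly initialized Llama-style model: no Meta-provided weights involved.
    from transformers import LlamaConfig, LlamaForCausalLM

    config = LlamaConfig(      # architecture hyperparameters, arbitrarily small here
        hidden_size=512,
        intermediate_size=1376,
        num_hidden_layers=8,
        num_attention_heads=8,
        vocab_size=32000,
    )
    model = LlamaForCausalLM(config)  # weights start random; train on whatever you like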
But more importantly, under your definition, there will never exist in any form a useful open source set of weights. Because almost all data is proprietary. Anybody can train on large quantities of proprietary data without permission using fair use protections, but no matter what you can't redistribute it without permission. Any weights derived from training a model on data that can be redistributed by a single entity would inherently be so tiny that it would be almost useless. You could create a model with a few billion parameters that could memorize it all verbatim.
Open weights can be useful, and they can be a huge boon to users that don't have the resources to train large models, but they aren't required for any meaningful definition of open source.
> But more importantly, under your definition, there will never exist in any form a useful open source set of weights. Because almost all data is proprietary. Anybody can train on large quantities of proprietary data without permission using fair use protections, but no matter what you can't redistribute it without permission. Any weights derived from training a model on data that can be redistributed by a single entity would inherently be so tiny that it would be almost useless. You could create a model with a few billion parameters that could memorize it all verbatim.
That may very well be so. We'll see what the future holds for us.
The training data used to create an LLM is as much a part of the LLM as the design notes and IDE used to create traditional software are a part of those projects.
I'm not sure how that relates to AI models. Freely distributing compiled binaries, but not the source code, means modification is extremely difficult. Effectively impossible without reverse-engineering expertise.
But I've definitely seen modifications to Llama 3.1 floating around. Correct me if I'm wrong, but open-weight models can be modified fairly easily.
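The modifications I've seen are mostly parameter-efficient fine-tunes of the released weights, no training data required. A minimal sketch, assuming the Hugging Face transformers and peft libraries (the model ID and hyperparameters are just examples):

    # LoRA fine-tuning: modify an open-weight model without ever seeing its training data.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
    lora = LoraConfig(
        r=16,                                # rank of the low-rank adapter matrices
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"], # attach adapters to the attention projections
    )
    model = get_peft_model(model, lora)      # only the adapter parameters are trainable
    model.print_trainable_parameters()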
You’re right obviously. In time none of these complaints are going to matter. Open weight is pragmatic and useful and will be accepted by basically everyone.
Imagine an "Open Source" & "Open Weight" model that could give you an HTTP link to the source of any idea it knows, directly from its public training database.
The perfect is the enemy of the good, as usual.