It's amazing to me that "open source" has been so diluted that it is now used to mean "we will give you an opaque binary and permission to run it on your own computer."
Yes, training is left as an exercise for the user, but it's outlined in the paper, and a good ML engineer should be able to get started with it; cluster of GPUs not included.
There was an article saying they used hand-tuned PTX instead of CUDA, so it might be a bit hard to match just from the paper without some good performance experts.
CUDA isn't so bad that hand-writing PTX will give you a huge performance improvement, but when you're spending a few million dollars on training, it makes sense to chase even a single-digit percentage improvement, maybe more in a very hot code path. Also, these articles are based on a single mention of PTX in the paper.
"3.2.2. Efficient Implementation of Cross-Node All-to-All Communication
In order to ensure sufficient computational performance for DualPipe, we customize efficient
cross-node all-to-all communication kernels (including dispatching and combining) to conserve
the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific,
in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications
are handled via NVLink. NVLink offers a bandwidth of 160 GB/s, roughly 3.2 times that of IB
(50 GB/s). To effectively leverage the different bandwidths of IB and NVLink, we limit each
token to be dispatched to at most 4 nodes, thereby reducing IB traffic. For each token, when its
routing decision is made, it will first be transmitted via IB to the GPUs with the same in-node
index on its target nodes. Once it reaches the target nodes, we will endeavor to ensure that it is
instantaneously forwarded via NVLink to specific GPUs that host their target experts, without
being blocked by subsequently arriving tokens. In this way, communications via IB and NVLink
are fully overlapped, and each token can efficiently select an average of 3.2 experts per node
without incurring additional overhead from NVLink. This implies that, although DeepSeek-V3
selects only 8 routed experts in practice, it can scale up this number to a maximum of 13 experts
(4 nodes × 3.2 experts/node) while preserving the same communication cost. Overall, under
such a communication strategy, only 20 SMs are sufficient to fully utilize the bandwidths of IB
and NVLink.
In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition
20 SMs into 10 communication channels. During the dispatching process, (1) IB sending, (2)
IB-to-NVLink forwarding, and (3) NVLink receiving are handled by respective warps. The
number of warps allocated to each communication task is dynamically adjusted according to the
actual workload across all SMs. Similarly, during the combining process, (1) NVLink sending,
(2) NVLink-to-IB forwarding and accumulation, and (3) IB receiving and accumulation are also
handled by dynamically adjusted warps. In addition, both dispatching and combining kernels
overlap with the computation stream, so we also consider their impact on other SM computation
kernels. Specifically, we employ customized PTX (Parallel Thread Execution) instructions and
auto-tune the communication chunk size, which significantly reduces the use of the L2 cache
and the interference to other SMs."
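The dispatch scheme in that excerpt is easier to see as pseudocode. Here is a rough Python sketch of the routing logic; the cluster layout and every name in it are my own assumptions, not DeepSeek's actual implementation:

```python
from collections import defaultdict

GPUS_PER_NODE = 8        # assumed layout; the excerpt doesn't state this
MAX_NODES_PER_TOKEN = 4  # the paper's cap on IB fan-out per token

def plan_dispatch(src_gpu, expert_ids, expert_to_gpu):
    """Two-hop dispatch: one IB transfer per target node, then NVLink
    forwarding to the specific GPUs hosting the target experts."""
    local_idx = src_gpu % GPUS_PER_NODE   # "same in-node index" on target nodes
    per_node = defaultdict(list)
    for e in expert_ids:
        gpu = expert_to_gpu[e]            # global rank of the GPU hosting expert e
        per_node[gpu // GPUS_PER_NODE].append(gpu)

    # The MoE gating algorithm is co-designed to respect this cap.
    assert len(per_node) <= MAX_NODES_PER_TOKEN

    return [
        {"ib_target": node * GPUS_PER_NODE + local_idx,  # one slow IB hop per node
         "nvlink_targets": gpus}                         # fast intra-node fan-out
        for node, gpus in per_node.items()
    ]
```

The point of the node cap is visible here: however many experts a token is routed to, it crosses the slower IB link at most four times, and all remaining fan-out rides NVLink.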
It's definitely not the full model written in PTX or anything, but still some significant engineering effort to replicate, from people commanding 7-figure salaries in this wave, since the training code isn't open.
And it really isn't so surprising to have to go 'down' to PTX for such a low-level optimisation. For all the love AVX512 articles get here, I, for one, am glad some people talk about their PTX secret sauce.
I wish Intel didn't kill the maxas effort back then (buying the team out and ... what?) as they were going even lower down the stack.
To me, this feels the same as saying Sonic Colors Ultimate is open source because it was made with Godot. The engine is open source and making the game is left as an exercise to the user.
But you have all the assets of the actual finished game as well as the code used to run it, using your example. You don't get the game dev studio, i.e. datasets, expertise, and compute. Just because someone gives you all the source code and methods they used to make a game, doesn't mean anyone can just go and easily make a sequel, but it helps.
Very few entities publish the latter two items (https://huggingface.co/blog/smollm and https://allenai.org/olmo come to mind). Arguably, publishing curated large-scale pretraining data is very costly, but publishing code to automatically curate pretraining data from uncurated sources is already very valuable.
Also open-weights comes in several flavors -- there is "restricted" open-weights like Mistral's research license that prohibits most use cases (most importantly, commercial applications), then there are licenses like Llama's or DeepSeek's with some limitations, and then there are some Apache 2.0 or MIT licensed model weights.
Has it been established whether the weights can even be copyrighted? My impression has been that AI companies want to have their cake and eat it too: on the one hand they argue that the models are more like a database in a search engine, hence not violating the copyright of the data they were trained on, but on the other hand they argue the weights meet the threshold to be copyrightable in their own right.
So it seems to me that it's at least dubious whether those restricted licences can be enforced (that said, you likely need deep pockets to defend yourself from a lawsuit).
Then those should not be considered “open” in any real sense—when we say “open source,” we’re talking about the four freedoms (more or less—cf. the negligible difference between OSI and FSF definitions).
So when we apply the same principles to another category, such as weights, we should not call things “open” that don’t grant those same freedoms. In the case of this research license, Freedom 0 at least is not maintained. Therefore, the weights aren’t open, and to call them “open” would be to indeed dilute the meaning of open qua open source.
Wow. Your link is frustrating because I thought everything was under the MIT license. Why did people claim it is MIT licensed if they sneaked in this additional license?
I can't be 100% certain, but I think the good news is: no. There seem to be the exact same number of safetensor files for both, and AFAICT the file sizes are identical.
If I publish some C++ code that has some hard-coded magic values in it, can the code not be considered open source until I also publish how I came up with those magic values?
It depends on what those magic numbers are for. If they represent pure data, and it's obvious what the data is (perhaps a bitmap image), then sure, it's open source.
If the magic values are some kind of microcode or firmware, or something else that is executed in some way, then no, it is not really open source.
Even algorithms can be open source in spirit but closed source in practice. See ECDSA: the NSA has never revealed, in any verifiable way, how they came up with the specific curves used in the algorithm, so there is lingering suspicion that they were specifically chosen for some inherent (but hard-to-find) weakness.
I don't know a ton about AI, but I gather there are lots of areas in the process of producing a model where they can claim everything is "open source" as a marketing gimmick while, in reality, there is no explanation for how certain results were achieved. (Trade secrets, in other words.)
> If the magic values are some kind of microcode or firmware, or something else that is executed in some way, then no, it is not really open source.
To my understanding, the contents of a .safetensors file are purely numerical weights - used by the model defined in MIT-licensed code[0] and described in a technical report[1]. The weights are arguably only really "executed" to the same extent the kernel weights of a Gaussian blur filter would be, though there is a large difference in scale and effect.
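To make the analogy concrete, here is a minimal PyTorch sketch (the 3x3 binomial blur kernel is the usual textbook one, nothing DeepSeek-specific): the convolution routine is the code, and the kernel, like model weights, is just a tensor it consumes.

```python
import torch
import torch.nn.functional as F

# Fixed numerical weights: data, not code. Nothing here is "executed";
# the convolution routine is the program, the kernel is its input.
blur = torch.tensor([[1., 2., 1.],
                     [2., 4., 2.],
                     [1., 2., 1.]]) / 16.0
blur = blur.view(1, 1, 3, 3)      # (out_channels, in_channels, H, W)

image = torch.rand(1, 1, 64, 64)  # dummy grayscale image
blurred = F.conv2d(image, blur, padding=1)
```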
Code is data is code. Fundamentally, they are the same. We treat the two things as distinct categories only for practical convenience. Most of the time, it's pretty clear which is which, but we all regularly encounter situations in which the distinction gets blurry. For example:
- Windows MetaFiles (WMF, EMF, EMF+), still in use (mostly inside MS Office suite) - you'd think they're just another vector image format, i.e. clearly "data", but this one is basically a list of function calls to Windows GDI APIs, i.e. interpreted code.
- Any sufficiently complex XML or JSON config file ends up turning into an ad-hoc Lisp language, with ugly syntax and a parser that's a bug-ridden, slow implementation of a Lisp runtime. People don't realize that the moment they add conditionals and the ability to include or refer back to other parts of the config, they're more than halfway to a Turing-complete language (a toy sketch follows this list).
- From the POV of hardware, all native code is executed "to the same extent kernel weights of a Gaussian blur filter" are. In general, all code is just data for the runtime that executes it.
And so on.
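To make the config-becomes-a-language point concrete, a toy Python sketch; the config schema here is invented for illustration:

```python
import json

CONFIG = json.loads("""
{
  "log_level": {"if": {"env": "production"}, "then": "warn", "else": "debug"},
  "workers":   {"ref": "cpu_count"}
}
""")

def evaluate(node, context):
    """A tiny interpreter: once this function exists, the 'config' is a program."""
    if isinstance(node, dict) and "if" in node:
        cond = all(context.get(k) == v for k, v in node["if"].items())
        return evaluate(node["then"] if cond else node["else"], context)
    if isinstance(node, dict) and "ref" in node:
        return context[node["ref"]]  # a back-reference into the environment
    return node                      # plain data bottoms out here

context = {"env": "production", "cpu_count": 8}
settings = {key: evaluate(value, context) for key, value in CONFIG.items()}
# -> {'log_level': 'warn', 'workers': 8}
```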
Point being, what is code and what is data depends on practical reasons you have to make this distinction in the first place. IMHO, for OSS licensing, when considering the reasons those licenses exist, LLM weights are code.
if you publish only the binary it's not open source
if you open the source then it is open source
if you write a book/blog about how you came up with the ideas but didn't publish the source it's not open source, even if you publish the blog+binaries
I don't know if that compares to an AI model, where the most significant portions are the data preparation and training. The code DeepSeek released only demonstrates how to use the given weights for inferencing with Torch/Triton. I wouldn't consider that an open-source model, just wrapper code for publicly available weights.
I think a closer comparison would be Android and GApps, where if you remove the latter, most would deem the phone unusable.
The Open Source Definition is quite clear on its #2 requirement:
`The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed.`
https://opensource.org/osd
Arguably this would still apply to DeepSeek. While they didn’t release a way of recreating the weights, it is perfectly valid and common to modify the neural network using only what was released (when doing fine-tuning or RLHF, for example, previous training data is not required). Doing modifications based on the weights certainly seems like the preferred way of modifying the model to me.
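For instance, a weights-only modification like a LoRA fine-tune needs none of the original training data. A minimal sketch using the Hugging Face transformers and peft libraries; the model id and hyperparameters are placeholders, not anything DeepSeek published:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Load the released weights; no training data or training code required.
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-V3")  # placeholder id
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3")

# Attach small trainable adapters; the base weights stay frozen.
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a fraction of a percent of the full model
```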
Another note is that this may be the more ethical option. I’m sure the training data contained lots of copyrighted content, and if my content was in there I would prefer that it was released as opaque weights rather than published in a zip file for anyone to read for free.
It takes away the ability to know what it does, though, which is also often considered an important aspect. By not publishing details on how to train the model, there's no way to know if they have included intentional misbehavior in the training. If they'd provided everything needed to train your own model, you could make sure it's not there by choosing your own data and applying the same methodology.
IMO it should be considered freeware, and only partially open. It's like releasing an open source program with a part of it delivered as a binary.
It's not that they want to keep the training content secret, it's the fact that they stole the training content, and who they stole it from, that they want to keep secret.
It’s because prominent people with large followings are confusing the terms on purpose. Yann LeCun of Meta and Clem Delangue of Hugging Face constantly use the wrong terms for models that only release weights, and market them to their huge audiences as “open source”. This is a willful open washing campaign to benefit from the positivity that label generates.
I agree it would be nice to have the training specifics. Nevertheless, everything DeepSeek released is under the MIT license, right? So you can go set up a cloud LLM, fine-tune it, and do whatever else you wish with it, right? That is pretty significant, no?
Except the "binary" is not really opaque, and can be "edited" in exactly the same way it was produced in the first place (continued pre-training / fine-tuning).
Even with the training material, what good is it? The model isn’t reproducible, and even if it were, you’re not going to spend the money to verify the output.
Frontier models will never be reproducible in the freedom-loving countries that enforce intellectual property law, since they all depend on copyrighted content in their training data.
Why not? If we could get a version of ChatGPT that wasn't censored and would tell me how to make meth, or an uncensored version of DeepSeek that wanted to talk about tank man, you don't think the Internet would come together and make that happen?
> amazing to me that "open source" has been so diluted
It’s not and I called it [1].
We had three options: (A) Open weights (favoured by Altman et al); (B) Open training data (favoured by some FOSS advocates); and (C) Open weights and model, which doesn’t provide the training data, but would let you derive the weights if you had it.
OSI settled on (C) [2], but it did so late. FOSS argued for (B), but it’s impractical. So the world, for a while, had a choice between impractical (B) and the useful-if-flawed (A). The public, predictably, went with the pragmatic.
This was Betamax vs VHS, except in natural language. There is still hope for (C). But it relies on (A) being rendered impractical. Unfortunately, the path to that flows through institutionalising OpenAI et al’s TOS-based fair-use paradigm. Which means while we may get a definition (not exactly (B), but (A) absent use restrictions), we’ll also get restrictions on even using Chinese AI.
We absolutely had a choice (D), in that no one was forced to call it "open source" at all, which was arguably done to unfaithfully communicate benefits that don't exist. This is the part that riles people up, and that furthermore is causing collateral damage outside the AI bubble, and is nothing like Betamax vs. VHS.
If you want to prioritize pragmatism, the fact that every discussion of this includes a lengthy “so what open source do you mean, exactly?” subthread proves this was a poor choice. It causes uncertainty that also makes it harder for the folks releasing these models to make their case and be taken seriously for their approach.
We should probably call them "free to run", if the "it's cheap" connotation of "freeware" needs to be avoided. Or maybe "open architecture" to appreciate the Python file that utilizes the weights more.
> We absolutely had a choice (D), in that no one was forced to call it "open source" at all
Technically yes, practically no.
You’re describing a prisoner’s dilemma. The term was available, there was (and remains) genuine ambiguity over what it meant in this context, and there are first-mover advantages in branding. (Exhibit A: how we label charges).
> causing collateral damage outside the AI bubble, and is nothing like Betamax vs. VHS
Standards wars have collateral damage.
> We should probably call them "free to run", if the "it's cheap" connotation of "freeware" needs to be avoided. Or maybe "open architecture"
Language is parsimonious. A neologism will never win when a semantic shift will do.
> Language is parsimonious. A neologism will never win when a semantic shift will do.
Agreed, but I think it's worth lamenting the danger in that. History is certainly full of transitory calamity and harm when semantic shifts detach labels from reality.
I guess we're in any case in "damage is done" territory. The question is more about where to go next. It does appear that the term "open source" isn't working for what these folks are doing (you could even argue whether the "available" term they chose was a strong one to lean on in the first place), so we'll see what direction the next shift takes.
The source code is absolutely open, which is the traditional meaning of open source. You want to expand this to include data sets, which is fine, but that is the divergence.
No, no: the code for (pre-)training wasn't released either, and it is non-trivial to replicate. Releasing the weights without the dataset and training code is the equivalent of releasing a binary executable and calling it open source. Freeware would be more accurate terminology.
I think I see what you mean. I suppose it is kinda like an opaque binary, nevertheless, you can use it freely since all is under the MIT license right?
Yes, even for commercial purposes, which is great, but the point of "open source", and the reason it became popular, is that you can modify the underlying source code of the binary and recompile it with your modifications included (as well as sell/publish your modifications). You can't do that with DeepSeek or most other LLMs that claim to be open source. The point isn't that this makes it bad; the point is we shouldn't call it open source, because we shouldn't lose focus on the goal of a truly open source (or free software) LLM on the same level as ChatGPT/o1.
You can modify the weights, which is exactly what they do when training initially. You do not even need to do it in exactly the same fashion; you could change things such as the optimizer and it would still work. So in my opinion it is nothing like an opaque binary. It's just data.
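"Just data" is easy to demonstrate: a checkpoint is a dictionary of tensors you can manipulate with ordinary arithmetic, e.g. averaging two fine-tunes of the same architecture ("model souping"). A PyTorch sketch with invented file names:

```python
import torch

# Two checkpoints of the same architecture: plain dictionaries of tensors.
a = torch.load("finetune_a.pt")  # invented file names
b = torch.load("finetune_b.pt")

# Elementwise average of every weight tensor; nothing is "executed" here.
soup = {name: (a[name] + b[name]) / 2 for name in a}
torch.save(soup, "averaged.pt")
```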
We have the weights and the code for inference, in the analogy this is an executable binary. We are missing the code and data for training, that's the "source code".
Then it’s never distributable, and any definition of open source requiring it to be is DOA. It’s interesting as an argument against copyright. But that’s academic.
it's not academic. Why can't ChatGPT tell me how to make meth? Why doesn't DeepSeek want to talk about Tiananmen Square? In what other ways has the model been molded to behave? Without the full source, we don't know.
While I appreciate the argument that the term "open source" is problematic in the context of AI models, I think saying the training data is the "source code" is even worse, because it broadens the definition to be almost meaningless. We never considered data to be source code, and realistically, for 99.9999% of users the training data is not the preferred way of modifying the model: not only do they lack the millions of dollars to retrain the full model, they likely don't even have the HDD space to store the training data.
Also, I would say arguing that the model weights are just the "binary" is disingenuous, because nobody wants releases that only contain the training data and training scripts without the model weights. Such releases would be perfectly fine for open source software if we argue that the weights are just the binaries, yet they would be useless to almost everyone, since almost nobody has the resources to train the model.
When you can use LLMs to write code in English (or another language), it's pretty disingenuous not to call the training data source code just because it's not exclusively written in a programming language like Python or C++.