It's amazing to me that "open source" has been so diluted that it is now used to mean "we will give you an opaque binary and permission to run it on your own computer."
Yes, training is left as an exercise for the user, but it's outlined in the paper, and a good ML engineer should be able to get started with it; cluster of GPUs not included.
There was an article saying they used hand-tuned PTX instead of CUDA, so it might be a bit hard to match just from the paper without some good performance experts.
CUDA isn't so bad that hand-writing PTX will give you a huge performance improvement, but when you're spending a few million dollars on training, it makes sense to chase even a single-digit percentage improvement, maybe more in a very hot code path. Also, these articles are based on a single mention of PTX in the paper.
"3.2.2. Efficient Implementation of Cross-Node All-to-All Communication
In order to ensure sufficient computational performance for DualPipe, we customize efficient
cross-node all-to-all communication kernels (including dispatching and combining) to conserve
the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific,
in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications
are handled via NVLink. NVLink offers a bandwidth of 160 GB/s, roughly 3.2 times that of IB
(50 GB/s). To effectively leverage the different bandwidths of IB and NVLink, we limit each
token to be dispatched to at most 4 nodes, thereby reducing IB traffic. For each token, when its
routing decision is made, it will first be transmitted via IB to the GPUs with the same in-node
index on its target nodes. Once it reaches the target nodes, we will endeavor to ensure that it is
instantaneously forwarded via NVLink to specific GPUs that host their target experts, without
being blocked by subsequently arriving tokens. In this way, communications via IB and NVLink
are fully overlapped, and each token can efficiently select an average of 3.2 experts per node
without incurring additional overhead from NVLink. This implies that, although DeepSeek-V3
selects only 8 routed experts in practice, it can scale up this number to a maximum of 13 experts
(4 nodes × 3.2 experts/node) while preserving the same communication cost. Overall, under
such a communication strategy, only 20 SMs are sufficient to fully utilize the bandwidths of IB
and NVLink.
In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition
20 SMs into 10 communication channels. During the dispatching process, (1) IB sending, (2)
IB-to-NVLink forwarding, and (3) NVLink receiving are handled by respective warps. The
number of warps allocated to each communication task is dynamically adjusted according to the
actual workload across all SMs. Similarly, during the combining process, (1) NVLink sending,
(2) NVLink-to-IB forwarding and accumulation, and (3) IB receiving and accumulation are also
handled by dynamically adjusted warps. In addition, both dispatching and combining kernels
overlap with the computation stream, so we also consider their impact on other SM computation
kernels. Specifically, we employ customized PTX (Parallel Thread Execution) instructions and
auto-tune the communication chunk size, which significantly reduces the use of the L2 cache
and the interference to other SMs."
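The dispatch scheme in that excerpt is easier to see as pseudocode. Here is a rough Python sketch of the routing logic; the cluster layout and every name in it are my own assumptions, not DeepSeek's actual implementation:

```python
from collections import defaultdict

GPUS_PER_NODE = 8        # assumed layout; the excerpt doesn't state this
MAX_NODES_PER_TOKEN = 4  # the paper's cap on IB fan-out per token

def plan_dispatch(src_gpu, expert_ids, expert_to_gpu):
    """Two-hop dispatch: one IB transfer per target node, then NVLink
    forwarding to the specific GPUs hosting the target experts."""
    local_idx = src_gpu % GPUS_PER_NODE   # "same in-node index" on target nodes
    per_node = defaultdict(list)
    for e in expert_ids:
        gpu = expert_to_gpu[e]            # global rank of the GPU hosting expert e
        per_node[gpu // GPUS_PER_NODE].append(gpu)

    # The MoE gating algorithm is co-designed to respect this cap.
    assert len(per_node) <= MAX_NODES_PER_TOKEN

    return [
        {"ib_target": node * GPUS_PER_NODE + local_idx,  # one slow IB hop per node
         "nvlink_targets": gpus}                         # fast intra-node fan-out
        for node, gpus in per_node.items()
    ]
```

The point of the node cap is visible here: however many experts a token is routed to, it crosses the slower IB link at most four times, and all remaining fan-out rides NVLink.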
It's definitely not the full model written in PTX or anything, but still some significant engineering effort to replicate, from people commanding 7-figure salaries in this wave, since the training code isn't open.
And it really isn't so surprising to have to go 'down' to PTX for such a low-level optimisation. For all the love AVX512 articles get here, I, for one, am glad some people talk about their PTX secret sauce.
I wish Intel didn't kill the maxas effort back then (buying the team out and ... what?) as they were going even lower down the stack.
To me, this feels the same as saying Sonic Colors Ultimate is open source because it was made with Godot. The engine is open source and making the game is left as an exercise to the user.
But you have all the assets of the actual finished game as well as the code used to run it, using your example. You don't get the game dev studio, i.e. datasets, expertise, and compute. Just because someone gives you all the source code and methods they used to make a game, doesn't mean anyone can just go and easily make a sequel, but it helps.
Very few entities publish the latter two items (https://huggingface.co/blog/smollm and https://allenai.org/olmo come to mind). Arguably, publishing curated large-scale pretraining data is very costly, but publishing code to automatically curate pretraining data from uncurated sources is already very valuable.
Also open-weights comes in several flavors -- there is "restricted" open-weights like Mistral's research license that prohibits most use cases (most importantly, commercial applications), then there are licenses like Llama's or DeepSeek's with some limitations, and then there are some Apache 2.0 or MIT licensed model weights.
Has it been established whether the weights can even be copyrighted? My impression has been that AI companies want to have their cake and eat it too: on the one hand they argue that the models are more like a database in a search engine, hence not violating the copyright of the data they were trained on, but on the other hand they argue the weights meet the threshold to be copyrightable in their own right.
So it seems to me that it's at least dubious whether those restricted licences can be enforced (that said, you likely need deep pockets to defend yourself from a lawsuit).
Then those should not be considered “open” in any real sense—when we say “open source,” we’re talking about the four freedoms (more or less—cf. the negligible difference between OSI and FSF definitions).
So when we apply the same principles to another category, such as weights, we should not call things “open” that don’t grant those same freedoms. In the case of this research license, Freedom 0 at least is not maintained. Therefore, the weights aren’t open, and to call them “open” would be to indeed dilute the meaning of open qua open source.
Wow. Your link is frustrating because I thought everything was under the MIT license. Why did people claim it is MIT licensed if they sneaked in this additional license?
I can't be 100% certain, but I think the good news is: no. There seem to be the exact same number of safetensor files for both, and AFAICT the file sizes are identical.
If I publish some C++ code that has some hard-coded magic values in it, can the code not be considered open source until I also publish how I came up with those magic values?
It depends on what those magic numbers are for. If they represent pure data, and it's obvious what the data is (perhaps a bitmap image), then sure, it's open source.
If the magic values are some kind of microcode or firmware, or something else that is executed in some way, then no, it is not really open source.
Even algorithms can be open source in spirit but closed source in practice. See ECDSA: the NSA has never revealed, in any verifiable way, how they came up with the specific curves used in the algorithm, so there is lingering suspicion that they were specifically chosen for some inherent (but hard-to-find) weakness.
I don't know a ton about AI, but I gather there are lots of areas in the process of producing a model where they can claim everything is "open source" as a marketing gimmick while, in reality, there is no explanation for how certain results were achieved. (Trade secrets, in other words.)
> If the magic values are some kind of microcode or firmware, or something else that is executed in some way, then no, it is not really open source.
To my understanding, the contents of a .safetensors file are purely numerical weights - used by the model defined in MIT-licensed code[0] and described in a technical report[1]. The weights are arguably only really "executed" to the same extent the kernel weights of a Gaussian blur filter would be, though there is a large difference in scale and effect.
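To make the analogy concrete, here is a minimal PyTorch sketch (the 3x3 binomial blur kernel is the usual textbook one, nothing DeepSeek-specific): the convolution routine is the code, and the kernel, like model weights, is just a tensor it consumes.

```python
import torch
import torch.nn.functional as F

# Fixed numerical weights: data, not code. Nothing here is "executed";
# the convolution routine is the program, the kernel is its input.
blur = torch.tensor([[1., 2., 1.],
                     [2., 4., 2.],
                     [1., 2., 1.]]) / 16.0
blur = blur.view(1, 1, 3, 3)      # (out_channels, in_channels, H, W)

image = torch.rand(1, 1, 64, 64)  # dummy grayscale image
blurred = F.conv2d(image, blur, padding=1)
```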
Code is data is code. Fundamentally, they are the same. We treat the two things as distinct categories only for practical convenience. Most of the time, it's pretty clear which is which, but we all regularly encounter situations in which the distinction gets blurry. For example:
- Windows MetaFiles (WMF, EMF, EMF+), still in use (mostly inside MS Office suite) - you'd think they're just another vector image format, i.e. clearly "data", but this one is basically a list of function calls to Windows GDI APIs, i.e. interpreted code.
- Any sufficiently complex XML or JSON config file ends up turning into an ad-hoc Lisp language, with ugly syntax and a parser that's a bug-ridden, slow implementation of a Lisp runtime. People don't realize that the moment they add conditionals and the ability to include or refer back to other parts of the config, they're more than halfway to a Turing-complete language (a toy sketch follows this list).
- From the POV of hardware, all native code is executed "to the same extent kernel weights of a Gaussian blur filter" are. In general, all code is just data for the runtime that executes it.
And so on.
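To make the config-becomes-a-language point concrete, a toy Python sketch; the config schema here is invented for illustration:

```python
import json

CONFIG = json.loads("""
{
  "log_level": {"if": {"env": "production"}, "then": "warn", "else": "debug"},
  "workers":   {"ref": "cpu_count"}
}
""")

def evaluate(node, context):
    """A tiny interpreter: once this function exists, the 'config' is a program."""
    if isinstance(node, dict) and "if" in node:
        cond = all(context.get(k) == v for k, v in node["if"].items())
        return evaluate(node["then"] if cond else node["else"], context)
    if isinstance(node, dict) and "ref" in node:
        return context[node["ref"]]  # a back-reference into the environment
    return node                      # plain data bottoms out here

context = {"env": "production", "cpu_count": 8}
settings = {key: evaluate(value, context) for key, value in CONFIG.items()}
# -> {'log_level': 'warn', 'workers': 8}
```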
Point being, what is code and what is data depends on practical reasons you have to make this distinction in the first place. IMHO, for OSS licensing, when considering the reasons those licenses exist, LLM weights are code.
if you publish only the binary it's not open source
if you open the source then it is open source
if you write a book/blog about how you came up with the ideas but didn't publish the source it's not open source, even if you publish the blog+binaries
I don't know if that compares to an AI model, where the most significant portions are the data preparation and training. The code DeepSeek released only demonstrates how to use the given weights for inferencing with Torch/Triton. I wouldn't consider that an open-source model, just wrapper code for publicly available weights.
I think a closer comparison would be Android and GApps, where if you remove the latter, most would deem the phone unusable.
The Open Source Definition is quite clear on its #2 requirement:
`The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed.`
https://opensource.org/osd
Arguably this would still apply to DeepSeek. While they didn’t release a way of recreating the weights, it is perfectly valid and common to modify the neural network using only what was released (when doing fine-tuning or RLHF, for example, previous training data is not required). Doing modifications based on the weights certainly seems like the preferred way of modifying the model to me.
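For instance, a weights-only modification like a LoRA fine-tune needs none of the original training data. A minimal sketch using the Hugging Face transformers and peft libraries; the model id and hyperparameters are placeholders, not anything DeepSeek published:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Load the released weights; no training data or training code required.
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-V3")  # placeholder id
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3")

# Attach small trainable adapters; the base weights stay frozen.
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a fraction of a percent of the full model
```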
Another note is that this may be the more ethical option. I’m sure the training data contained lots of copyrighted content, and if my content was in there I would prefer that it was released as opaque weights rather than published in a zip file for anyone to read for free.
It takes away the ability to know what it does, though, which is also often considered an important aspect. By not publishing details on how to train the model, there's no way to know if they have included intentional misbehavior in the training. If they'd provided everything needed to train your own model, you could make sure it's not there by choosing your own data and applying the same methodology.
IMO it should be considered freeware, and only partially open. It's like releasing an open source program with a part of it delivered as a binary.
It's not that they want to keep the training content secret, it's the fact that they stole the training content, and who they stole it from, that they want to keep secret.
It’s because prominent people with large followings are confusing the terms on purpose. Yann LeCun of Meta and Clem Delangue of Hugging Face constantly use the wrong terms for models that only release weights, and market them to their huge audiences as “open source”. This is a willful open washing campaign to benefit from the positivity that label generates.
I agree it would be nice to have the training specifics. Nevertheless, everything DeepSeek released is under the MIT license, right? So you can go set up a cloud LLM, fine-tune it, and do whatever else you wish with it, right? That is pretty significant, no?
Except the "binary" is not really opaque, and can be "edited" in exactly the same way it was produced in the first place (continued pre-training / fine-tuning).
Even with the training material, what good is it? The model isn’t reproducible, and even if it were, you’re not going to spend the money to verify the output.
Frontier models will never be reproducible in the freedom-loving countries that enforce intellectual property law, since they all depend on copyrighted content in their training data.
Why not? If we could get a version of ChatGPT that wasn't censored and would tell me how to make meth, or an uncensored version of DeepSeek that wanted to talk about tank man, you don't think the Internet would come together and make that happen?
> amazing to me that "open source" has been so diluted
It’s not and I called it [1].
We had three options: (A) Open weights (favoured by Altman et al); (B) Open training data (favoured by some FOSS advocates); and (C) Open weights and model, which doesn’t provide the training data, but would let you derive the weights if you had it.
OSI settled on (C) [2], but it did so late. FOSS argued for (B), but it’s impractical. So the world, for a while, had a choice between impractical (B) and the useful-if-flawed (A). The public, predictably, went with the pragmatic.
This was Betamax vs VHS, except in natural language. There is still hope for (C). But it relies on (A) being rendered impractical. Unfortunately, the path to that flows through institutionalising OpenAI et al’s TOS-based fair-use paradigm. Which means while we may get a definition (not exactly (B), but (A) absent use restrictions), we’ll also get restrictions on even using Chinese AI.
We absolutely had a choice (D), in that no one was forced to call it "open source" at all, which was arguably done to unfaithfully communicate benefits that don't exist. This is the part that riles people up, and that furthermore is causing collateral damage outside the AI bubble, and is nothing like Betamax vs. VHS.
If you want to prioritize pragmatism, the fact that every discussion of this includes a lengthy “so what open source do you mean, exactly?” subthread proves this was a poor choice. It causes uncertainty that also makes it harder for the folks releasing these models to make their case and be taken seriously for their approach.
We should probably call them "free to run", if the "it's cheap" connotation of "freeware" needs to be avoided. Or maybe "open architecture" to appreciate the Python file that utilizes the weights more.
> We absolutely had a choice (D), in that no one was forced to call it "open source" at all
Technically yes, practically no.
You’re describing a prisoner’s dilemma. The term was available, there was (and remains) genuine ambiguity over what it meant in this context, and there are first-mover advantages in branding. (Exhibit A: how we label charges).
> causing collateral damage outside the AI bubble, and is nothing like Betamax vs. VHS
Standards wars have collateral damage.
> We should probably call them "free to run", if the "it's cheap" connotation of "freeware" needs to be avoided. Or maybe "open architecture"
Language is parsimonious. A neologism will never win when a semantic shift will do.
> Language is parsimonious. A neologism will never win when a semantic shift will do.
Agreed, but I think it's worth lamenting the danger in that. History is certainly full of transitory calamity and harm when semantic shifts detach labels from reality.
I guess we're in any case in "damage is done" territory. The question is more about where to go next. It does appear that the term "open source" isn't working for what these folks are doing (you could even argue whether the "available" term they chose was a strong one to lean on in the first place), so we'll see what direction the next shift takes.
The source code is absolutely open, which is the traditional meaning of open source. You want to expand this to include data sets, which is fine, but that is the divergence.
No, no: the code for (pre-)training wasn't released either, and it is non-trivial to replicate. Releasing the weights without the dataset and training code is the equivalent of releasing a binary executable and calling it open source. Freeware would be more accurate terminology.
I think I see what you mean. I suppose it is kinda like an opaque binary, nevertheless, you can use it freely since all is under the MIT license right?
Yes, even for commercial purposes, which is great, but the point of "open source", and the reason it became popular, is that you can modify the underlying source code of the binary and recompile it with your modifications included (as well as sell/publish your modifications). You can't do that with DeepSeek or most other LLMs that claim to be open source. The point isn't that this makes it bad; the point is we shouldn't call it open source, because we shouldn't lose focus on the goal of a truly open source (or free software) LLM on the same level as ChatGPT/o1.
You can modify the weights, which is exactly what they do when training initially. You do not even need to do it in exactly the same fashion; you could change things such as the optimizer and it would still work. So in my opinion it is nothing like an opaque binary. It's just data.
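"Just data" is easy to demonstrate: a checkpoint is a dictionary of tensors you can manipulate with ordinary arithmetic, e.g. averaging two fine-tunes of the same architecture ("model souping"). A PyTorch sketch with invented file names:

```python
import torch

# Two checkpoints of the same architecture: plain dictionaries of tensors.
a = torch.load("finetune_a.pt")  # invented file names
b = torch.load("finetune_b.pt")

# Elementwise average of every weight tensor; nothing is "executed" here.
soup = {name: (a[name] + b[name]) / 2 for name in a}
torch.save(soup, "averaged.pt")
```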
We have the weights and the code for inference, in the analogy this is an executable binary. We are missing the code and data for training, that's the "source code".
Then it’s never distributable, and any definition of open source requiring it to be is DOA. It’s interesting as an argument against copyright. But that’s academic.
it's not academic. Why can't ChatGPT tell me how to make meth? Why doesn't DeepSeek want to talk about Tiananmen Square? In what other ways has the model been molded to behave? Without the full source, we don't know.
While I appreciate the argument that the term "open source" is problematic in the context of AI models, I think saying the training data is the "source code" is even worse, because it broadens the definition to be almost meaningless. We never considered data to be source code, and realistically, for 99.9999% of users the training data is not the preferred way of modifying the model: not only do they lack the millions of dollars to retrain the full model, they likely don't even have the HDD space to store the training data.
Also, I would say arguing that the model weights are just the "binary" is disingenuous, because nobody wants releases that only contain the training data and training scripts without the model weights. Such releases would be perfectly fine for open source software if we argue that the weights are just the binaries, yet they would be useless to almost everyone, since almost nobody has the resources to train the model.
When you can use LLMs to write code in English (or another language), it's pretty disingenuous not to call the training data source code just because it's not exclusively written in a programming language like Python or C++.