The question is whether I'm provided the build source that constructed these files. Mistral didn't hand-edit these files into existence; there's source out there that built them.
Like, come on, a 14GB dump of mainly numbers that was constructed algorithmically is not "source".
> Like, come on, a 14GB dump of mainly numbers that was constructed algorithmically is not "source".
So if I take a photo of a pretty sunset, release it under MIT license, you'd say it's "not open source" unless I give you the sun and the atmosphere themselves?
These models are perfectly valid things in their own right; they can be fine-tuned or used as parts of other things.
For most of these LLMs (not sure about this one in particular yet), the energy cost of recreation alone is more than most individuals earn in a lifetime, and the data volume is so enormous that the only people who seriously need it are copyright lawyers, and they should be asking for it to be delivered by station wagon.
I said "constructed algorithmically". I.e., I expect source to be at the level the engineers who built it generally worked at.
It's very nice that they released their build artifacts. It's great that you can take that and make small modifications to it. That doesn't make it open source.
> For most of these LLMs (not sure about this one in particular yet), the energy cost of recreation alone is more than most individuals earn in a lifetime, and the data volume is so enormous that the only people who seriously need it are copyright lawyers, and they should be asking for it to be delivered by station wagon.
All of that just sounds like reasons why it's not practical to open source it, not reasons why this release was open source.
> I said "constructed algorithmically". I.e., I expect source to be at the level the engineers who built it generally worked at.
I could either point out that JPEG is an algorithm, or ask if you can recreate a sunset.
> All of that just sounds like reasons why it's not practical to open source it
No, they're reasons why the stuff you want doesn't matter.
If you can actually afford to create a model of your own, you don't need to ask: the entire internet is right there. Some of it even has explicitly friendly licensing terms.
An LLM with a friendly license is something you can freely integrate into other things which need friendly licensing. That's valuable all by itself.
You could train it from scratch on The Pile dataset[1] with a few hundred thousand bucks worth of GPU quota. It's not rocket science - the architecture is, and that's open source by your definition.
The graph of layers and ops isn't open source by my definition. It can be extracted from the model, but so can control-flow graphs out of any binary. That's how higher-end disassemblers like IDA and Ghidra work.
Once again, this pickle file is not what's sitting in Mistral's engineers' editors as they go about their day.
Well the checkpoint __is__ the computational graph. The graph is also all the code. But if you want it in python... that's here[0].
Please be clear, we keep asking. What are you asking for? Datasets? Training algo? What?
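To make the "checkpoint is the graph" point concrete, here's a toy sketch (a pure-Python stand-in, not Mistral's actual file format; real checkpoints hold tensors and are loaded with `torch.load`): a checkpoint is a serialized mapping from parameter names to weights, and the key names encode the layer/op structure.

```python
import io
import pickle

# Toy "checkpoint": parameter names mapped to weights (nested lists standing
# in for tensors). Real checkpoints store torch tensors, but the idea is the
# same: the key names encode the model's layer/op structure.
checkpoint = {
    "layers.0.attention.wq.weight": [[0.1, 0.2], [0.3, 0.4]],
    "layers.0.attention.wk.weight": [[0.5, 0.6], [0.7, 0.8]],
    "layers.0.feed_forward.w1.weight": [[0.9, 1.0], [1.1, 1.2]],
}

# Serialize and reload, roughly what torch.save/torch.load do under the hood.
buf = io.BytesIO()
pickle.dump(checkpoint, buf)
buf.seek(0)
restored = pickle.load(buf)

# Walking the keys recovers the graph structure without any extra code.
for name, weight in restored.items():
    print(name, "->", len(weight), "x", len(weight[0]))
```

Whether recovering structure this way counts as having "source", or is more like a disassembler recovering structure from a binary, is exactly the crux of the disagreement in this thread.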
Comparing it to software artifacts isn't a good comparison, when what's being given is equivalent to any program with open source code (visible or free to use). You have everything you need to use, edit, and fuck around with it. You don't have the exact scheme, but I'll put it this way: if you gave me the hardware, I could produce a high-quality LLM from scratch using their architecture.
That doesn't conflict with anything I've said. Yes, the checkpoint is code. It's not source code.
It's not what Mistral's engineers edit to create this release. Just like an ELF file necessarily contains the code-flow graph, in a way extractable by experts, but isn't open source because... it's not source.
The permissiveness of the license with regards to use isn’t the crux of the argument.
The open source family of licenses are about freedom. If I’m not given the tools to recreate a model, then I’m not afforded the freedoms normally associated with these open licenses. Really there’s little difference between Apache and CC-BY here.
> So if I take a photo of a pretty sunset, release it under MIT license, you'd say it's "not open source" unless I give you the sun and the atmosphere themselves?
You've gotta give me the stuff you used to make it, the stuff you'd want to have if you wanted to recreate a slightly different version of the photo ("in the preferred form for making modifications", as the GPL says). If you just snapped a photo of whatever you saw with whatever camera was in your pocket, then there's nothing else to publish. But if you figured out a timetable of when you should stand where with what kind of lens, then making your photo open-source would mean publishing that timetable.
> These models are perfectly valid things in their own right; they can be fine-tuned or used as parts of other things.
If the original creator can edit them, and you can't, then that's not open-source; fine-tuning is a help but someone who can only fine-tune is still a second-class user compared to the original developer. The whole point of open source is to put you on an equal footing with the original developer (in particular, to make sure that you can fix bugs by yourself and are never stuck waiting for them to release an update that you need).
> So if I take a photo of a pretty sunset, release it under MIT license, you'd say it's "not open source" unless I give you the sun and the atmosphere themselves?
Photographs are not source code, are not computer code at all, and do not have a close analog to source code. Calling them “open source” is, at best, a poor and distant metaphor in any case, and so they aren't a useful model for discussing what open source means for software.
There's a very reasonable argument that model weights are an IL or object-code like artifact with training data and training source code together as the source code.
That doesn't change that the MIT license is an open source license. But when what you release under it isn't the whole source applicable to the model (inference and maybe training code, but not the data needed to produce the weights) plus the final weights, then it is fair to question whether the model as a whole is open source.
> So if I take a photo of a pretty sunset, release it under MIT license, you'd say it's "not open source" unless I give you the sun and the atmosphere themselves?
Open source as a concept doesn't really apply to quite a lot of things. Your MIT-licensed photograph is "not open source" in the same way that `{} * {}` is "not a number" (it technically isn't, but that's not quite what NaN is supposed to mean).
> a 14GB dump of mainly numbers that was constructed algorithmically is not "source".
I'm sorry, but what do you expect? Literally all code is "a bunch of numbers" when you get down to it. Realistically we're just talking about if the code/data is 1) able to be read through common tools and common formats and 2) can we edit, explore, and investigate it. The answer to both these questions is yes. Any parametric mathematical model is defined by its weights as well as its computational graph. They certainly provide both of these.
What are we missing? The only thing missing here is the training data. That means, of course, that you could not reproduce the results even if you also had the tens of thousands to millions of dollars to do so. If you're complaining about that, then I agree, but that is very different from what you've said above. They needn't provide the dataset itself, but they should at least be telling us what they used and how they used it. I would agree that it's not fully "open source" when the datasets are unknown and/or unavailable (for all intents and purposes, identical). The "recipe" is missing, yes, but this is very different from what you're saying. So if there's miscommunication, then let's communicate better instead of getting upset at one another. Because 14G of a bunch of algorithmically constructed numbers and a few text files is definitely all you need to use, edit, and/or modify the work.
Edit: I should also add that they don't provide any training details. This model is __difficult__ to reproduce. Not impossible, but it would definitely be difficult (within some epsilon, because models are not trained deterministically, so training something the same way twice usually ends up with different results).
> I'm sorry, but what do you expect? Literally all code is "a bunch of numbers" when you get down to it. Realistically we're just talking about if the code/data is 1) able to be read through common tools and common formats and 2) can we edit, explore, and investigate it. The answer to both these questions is yes. Any parametric mathematical model is defined by its weights as well as its computational graph. They certainly provide both of these.
I expect that if you call a release "open source", it's, you know, source. That their engineers used to build the release. What Mistral's engineers edit and collate as their day job.
> The "recipe" is missing, yes, but this is very different from what you're saying.
The "recipe" is what we generally call source.
> So if there's miscommunication then let's communicate better instead of getting upset at one another.
Who's getting upset here? I'm simply calling for not diluting a term. A free, permissive, binary release is great. It's just not open source.
> Because 14G of a bunch of algorithmically constructed numbers and a few text files is definitely all you need to use, edit, and/or modify the work.
Just like my Windows install ISO from when they were giving Windows licenses away for free.
Not really. At least not in normal software. The recipe is honestly only really interesting to researchers (like me). But for building and production stuff, you have everything you need.
> Just like my Windows install ISO from when they were giving Windows licenses away for free.
I repeat, a free windows ISO doesn't have an Apache license attached. This is an inane comparison.
Yes, in normal software, the 'recipe' used to create the build artifacts is the source.
> The recipe is honestly only really interesting to researchers (like me). But for building and production stuff, you have everything you need.
A lot of excuses about why it's not useful to a lot of people, and why you don't actually need it to use it in production, is exactly the argument made for why not open-sourcing is OK. That doesn't mean this is a source release, open or otherwise.
> I repeat, a free windows ISO doesn't have an Apache license attached. This is an inane comparison.
Even if the Windows binary iso was released with an Apache license, it wouldn't be open source since no actual source was released. That's the point of that line of argument.
Like, if someone only gave me binaries of Apache and said it's open source, you wouldn't agree with them, because you can't practically modify it. Open source for a model would be the exact same: you would need to be able to do the same build process. A model is literally the same as an EXE or ELF binary. It's a "probabilistic" program, but it's still a program.
This is not a novel discussion, and you are not being smart trying to nihilism your way out of it, just obtuse. Here is what the GPL has said about source for some 30+ years:
> Source code for a work means the preferred form of the work for making modifications to it.
The whole point of machine learning is deriving an algorithm from data. This is the algorithm they derived. It's open source. You can use it or change it. Having the data that was used to derive it is not relevant.
But the source to train your own LLM equivalent is also released (minus the data), hence why there are so many variants of LLaMa. You also can’t fine-tune it without the original model structure. The weights give the community a starting point so they don’t need literally millions of dollars’ worth of compute power to get to the same step.
> Would Mistral's engineers be satisfied with the release if they had to rebuild from scratch?
Yeah, probably. But it depends on what you're asking. By the exact same method, getting the exact same results down to epsilon error? (Again, ML models are not trained deterministically.) Probably not. This can honestly change even with a different version of PyTorch, but yes, knowing the HPs would help get closer.
But to train another 7B model of the __exact__ same architecture? Yeah, they've definitely provided all you need for that. You can take this model and train it from scratch on any data you want, in any way you want.
I didn't ask if they'd be able to make do. I asked if they'd be satisfied.
Also, wrt
> Again, ML models are not deterministic
ML models are absolutely deterministic if you have the discipline to make them so (which is necessary in higher-scale ML work, where hardware is stochastically flaky).
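The reproducibility point can be shown with a toy stand-in for a training loop (pure Python here; in real PyTorch work you'd reach for `torch.manual_seed` and `torch.use_deterministic_algorithms(True)`): seed everything, and two runs produce bit-identical "weights".

```python
import random

def toy_training_run(seed, steps=1000):
    """Stand-in for a training loop: with the RNG seeded,
    the resulting 'weights' are bit-for-bit reproducible."""
    rng = random.Random(seed)
    weights = [0.0] * 8
    for _ in range(steps):
        i = rng.randrange(len(weights))
        weights[i] += rng.uniform(-0.01, 0.01)  # stand-in for a gradient step
    return weights

run_a = toy_training_run(seed=42)
run_b = toy_training_run(seed=42)
run_c = toy_training_run(seed=43)

print("same seed gives identical weights:", run_a == run_b)
print("different seed gives identical weights:", run_a == run_c)
```

In a real setup the flaky parts are data-loading order, nondeterministic GPU kernels, and worker threads, which is exactly where the discipline comes in.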
But they built a llama equivalent + some enhancements that gives better performance…I’m not sure if this would be possible at all without Meta releasing all the required code and paper for LLaMa to begin with.
"Open source" normally contains what the engineers working on it edit to submit a job to build into the build artifacts. This does not contain that, but instead the result of the build process.
That’s because you’re failing to differentiate between a build process, meaning compiling your stuff to an end result, and fine-tuning an algorithm to refine code.
I regard the distinction as meaningless and haven't heard a good reason as why I should reconsider when the process behind creating the weights is so integral to the overall engineering process here.
Well fortunately that’s just your opinion and it is neither popular nor relevant.
To quote Stallman:
> The "source code" for a work means the preferred form of the work for making modifications to it
That’s what this is. This is not a work that produces weights. So the code related to making weights would not be the source code of this project. It would be the source code of the tools used to make this project.
That quote specifically supports my position, hence why I've quoted it elsewhere in this thread.
This is evidenced by the fact that if Mistral's engineers had to make modifications to this model, they would use other code to do so. This model is not the "preferred form of the work for making modifications to it", but simply the best we have in a world that isn't used to open-sourcing these models.
> This is evidenced by the fact that if Mistral's engineers had to make modifications to this model, they would use other code to do so.
It’s a PyTorch model. There is no secret internal code required to interact with it. You can write whatever you want. Unless you’re implying that in order to call this model open source they need to get all of PyTorch on the same license. You can write whatever you want to edit the weights.
Open source does not require babysitting people through the basic competency steps of understanding how the tools of this domain work.
This model is 100% the preferred form of the work for making modifications to it, as evidenced by the large community oriented around sharing and adapting models in exactly this format.
And no, I’m calling you out for changing your argument again. You were saying dumb shit about how they needed to open source all the data needed to train this thing from scratch, which is a wildly different argument.
I could kind of see things either way. Is this like not providing the source code, or is it like not providing the IDE, debugger, compiler, and linter that was used to write the source code? (Also, it feels a bit "looking a gift horse in the mouth" to criticize people who are giving away a cutting-edge model that can be used freely.)
> I could kind of see things either way. Is this like not providing the source code, or is it like not providing the IDE, debugger, compiler, and linter that was used to write the source code?
Do the engineers that made this hand edit this file? Or did they have other source that they used and this is the build product?
> (Also, it feels a bit "looking a gift horse in the mouth" to criticize people who are giving away a cutting-edge model that can be used freely.)
Windows was free for a year. Did that make it open source?
> Do the engineers that made this hand edit this file? Or did they have other source that they used and this is the build product?
Does any open source product provide all the tools used to make the software? I haven't seen the Linux kernel included in any other open source product, and that'd quite frankly be insane. Same for including vim/emacs, gcc, gdb, X11, etc.
But I do agree that training data is more important than those things. But you need to be clear about that because people aren't understanding what you're getting at. Don't get mad, refine your communication.
> Windows was free for a year. Did that make it open source?
Windows didn't have an Apache-2.0 license attached to it. This license makes this version of the code perpetually open source. They can change the license later, but it will not apply retroactively to previous versions. Sorry, but this is just a terrible comparison. Free isn't what makes a thing "open source." Which, let's be clear, is a fuzzy definition too.
What I'm asking for is pretty clear. The snapshot of code and data the engineers have checked into their repos (including data repositories) that were processed into this binary release.
> This license makes this version of the code perpetually open source.
It doesn't because they didn't release the source.
There's nothing stopping me from attaching an Apache 2 license to a shared library I never give the source out to. That also would not be an open source release. There has to be actual source involved.
> You’re welcome to fuck around and find out. Go release llama2 under Apache 2. You’re saying that’s fine right?
You're missing my point. Obviously you can't release someone else's IP under whatever license you see fit.
You can release your own binary under Apache 2. Doing so without releasing the source doesn't make it open source despite being an open source license.
> The answer to your question is that code stored as a binary is not different from code stored as text. Pickled models are code.
I'm not saying it's not code; I'm saying it's not source.
The data used to derive this model is not different from the brain and worldly observations and learnings of the engineers, which are not part of any open source materials.
What are you talking about, "the brain" of the engineers? This is bonkers. Monocasa is being excruciatingly patient with you all, but the fact is this was generated with tools and is not a source release; it's a final product, a compiled or generated release.
Code generated with tools is still code. This code is the source. The output of the code is the output. Monocasa is failing to understand, or perhaps intentionally not understanding, the difference.

In some contexts a “compiled release” implies an output that is largely immutable for practical purposes. That is not what this is. It’s technically a binary object, but it’s a binary object you can easily unpack to get executable code that you can read and edit. It is a convenient format, different from classical text code. The fact that it’s a binary is completely irrelevant. It’s akin to arguing that code provided in a zip file cannot be open source, both because it’s a compressed file and because it doesn’t include the compression algorithm.
With that understood, demanding the “tools” that were used to create the code is like asking for the engineers’ notebooks of design thoughts along the way. It has no bearing on your ability to use or modify it. This is not an open source project for making neural nets. This is an open source project of a neural net.
If someone releases math_funcs.py, you don’t need anything about the tools that were used to create math_funcs.py to consider it open source.
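For what it's worth, the "a pickle is code" question can be inspected directly with the standard library: a pickle file is a small program of opcodes that the unpickler's stack machine executes on load. A toy dict below stands in for an actual model file:

```python
import io
import pickle
import pickletools

# Stand-in for a model file: pickling produces a little program of opcodes
# that the unpickler's stack machine executes on load.
payload = {"wq.weight": [0.1, 0.2], "wk.weight": [0.3, 0.4]}
blob = pickle.dumps(payload)

# Disassemble the pickle into its opcodes, much like objdump for an ELF.
out = io.StringIO()
pickletools.dis(blob, out=out)
listing = out.getvalue()
print(listing)
```

The listing shows opcodes like `EMPTY_DICT` and `SHORT_BINUNICODE`, ending in `STOP`, which is why both sides can point at the same file and call it "code" or "a binary".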
So if you use tools that generate some boilerplate code as part of your project you need to include the boilerplate generator otherwise it’s not open source?
Why are you being so obtuse? No, devs don't have to include the source to vim in their repos. They have to include the source files for their product in their repos. I'm confident this just isn't that hard to understand.
These are the source files! I’m going to stop responding to monocasa because I think he is being obtuse and leading me to say things that you are misinterpreting.
There is no expectation to include vim, or any tools required to create a codebase. We agree. And that’s why this repo is sufficient. Asking for the tooling that was used to make this project would be out of scope and unreasonable.
This is a repo that can be used to make predictions. It is not a repo that is used to make models.
The source code of the repo referred to by “open source” is the code of the repo.
You can ask all you want, but that is irrelevant to whether it is open source. If Photoshop were open source, the C++ code would need to be available, not the tooling used to make the C++ code. The C++ code is equivalent to the model, not the separate Python codebase that was involved in making it.
Which is some BSD-licensed PyTorch + PyTorch-calling code that anyone competent in the field can implement any number of ways, and which is not special to this output.
> You can ask all you want but that is irrelevant as to whether it is open source.
There's a pretty good definition of open source at OSI [0], point 2 of which is (emphasis mine):
"The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. *Intermediate forms such as the output of a preprocessor or translator are not allowed.*"
You can't bump the window on what "program" means. "Program" here doesn't mean "predictions", that's the output of the program. If you had a program that generated images, you wouldn't say that that program was the source code of the images. You would say that that program generates images and has source code.
This just isn't an open source release. It's freely released to the public, but it doesn't contain the source used to create or modify it.
> Which is some BSD PyTorch + PyTorch calling code that anyone competent in the field can implement any number of ways and is not special to this output.
> You can't bump the window on what "program" means. "Program" here doesn't mean "predictions", that's the output of the program. If you had a program that generated images, you wouldn't say that that program was the source code of the images. You would say that that program generates images and has source code
My dude… you have no idea what you’re talking about.
The pickled model is the preferred way to interact with and modify it. It is the source code. It is not like a compiled program. It is literally code.
I am NOT claiming the predictions are the program. I am saying the pickled model is. You 100% don’t need anything else to do anything more with the model.
I don’t know or care if they released their model generating code but nobody competent who understands what they are talking about cares about this.
It’s pickled because it’s big. Just imagine this as a zip containing algo.py
> Does any open source product provide all the tools used to make the software? I haven't seen the Linux kernel included in any other open source product, and that'd quite frankly be insane. Same for including vim/emacs, gcc, gdb, X11, etc.
BSD traditionally comes as a full set of source for the whole OS, it's hardly insane.
But the point is you don't need those things to work on Linux - you can use your own preferred editor, compiler, debugger, ... - and you can work on things that aren't Linux with those things. Calling something "open source" if you can only work on it with proprietary tools would be very dubious (admittedly some people do), and calling a project open source when the missing piece you need to work on it is not a general-purpose tool at all but a component that's only used for building this project is an outright falsehood.
But what's proprietary here? That's what I'm not getting from the other person. You have the algorithm. Hell, they even provided the model in pytorch/python. They just didn't provide training parameters and data. But that's not necessary to use or modify the software, just like it isn't necessary for nearly any other open-source project. I mean, we aren't calling PyTorch "not open source" because they didn't provide source code for vim and VS Code. That's what I'm saying. Because at that point, I'm not sure what the difference is from saying "It's not open source unless you provide at least one node of H100 machines." That's what you kinda need to train this stuff.
> But what's proprietary here? That's what I'm not getting from the other person. You have the algorithm. Hell, they even provided the model in pytorch/python. They just didn't provide training parameters and data. But that's not necessary to use or modify the software, just like it isn't necessary for nearly any other open-source project.
It's necessary if you want to rebuild the weights/factors/whatever the current terminology is, which are a major part of what they're shipping. If they found a major bug in this release, the fix might involve re-running the training process, and currently that's something that they can do and we users can't.
> I mean we aren't calling PyTorch "not open source" because they didn't provide source code for vim and VS code.
You can build the exact same PyTorch by using emacs, or notepad, or what have you, and those are standard tools that you can find all over the place and use for all sorts of things. If you want to fix a bug in PyTorch, you can edit it with any editor you like, re-run the build process, and be confident that the only thing that changed is the thing you changed.
You can't rebuild this model without their training parameters and data. Like maybe you could run the same process with an off-the-shelf training dataset, but you'd get a very different result from the thing that they've released - the whole point of the thing they've released is that it has the weights that they've "compiled" through this training process. If you've built a system on top of this model, and you want to fix a bug in it, that's not going to be good enough - without having access to the same training dataset, there's no way for you to produce "this model, but with this particular problem fixed".
(And sure, maybe you could try to work around with finetuning, or manually patch the binary weights, but that's similar to how people will patch binaries to fix bugs in proprietary software - yes it's possible, but the point of open source is to make it easier)
Is this code "open source"?[0] It is under an MIT license, has the training scripts, all the data is highly available, etc. But to the best of my knowledge no one has reproduced their results. These people sure couldn't[1], and I'm not aware of any existing work that did. This is honestly quite common in ML and is quite frustrating as a researcher, especially when you get a round of reviewers who think benchmarks are the only thing that matters (I literally got a work rejected twice with a major complaint being that my implementation didn't beat [0], despite it beating [1]. My paper wasn't even about architecture... so we weren't even trying to improve the SOTA...).
As a researcher I want to know the HPs and datasets used, but they honestly aren't that important for usage. You're right that one way to "debug" these models would be to retrain from scratch. But more likely is doing tuning, reinforcement learning, or using a LoRA. Even the company's engineers would look at those routes before they looked at retraining from scratch. Most of the NLP research world is using pretrained models these days (I don't like this tbh, but that's a different discussion altogether). Only a handful of companies are actually training models. And I mean companies, not academics. Academics don't have the resources (unless partnering), and, without digressing too much, the benchmarkism is severely limiting the ability of academics to be academics. Models are insanely hard to evaluate, especially after being RLHF'd to all hell.
> (And sure, maybe you could try to work around with finetuning, or manually patch the binary weights, but that's similar to how people will patch binaries to fix bugs in proprietary software - yes it's possible, but the point of open source is to make it easier)
The truth is that this is how most ML refinement is happening these days. If you want better refinement we have to have that other discussion.
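The LoRA approach mentioned above is easy to sketch: the pretrained weight matrix W stays frozen, and you train only a small low-rank pair (B, A), using W + BA at inference. A minimal pure-Python illustration with toy 4x4 numbers (real implementations use libraries such as Hugging Face's peft):

```python
def matmul(a, b):
    """Plain-Python matrix multiply, just for the sketch."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Frozen pretrained weight (4x4 identity here): never touched during tuning.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# LoRA adapters: a rank-1 pair, the only trainable parameters.
# 16 frozen values vs. 4 + 4 trainable ones; the savings grow
# quadratically with layer size.
B = [[0.5], [0.0], [0.0], [0.0]]   # 4x1
A = [[0.0, 0.5, 0.0, 0.0]]         # 1x4

# Effective weight at inference time: W + B @ A.
W_eff = add(W, matmul(B, A))

print(W_eff[0])  # [1.0, 0.25, 0.0, 0.0]
```

This is why fine-tuners never need the original training pipeline: the frozen weights plus a tiny trained delta are enough to change the model's behavior.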
> Is this code "open source?"[0] It is under a MIT license, has the training scripts, all the data is highly available, etc. But to the best of my knowledge no one has reproduced their results. These people sure couldn't[1] and I'm not aware of any existing work which did.
I don't know about ML specifically, but I've seen a number of projects where people publish supposedly "the source" for something and it doesn't actually build. IMO if they're doing it wilfully, that makes it not open source, whereas if it's just good-faith legitimate incompetence then it can be.
(My litmus test would be: are they giving you all the stuff they'd give to a new hire/assistant working with them? If they've got a "how to build" on their internal wiki with a bunch of steps they're keeping secret, then it's not open-source. But if the process for a new hire is to hand over a code dump and say "huh, it works on my machine, I don't remember what I did to set it up", then at that point I'd consider it open source. I think this aligns with the "preferred form for making modifications" idea in the licenses).
> But more likely is doing tuning, reinforcement learning, or using a LoRA. Even the company engineers would look at those routes before they looked at retraining from scratch.
Sure. But they'd have that capability in their back pocket for if they needed it. It's a similar story for e.g. parts of the Linux kernel code that are generated via a perl script based on the architecture documentation - you only actually re-run that perl script once in a blue moon, but it's important that they publish the perl script and not just the C that was generated by it.
Build scripts are not required for open source. Usually they are provided, because nobody actually wants to maintain them separately, but they are not actually part of the project itself. Often it's just a few GNU scripts; sometimes parts are missing (because they're reused from another project, or they have secrets in them that the maintainer can't be bothered to remove, or other reasons); rarely, the build script is an entire project itself; and even more rarely there's nothing there at all except a single file of source code that can't be built alone (I've seen this in particular in several old Golang projects, and it's incredibly annoying).
I'm not asking for the engineers brains, I'm asking for more or less what's sitting in the IDE as they work on the project.
Robert has provided that there. Mistral has not.
As an aside, I'm more than capable of editing that code; I've professionally worked on FPGA code and have written a PS1 emulator. Taking that (wonderful-looking) code and, say, fixing a bug, adding a different interface for the CD-ROM, or porting it to a new FPGA are all things I'm more than capable of.
No, but if the Windows binary code was made available with no restrictive licensing, I'd be quite happy, and the WINE devs would be ecstatic. Sure, the source code and build infrastructure would be nicer, but we could still work with that.
'gary_0' being happy with the license terms isn't what defines 'open source'.
I'm fairly happy with the license terms too. They're just not open source. We dilute the term "open source" for the worse if we allow it to apply to build artifacts for some reason.
We were talking about "looking a gift horse in the mouth", as in it's still a positive thing regardless of the semantic quibbles about open source. Nobody would argue that a hypothetical openly licensed Windows binary-only release is "open source" and I'd appreciate it if you read my comments more charitably in future.
Source code licenses are naturally quite clear about what constitutes "source code", but things are murkier when it comes to ML models, training data, and associated software infrastructure, which brings up some interesting questions.
> We were talking about "looking a gift horse in the mouth", as in it's still a positive thing regardless of the semantic quibbles about open source
Your gift-horse-in-the-mouth comment was visibly an aside in the greater discussion, being enclosed in parentheses.
> Nobody would argue that a hypothetical openly licensed Windows binary-only release is "open source" and I'd appreciate it if you read my comments more charitably in future.
That's why I'm using it as an example in my favor: it's clearly not open source even if they released it under Apache 2, because it's not what their engineers edit before building it.
> Source code licenses are naturally quite clear about what constitutes "source code", but things are murkier when it comes to ML models, training data, and associated software infrastructure, which brings up some interesting questions.
I don't think they're all that murky here. The generally accepted definition being
> The “source code” for a work means the preferred form of the work for making modifications to it. “Object code” means any non-source form of a work.
Is this the form of the work that Mistral's engineers work in? Or is there another form of the work that they do their job in and used to build these set of files that they're releasing?
I'd actually say that providing the training data would be more like providing the IDE/debugger/compiler than the model/checkpoint being analogous to source. If I hand you Signal's source code, you can run it, use it, and modify it: all characteristics of what is provided here. What they didn't provide is how they created that code; you couldn't create the software from scratch with just these files, and that's true of any open source project. That said, I wouldn't dismiss the training data as merely analogous to peering into the engineers' minds, because it is an important part of producing the final product and of analyzing it.
If that's what's needed to work at the level their engineers work on the model.
Which is true of traditional software as well. You don't get to call your binary open source just because you have licensed materials in there you can't release.
Not to be the devil's advocate here, but it can certainly be the case that data was used to define heuristics (potentially using automated statistical methods) that an engineer then formalized as code. Without that data the specific heuristic wouldn't exist, at least very likely not in that form. Yet that data does not have to be included in any open source release. And obviously you, as a recipient of the release, can modify the heuristic (or at least the version that was codified), but you cannot re-derive it without the original data.
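A minimal sketch of that situation (the constant, function name, and scenario are all hypothetical, made up purely to illustrate the point): a threshold is tuned offline against private data and then hard-coded, so the shipped code is fully open and modifiable, but the derivation behind the constant is not reproducible from the release.

```python
# Hypothetical heuristic: the cutoff below was (in this scenario) fitted
# against an internal, non-public workload. The code ships in the open
# source release; the data behind the number does not.
SPAM_SCORE_CUTOFF = 0.73  # chosen offline; not re-derivable from this release

def looks_like_spam(score: float) -> bool:
    """You can read and change this heuristic freely, but you cannot
    reconstruct why 0.73 was chosen without the original data."""
    return score >= SPAM_SCORE_CUTOFF

print(looks_like_spam(0.80))
print(looks_like_spam(0.50))
```

Nobody would say this file isn't source code, even though an important input to writing it is absent from the release, which is roughly the tension in the model-weights debate.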
I know my example is not exactly what is happening here, but the two sound pretty similar to me, and there seems to be a fairly blurry line dividing them... so I would argue that where "this must be included in an open source release" ends and "this does not need to be included in an open source release" begins is not always so cut and dried.
(A variant of this, that happens fairly frequently, is when you find a commit that says something along the lines of "this change was made because it made an internal, non-public workload X% faster"; if the data that measurement is based upon did not exist, or if the workload itself didn't exist, that change wouldn't have been made, or maybe it would have been made differently... so again you end up with logic due to data that is not in the open source release)
If we want to go one step further, we could even ask: what about static assets (e.g. images, photographs, other datasets, etc.) included in an open-source release? Maybe I'm dead wrong here, but I have never heard that such assets must themselves be "reproducible from source" (what even is, in this context, the "source" of a photograph?).
That being said, I sure wish the training data used for all of these models was available to everyone...
We just also shouldn't call releases with no source "open source".
I wouldn't really have a complaint with their source being released as Apache 2. I just don't want the term "open source" diluted to including just a release of build artifacts.