Open source becomes really complicated once assets with unclear license are invo...

GeneralMayhem · on Feb 10, 2025

Games like that, or the open-source clones of commercial games that require original assets to play (e.g. OpenXCOM), actually give a very clear analogy here: open source does not mean open assets. The software code is under a separate license from the data it processes. Emulators like Dolphin are kind of in this situation too - the program is open, the data it processes is not.

And that's fine! It's still valuable to have access to the source code, even if the "batteries" aren't included. Of course, if you really want to call it an open source model you should include the source for the data scraping/cleaning stages too; then the only thing missing would be the compute time and risk of acquiring dubiously-legal inputs.

I personally prefer a taxonomy like:

* Open weights: you can download the artifact and run it locally, not just use it through an application like ChatGPT or an API.

* Open source: the code that created the artifact is provided in the same format that the authors used to work on it.

* Open data: the dataset that the source code was used on is available for download.

All three of those could be individually licensed or released, for 8 possible combinations. In the analogy to games, they would correspond to the licenses on the retail binary, the source code of the game, and the original uncompressed art assets or Blender projects, respectively.
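To make the count concrete, here is a minimal sketch in Python that enumerates the combinations. The labels and the `release_combinations` helper are illustrative names of my own, not from any library or standard; each of the three pieces is simply treated as released or not.

```python
from itertools import product

# Hypothetical labels for the three independently releasable pieces.
COMPONENTS = ("open weights", "open source", "open data")


def release_combinations():
    """Return every released/not-released assignment for the three pieces."""
    return [
        dict(zip(COMPONENTS, flags))
        for flags in product((False, True), repeat=len(COMPONENTS))
    ]


combos = release_combinations()
for combo in combos:
    released = [name for name, is_open in combo.items() if is_open]
    print(", ".join(released) if released else "nothing released")

print(len(combos))  # 2**3 = 8
```

A model with downloadable weights but no training code or dataset is the entry where only "open weights" is set; the fully closed, API-only case is the "nothing released" entry.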

sigmoid10 · on Feb 13, 2025

If it has already been established that open source doesn't mean open assets, why would we change that now? After all, training data is literally nothing but assets - except that you don't need them to run the application. So in that sense open LLMs are more open than these games.

GeneralMayhem · on Feb 14, 2025

But the training data isn't open...

I agree that open source doesn't mean open assets, but neither does open assets mean open source. You could make a linguistic argument that the training data is part of the "source" of the model (as in, from whence it came), but in any case the point is moot because neither the training data nor the code is open.

amelius · on Feb 10, 2025

OK, but that leaves the tools used to train the model (aka the build scripts). These could be open sourced.