
Open source becomes really complicated once assets with unclear licenses are involved in any way. Lots of people, for example, would say that Jedi Knight 2 is open source because Raven Software released the source code and tools needed to build the game. But that alone doesn't mean you can run it, because you still need to get hold of all the assets (models, textures, sounds), which may or may not still be the property of LucasArts or its successors. Even if you have them, it's unclear whether it is legal to use them this way. So while there are tons of people working on mods and conversions, no one in their right mind would distribute all the source assets.

In much the same way, no sane company will touch the legal nightmare of releasing LLM training data scraped from public websites. Even releasing the LLM alone might be infringement; there are court cases being fought over this right now.



Games like that, or the open-source clones of commercial games that require original assets to play (e.g. OpenXCOM), actually give a very clear analogy here: open source does not mean open assets. The software code is under a separate license from the data it processes. Emulators like Dolphin are kind of in this situation too - the program is open, the data it processes is not.

And that's fine! It's still valuable to have access to the source code, even if the "batteries" aren't included. Of course, if you really want to call it an open source model you should include the source for the data scraping/cleaning stages too; then the only thing missing would be the compute time and risk of acquiring dubiously-legal inputs.

I personally prefer a taxonomy like:

* Open weights: you can download the artifact and run it locally, not just use it through an application like ChatGPT or an API.

* Open source: the code that created the artifact is provided in the same format that the authors used to work on it.

* Open data: the dataset that the source code was used on is available for download.

All three of those could be individually licensed or released, for 8 possible combinations. In the analogy to games, they would correspond to the licenses on the retail binary, the source code of the game, and the original uncompressed art assets or Blender projects, respectively.
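
To make the "8 possible combinations" concrete, here is a minimal sketch that enumerates them. It assumes each of the three categories above is a simple open/closed flag; the names `COMPONENTS` and `release_combinations` are made up for illustration and don't come from any real tool.

```python
from itertools import product

# Hypothetical labels for the three independently releasable parts.
COMPONENTS = ("open weights", "open source", "open data")


def release_combinations():
    """Return every open/closed assignment for the three components."""
    return [
        dict(zip(COMPONENTS, flags))
        for flags in product((False, True), repeat=len(COMPONENTS))
    ]


combos = release_combinations()
print(len(combos))  # 2 ** 3 = 8

for combo in combos:
    opened = [name for name, is_open in combo.items() if is_open]
    print(", ".join(opened) if opened else "nothing open")
```

A model that ships only downloadable weights would be the combination where "open weights" is the single flag set; a fully open release sets all three.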


If it has already been established that open source doesn't mean open assets, why would we change that now? After all, training data is literally nothing but assets - except that you don't need them to run the application. So in that sense open LLMs are more open than these games.


But the training data isn't open...

I agree that open source doesn't mean open assets, but neither does open assets mean open source. You could make a linguistic argument that the training data is part of the "source" of the model (as in, from whence it came), but in any case the point is moot because neither the training data nor the code is open.


OK, but that leaves the tools used to train the model (aka the build scripts). These could be open sourced.



