To expand on this, what we're seeing with LLaMA is that you can fine-tune your model using the output of other models.
It's not clear that the quality will be exactly the same (in fact it will very likely be worse), but working generators are essentially ways to quickly generate training data. And I can't think of a legal argument for why generated output from a model would be less legal to use as training data than an unlicensed photo off of DeviantArt.
Nobody has really called out OpenAI on this, but OpenAI has a clause in its TOS that you won't use output to build a competing model. But that's... just in its TOS. If you don't have an OpenAI account, it's not immediately clear to me (IANAL) why you can't use any of the leaked training sets that other people have generated with ChatGPT to help align a commercial model.
Certainly if someone makes the argument that generators like Copilot/Midjourney aren't violating copyright by learning from their sources, it's very hard to then argue that Copilot/Midjourney output is somehow different and can't be used to help generate training datasets.