The way to think about it is that training a neural network is a form of compression that is very, very lossy. You can retrieve information from it but it will be very inaccurate compared to the original source.
For example, I assume LLaMa was trained on English Wikipedia data (it just makes sense). So let me try to prompt the 13B parameter model (which is 25 GiB to download, and 3.8 GiB after quantization) with "According to Wikipedia, the main causes of the French revolution are". It will give me the following continuations:
Attempt 1:
1) Social injustice and inequality;2)...
i need to do a report about the french revolution for my history class. so far i have these three questions:... [end of text]
Attempt 2:
1. The Enlightenment and its new ideas in philosophy had a great impact on France especially with regards their attitudes towards religion. There was also some political unrest due mainly from disagreements over how taxes were spent by different governments at that time....
What Were Three Causes of the French Revolution? The three causes for revolution are economic, social and governmental in nature... [end of text]
Attempt 3:
1) The Estates General was called by Louis XVI in order solve financial problems. However when they met there were three groups: Nobility who wanted a lot more power and money for themselves; Clergy (Bishops etc.) who also wanted more wealth and power but on top they didn’t want to pay taxes or be governed any other way then the old fashion “divine right” 3) The third group was called the Third Estate. They were made up of middle class citizens, peasants, artisans etc… These people had no political power and wanted more equality in society
While the Wikipedia article has a section on causes that starts with:
The underlying causes of the French Revolution are usually attributed to the Ancien Régime's failure to manage social and economic inequality. Rapid population growth and the inability to adequately finance government debt resulted in economic depression, unemployment and high food prices. Combined with a regressive tax system and resistance to reform by the ruling elite, it resulted in a crisis Louis XVI proved unable to manage.
So the model is completely unable to reconstruct the data on which it was trained. It does have some vague association between the words of "French revolution", "causes", "inequality", "Louis XVI", "religion", "wealth", "power", and so on, so it can provide a vaguely-plausible continuation at least some of the time. But it's clear that a lot of information has been erased.
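One rough way to put a number on "unable to reconstruct" is to look for the longest run of consecutive words that a continuation shares with the Wikipedia paragraph. The sketch below is only an illustration: `longest_shared_run` is a hypothetical helper name, the tokenization is a crude lowercase word split, and the two strings are the Attempt 2 text and the Wikipedia excerpt quoted above.

```python
import re
from difflib import SequenceMatcher


def longest_shared_run(generated: str, reference: str) -> int:
    """Length, in words, of the longest word sequence found in both texts."""
    # Crude tokenization (an assumption): lowercase, split on non-word characters.
    a = re.findall(r"\w+", generated.lower())
    b = re.findall(r"\w+", reference.lower())
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size


attempt_2 = (
    "The Enlightenment and its new ideas in philosophy had a great impact on "
    "France especially with regards their attitudes towards religion. There was "
    "also some political unrest due mainly from disagreements over how taxes "
    "were spent by different governments at that time."
)

wikipedia = (
    "The underlying causes of the French Revolution are usually attributed to "
    "the Ancien Régime's failure to manage social and economic inequality. "
    "Rapid population growth and the inability to adequately finance government "
    "debt resulted in economic depression, unemployment and high food prices. "
    "Combined with a regressive tax system and resistance to reform by the "
    "ruling elite, it resulted in a crisis Louis XVI proved unable to manage."
)

# A verbatim reconstruction would share a long run of words; a loose
# paraphrase shares only isolated common words.
print(longest_shared_run(attempt_2, wikipedia))
```

A memorized passage would produce a run dozens of words long; a short run only says the wording differs, not that the facts are wrong.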
The training sources and weights are public info. Less than 5% of the training data came from Wikipedia, and that share is spread across many languages. English Wikipedia article text alone is ~22 GB when losslessly compressed, so it's no surprise the model doesn't give the original articles back.
CCNet [67%], C4 [15%], GitHub [4.5%], Wikipedia [4.5%], Books [4.5%], ArXiv [2.5%], Stack Exchange [2%]. The Wikipedia and Books domains include data in the following languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk
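The numbers quoted in this thread can be put side by side as a back-of-the-envelope check. This is a sketch, not a claim about how capacity is actually allocated inside a network: it assumes, purely for illustration, that the quantized model's bytes are split in proportion to the training mix, and it treats the whole 4.5% Wikipedia share as English even though it covers 20 languages.

```python
GIB = 2**30   # bytes in one GiB
GB = 10**9    # bytes in one GB

# Figures quoted in the thread.
quantized_model_bytes = 3.8 * GIB              # 13B model after quantization
wikipedia_share = 0.045                        # Wikipedia fraction of the training mix
english_wikipedia_compressed_bytes = 22 * GB   # losslessly compressed article text

# Even if the entire model stored nothing but English Wikipedia,
# it would have to beat lossless compression by this factor.
whole_model_ratio = english_wikipedia_compressed_bytes / quantized_model_bytes

# ASSUMPTION (crude heuristic): capacity proportional to the training mix.
wikipedia_budget_bytes = quantized_model_bytes * wikipedia_share
proportional_ratio = english_wikipedia_compressed_bytes / wikipedia_budget_bytes

print(f"whole model vs. compressed English Wikipedia: {whole_model_ratio:.1f}x too small")
print(f"proportional Wikipedia budget: {wikipedia_budget_bytes / 1e6:.0f} MB")
print(f"proportional budget vs. compressed English Wikipedia: {proportional_ratio:.0f}x too small")
```

Either way the model is far smaller than the text it would need to hold, so verbatim recall of arbitrary articles is not expected.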
Interesting that so many people seem to want the bugs in these LLMs to be rebranded as features.
Memorization and plagiarism used to be undesired problems, things to be worked on and eliminated. Amazing job of PR here to try to reframe them as a benefit.