What data shortage problem? I'm not convinced that a shortage of data is what's holding back current-generation LLMs. This isn't like robotics, where every robot is unique and historically you had to start from scratch every time you switched to a different one. It's more likely that we are running into some sort of generalization bottleneck, because the training process gets no feedback at the information/semantic level. There is no loss function for "does the code compile?". Instead, the loss function checks "does the output conform to the dataset?".
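
To make the distinction concrete, here is a rough toy sketch in Python. Everything in it is made up for illustration: `token_agreement` is a crude stand-in for a likelihood-style training loss (real training uses cross-entropy over model probabilities, not exact matches), and `compiles` is a hypothetical semantic check built on Python's built-in `compile()`. The token lists are invented examples, not real model output.

```python
def token_agreement(reference, candidate):
    """Fraction of positions where the candidate matches the reference tokens.

    Crude stand-in for "does the output conform to the dataset?".
    Assumes both token lists have the same length.
    """
    matches = sum(1 for r, c in zip(reference, candidate) if r == c)
    return matches / len(reference)


def compiles(source):
    """Semantic-level check: does the text parse as Python source?

    compile() only parses the code here; it does not run it.
    """
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False


# Invented example: one reference from the "dataset" and two candidate outputs.
reference = ["print", "(", "1", "+", "2", ")"]
candidate_a = ["print", "(", "1", "+", "2", "("]  # one token off, unbalanced parens
candidate_b = ["print", "(", "2", "+", "1", ")"]  # two tokens off, still valid code

for name, cand in [("A", candidate_a), ("B", candidate_b)]:
    score = token_agreement(reference, cand)
    ok = compiles(" ".join(cand))
    print(f"candidate {name}: agreement={score:.2f}, compiles={ok}")
```

Candidate A sits closer to the dataset token by token, but only candidate B is valid code. A signal based purely on conformity to the dataset ranks A above B, which is the kind of missing semantic feedback I mean.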