Training data quality and quantity are the bottleneck.
"Chinchilla showed that we need to be using 11× more data during training than that used for GPT-3 and similar models. This means that we need to source, clean, and filter to around 33TB of text data for a 1T-parameter model." https://lifearchitect.ai/chinchilla/
GPT-4 has been trained on images partly for this reason (the extra training data alone might not have justified it, separately from multi-modality, but together these two advantages seem decisive).
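To make the quoted figures concrete, here is a rough back-of-envelope sketch (my own, not the calculation behind lifearchitect.ai's numbers): it assumes the commonly cited ~20-tokens-per-parameter reading of the Chinchilla result, GPT-3's ~300B training tokens for 175B parameters, and a bytes-per-token value picked so the estimate lands near the quoted ~33TB.

```python
# Back-of-envelope sketch of the Chinchilla arithmetic quoted above.
# Assumptions (mine, not from the quote): the ~20-tokens-per-parameter rule as a
# summary of the Chinchilla result, GPT-3's ~300B training tokens for 175B
# parameters, and a bytes-per-token figure chosen to land near the quoted ~33TB.

CHINCHILLA_TOKENS_PER_PARAM = 20     # ~20 training tokens per parameter (Hoffmann et al., 2022)
GPT3_PARAMS = 175e9                  # GPT-3: 175B parameters
GPT3_TOKENS = 300e9                  # GPT-3: ~300B training tokens

gpt3_tokens_per_param = GPT3_TOKENS / GPT3_PARAMS                 # ~1.7 tokens per parameter
multiplier = CHINCHILLA_TOKENS_PER_PARAM / gpt3_tokens_per_param
print(f"Data multiplier vs. GPT-3: ~{multiplier:.1f}x")           # ~11.7x, i.e. the quoted ~11x

target_params = 1e12                 # a 1T-parameter model
target_tokens = CHINCHILLA_TOKENS_PER_PARAM * target_params       # ~20T tokens
BYTES_PER_TOKEN = 1.65               # assumed; varies with tokenizer and corpus
print(f"Tokens needed: ~{target_tokens / 1e12:.0f}T "
      f"(~{target_tokens * BYTES_PER_TOKEN / 1e12:.0f} TB of cleaned text)")
```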
"Chinchilla showed that we need to be using 11× more data during training than that used for GPT-3 and similar models. This means that we need to source, clean, and filter to around 33TB of text data for a 1T-parameter model." https://lifearchitect.ai/chinchilla/
GPT4 has been trained on images exactly for this reason (it might not have been worth it separately from multi-modality, but together these two advantages seem decisive).