It is exciting that you could train a CLIP-style model from scratch with only 4M datapoints. But if you've got that data, why not fine-tune a pretrained model on your 4M points? It seems likely to outperform the from-scratch approach.
There's a difference not only in the data source but in the pre-training tasks as well.
But you are right: fine-tuned models trained on human-annotated data are much better at image retrieval than zero-shot (pre-trained only) ones.
And that holds for CLIP, ALBEF, VICHA, and UFORM.
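For context, whether you train from scratch or fine-tune, these models optimize the same symmetric contrastive (InfoNCE) objective over paired image and text embeddings. A minimal NumPy sketch (the function name and shapes are my own, not from any of these codebases):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (N, D) embeddings.

    Row i of image_emb and row i of text_emb are assumed to be a
    matching image-text pair; all other rows serve as negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix, scaled by temperature
    logits = img @ txt.T / temperature
    n = logits.shape[0]

    def cross_entropy(l):
        # Softmax cross-entropy with the correct pair on the diagonal
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Fine-tuning just continues minimizing this loss on the new 4M pairs from pretrained weights, so the optimization step is identical; only the starting point differs.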