It is exciting that you could train a CLIP-style model from scratch with only 4M...

vov_or · on Feb 28, 2023

There is not only a difference in the data source but pre-trained tasks as well. But you are right, a fine-tuned models on human-annotated data are way better than zero-shot (just pre-trained) on Image retrieval. And it is correct for CLIP, ALBEF, VICHA, and UFORM.

ttt3ts · on Feb 28, 2023

Any plans to document how to fine tune your models then?

vov_or · on Feb 28, 2023

It will take some time, but yes, we have this in our plans.

riku_iki · on Feb 28, 2023

perhaps this approach can lead to better training of foundational models?..

vov_or · on Feb 28, 2023

More efficient - for sure!