Are you suggesting use the clip embedding for the text as a feature to train a s...

daemonologist · 2025-11-14T14:20:09 1763130009

I think they're suggesting doing that with BERT for text and CLIP for images, which in my experience is indeed quite effective (and easy/fast).
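
For anyone who hasn't tried it, here is a rough sketch of the general shape: embed each item with a frozen encoder, then fit a small classifier on the vectors. To keep it runnable offline, `toy_embed` is a hypothetical stand-in (a hashed bag of words), not BERT or CLIP. In practice you would swap in the real encoder's output there. The logistic regression, the example texts, and all function names are my own assumptions for illustration.

```python
import hashlib
import math
from typing import List, Tuple

Vector = List[float]


def toy_embed(text: str, dim: int = 256) -> Vector:
    """Hypothetical stand-in for a frozen encoder (BERT for text, CLIP for images).

    Hashes each token into one of `dim` buckets and L2-normalises the result.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def train_logreg(
    X: List[Vector], y: List[int], epochs: int = 300, lr: float = 0.5
) -> Tuple[Vector, float]:
    """Fit binary logistic regression with plain SGD on fixed embeddings."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - target  # derivative of the log loss w.r.t. z
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def predict(w: Vector, b: float, x: Vector) -> int:
    """Return 1 if the linear score is positive, else 0."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)


# Made-up training data: 1 = spam, 0 = not spam.
texts = [
    "win free prize now",
    "cheap pills discount offer",
    "claim your lottery reward",
    "meeting moved to tuesday",
    "please review my patch",
    "lunch at noon tomorrow",
]
labels = [1, 1, 1, 0, 0, 0]

# The encoder is never updated; only the small classifier is trained.
features = [toy_embed(t) for t in texts]
weights, bias = train_logreg(features, labels)

print([predict(weights, bias, toy_embed(t)) for t in texts])
```

The encoder stays frozen and only the tiny classifier on top gets trained, which is why this tends to be fast even on a CPU.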

There have been some recent developments in the image-of-text/other-than-photograph area, though. For instance, from Meta (although they seem unsure of what exactly their AI division is called): https://arxiv.org/abs/2510.05014 and from Qihoo360: https://arxiv.org/abs/2510.27350.

PaulHoule · 2025-11-14T14:08:33 1763129313

I think he is. I do things like that plenty.