Thank you! Founder here :)
You are right, those are the base papers, but we have extended the set of objectives quite significantly, tapping into modalities that haven’t been publicly CLIP-ed :)
It is probably worth writing a paper about, but we are just too busy building tons of open-source stuff. Check out the GitHub org here: https://github.com/unum-cloud
It is not just about the transformers, but also about databases, networking, and improving the modern data stack for very large-scale retrieval-based AI. A lot of the pieces may be pre-production, but I believe the amazing HN community may still enjoy the ways we use io_uring, SIMD, and a few other less-than-popular technologies.
Are the pretraining and training pipelines available anywhere under a FOSS license? I'd love to take a swing at training a mid-fusion model on data other than text and images (e.g., sound, neuron spike trains, etc.)
Not yet, but you can ping our team on Discord or Twitter. They are soft like marshmallows, a couple of compliments and they will be leaking scripts left and right :)
He-hey! Yes, we are fast, but I don’t think we ever claimed 30x. We are faster in almost every workload (we lose on range scans, for some reason), but at best by 7x (batch reads) and 5x (batch writes). Still, this should be plenty for all intents and purposes! I can post some updates on that tomorrow :)
Yes, we decided to keep UDisk closed source for now. That repo is just a tiny description for the expected configuration files. At this point UDisk powers our soon-to-be-public cloud offering and is piloting in a few FAANG scale companies. Our human resources are very limited for now, but we can probably run a couple more such pilots concurrently. Reach out to info [at] unum.cloud or join our Discord if you are from one of those large companies and want to battle-test our secret sauce on a few Petabytes of your data :)
I could not find a license in the HuggingFace repo, but it seems like the codebase is Apache 2.0. Are the pretrained weights / checkpoints also covered under this (or another permissive) license?
In other words, can we use it for commercial purposes for free?
Are weights even copyrightable under US law? It seems like they'd be the output of an automatic process (the training program) the same way the art/text produced by AI models is, which to my understanding makes them not copyrightable material.
Compression, even lossy compression, doesn't remove copyright. Whether this is more like compression or a more "transformative use" is something the courts will have to decide someday.
It might be a good time to reread "What Colour Are Your Bits?":
There is a lot of manual work involved, such as writing training scripts, scraping and processing training data, choosing the best weights among several runs, and spending a lot on costly computation. Maybe that should make them copyrightable.
They seem to be testing only the image retrieval task, but I don’t think CLIP is actually used for image retrieval. In most cases, I see CLIP being used for semantic segmentation, detection, etc. Do these guys have similar results on those tasks?
Hi!
I am one of the contributors!
We were focused on image retrieval only. Almost all semantic search engines for images are based on CLIP today. We are also building a semantic multimodal search engine as a DBMS component. That is why image retrieval, as well as inference performance, is so crucial for us.
Also, for semantic segmentation and detection, you would probably use only the image encoder part of CLIP.
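If it helps, here is a minimal sketch of that pattern, assuming the Hugging Face transformers CLIP classes and an off-the-shelf OpenAI checkpoint (not our model); the image path is just a placeholder:

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Only the vision tower is loaded here; the text encoder never enters the picture.
model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_embeds = model(**inputs).image_embeds  # shape (1, 512) for this checkpoint
```

Those per-image embeddings (or the patch-level features one layer earlier) are usually what segmentation and detection heads get built on.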
I think it’s fine that you’re focused on retrieval, but you should add that as a caveat to your results: 100 times better at retrieval.
As an ML researcher in grad school, here’s the >80% use case of CLIP I’ve seen:
1. Take a random image and a random set of text (it can just be categories separated by commas). CLIP will find the text that best matches your image. CLIP is also incredibly robust at this: you can literally take an image with your phone and it will give you reasonable results. If you sped such a model up by 100x in inference or training, that would be a huge deal to the entire ML research community, and you could expect some best-paper awards (maybe even VC capital, looking at Stable Diffusion) to come your way.
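For concreteness, a minimal sketch of that workflow, assuming the Hugging Face transformers wrapper around the original OpenAI checkpoint (the image path and label list are just placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any phone photo and any comma-separated list of candidate labels will do.
image = Image.open("phone_photo.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a laptop"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, len(labels))
probs = logits.softmax(dim=-1)

print(labels[probs.argmax().item()])  # the label CLIP matches best to the image
```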
Hi!
You are right, we should have clarified that it is "100 times better at retrieval".
Btw, we have plans to tune models, evaluate them, and publish results on different tasks (zero-shot ImageNet classification, etc.).
In practice, CLIP can be used for many things. Originally, however, the primary focus was/is indeed retrieval. This is obvious from the contrastive loss used, where errors are minimized with respect to a single positive against a batch of thousands of negatives.
This is also informed by existing computer science objectives surrounding indexing, clustering of data, and efficient search over data features.
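For reference, the symmetric in-batch contrastive objective looks roughly like this (a simplified PyTorch sketch; the function name and temperature value are illustrative, not pulled from any particular codebase):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (image, text) pairs: for each image,
    the paired text is the single positive and the rest of the batch acts as
    negatives, and vice versa."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = positives

    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```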
You seem to be grossly mistaken: https://openai.com/research/clip
Image retrieval is mentioned zero times; the original CLIP was built for robust classification. You provide a thousand candidate labels and, contrastively, it shows you the best match for your image. You can extend this by splitting the image into patches and classifying each patch for detection, or by fine-tuning a network on top of it for semantic segmentation.
Apologies, I appreciate the correction; indeed I was mistaken. There is a mention of retrieval at the end of the paper, but the focus is indeed on classification tasks. Here’s the relevant portion in any case.
> Our studies of CLIP in a zero-shot setting show that the model displays significant promise for widely-applicable tasks like image retrieval or search. For example, it can find relevant images in a database given text, or relevant text given an image. Further, the relative ease of steering CLIP toward bespoke applications with little or no additional data or training could unlock a variety of novel applications that are hard for us to envision today, as has occurred with large language models over the past few years.
> The original CLIP was trained on 500x A100 Nvidia GPUs. The latest Open_CLIP trained on 1024x GPUs.
> We trained on the setup of 3x workstations, with 4x RTX 3090 consumer-grade GPUs in each, connected over 200 GBit InfiniBand HDR.
OK, so an ~85x improvement in GPU count (3 workstations x 4 RTX 3090s = 12 GPUs vs. 1024, and I suspect even better once you take into account the difference in consumer-grade GPUs), but I must still be missing something: where does it say it uses 100x less data?
This may be a dumb question, but would it be possible to apply these techniques to something like text completion and/or visual question answering? If you went ahead and used the optimizations but still scaled the model up?
Yes, we are training text embedding models right now, and we also have plans to open-source some of them!
In addition, we are training encoders for other modalities for retrieval purposes, for example, video data.
It is exciting that you could train a CLIP-style model from scratch with only 4M datapoints. But if you’ve got that data, why not fine tune a pretrained model with your 4M points? It seems likely to outperform the from-scratch method.
There is a difference not only in the data sources but in the pre-training tasks as well.
But you are right: models fine-tuned on human-annotated data are way better at image retrieval than zero-shot (just pre-trained) ones.
And that holds for CLIP, ALBEF, ViCHA, and UForm.
There's no one answer to that, since different models are... different. Beyond just modalities (text input and image output? image input and video output?), there are different common underlying tools used to build them. And then, of course, what do you mean by API? How do you want to interact with it?
As a general thing, you'd take a request that requires an inference step, invoke the model with some parameters and the input, and return the output. Beyond that, you'd need more detail.
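As a bare-bones sketch of that request -> inference -> response loop (FastAPI is just one illustrative choice, and the toy linear layer stands in for whatever model you actually load):

```python
# Illustrative sketch only: FastAPI is one of many serving options, and the toy
# linear layer below is a stand-in for whatever model you load at startup.
from typing import List

import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Placeholder "model": loaded once when the process starts, not per request.
model = torch.nn.Linear(16, 8)
model.eval()

class InferRequest(BaseModel):
    features: List[float]  # expects 16 values for this placeholder model

class InferResponse(BaseModel):
    output: List[float]

@app.post("/infer", response_model=InferResponse)
def infer(request: InferRequest) -> InferResponse:
    # The inference step: turn the payload into a tensor, invoke the model,
    # and return the output as a plain JSON-serializable list.
    x = torch.tensor(request.features, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        y = model(x)
    return InferResponse(output=y.squeeze(0).tolist())
```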
I specialize in this area and build a product for self-hosted inference.
The challenge in supporting a new model architecture is coding the preprocessing for the inputs (like tokenization, or image resizing and color feature extraction) and the post-processing for the outputs (for example, entity recognition needs to look up the entities and align them with the text).
Once the pre/post-processing is coded for an architecture, serving a new model of that architecture for inference is easy!
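In code, the pattern is roughly this generic sketch (the three callables are placeholders for an architecture's real tokenizer/resizer, network, and decoding step):

```python
from dataclasses import dataclass
from typing import Any, Callable

import torch

@dataclass
class ServedArchitecture:
    """Generic sketch: once these three pieces exist for an architecture,
    any new checkpoint of that architecture can be served the same way."""
    preprocess: Callable[[Any], Any]   # raw request payload -> model-ready tensors
    model: Callable[[Any], Any]        # tensors -> raw outputs (logits, embeddings, ...)
    postprocess: Callable[[Any], Any]  # raw outputs -> response payload

    def __call__(self, payload: Any) -> Any:
        inputs = self.preprocess(payload)
        with torch.no_grad():
            raw = self.model(inputs)
        return self.postprocess(raw)
```

Swapping in a new model of the same architecture only touches the `model` field; the pre/post-processing code is reused as-is.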
For me, the biggest thing I am looking for is a serverless vector data store. Competitors like Pinecone work just fine, but the price goes from $0 to $70 as soon as you upgrade to a pod.
If you can figure out pricing primarily based on usage, you can capture a whole segment of this market.
Both will soon be available on cloud marketplaces, but serverless options are a bit harder to cook. Our Discord is the best place to continue the conversation: https://discord.gg/Bbh2bjNhvz
UKV looks really cool. At Marqo.ai we are doing end-to-end semantic search built on top of existing embedding search functionality, and we're keen to take a closer look at what you are doing. Seeing UKV come out serverless on the cloud will be super interesting; it's something that's taken us a while to work out.
I don't love the way their tables[1] report performance, though. My understanding is that the "Dataset" column in the table represents the size of the training dataset, not the size of the dataset they evaluate on. Note that this undersells their performance, so it isn't like they are trying to hide something here!
Also, I'd love to see someone do a similar benchmark for the OpenAI GPT-3 embeddings. I'm pretty unclear on how well they compare to something like FLAN-T5, because they don't seem to be evaluated anywhere in the retrieval setting (unless I've missed it?).
Hi!
The MS-COCO and Flickr datasets are the main datasets for image retrieval. The results published in most papers (including CLIP) are based on them, so we used exactly these datasets for evaluation.
The datasets we used are pretty clean in themselves if we compare them with LAION.
But we also filtered out images with captions printed on them, and filtered further by CLIP scores.
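Roughly, the score-based filtering looks like this (a simplified sketch; the off-the-shelf checkpoint and the threshold are illustrative, not the exact ones we used):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Simplified sketch: score each (image, caption) pair with an off-the-shelf CLIP
# and drop pairs whose image-text cosine similarity falls below a threshold.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, caption: str) -> float:
    image = Image.open(image_path)
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb * text_emb).sum())

THRESHOLD = 0.25  # illustrative cut-off
pairs = [("cat.jpg", "a cat sleeping on a sofa")]  # placeholder data
kept = [(img, cap) for img, cap in pairs if clip_score(img, cap) >= THRESHOLD]
```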
Btw, huge thanks to the LAION and OpenCLIP projects! They inspire us a lot.
A lot of tricks put together for a great final result, it seems.