Beating OpenAI CLIP with 100x less data and compute (unum.cloud)
342 points by vov_or on Feb 28, 2023 | 57 comments



From what I understand, the basis for their model is described in these two papers: https://arxiv.org/abs/2107.07651 and https://arxiv.org/abs/2208.13628

Lots of tricks put together for a great final result, it seems.


Thank you! Founder here :) You are right, those are the base papers, but we have extended the set of objectives quite significantly, tapping into modalities that haven’t been publicly CLIP-ed :)

It is probably worth writing a paper about, but we are just too busy building tons of open-source stuff. Check out the GitHub org here: https://github.com/unum-cloud

It is not just about the transformers, but also about databases, networking, and improving the modern data stack for very-large-scale retrieval-based AI. A lot of the pieces may be pre-production, but I believe the amazing HN community may still enjoy the ways we use io_uring, SIMD, and a few other less than popular technologies.


Are the pretraining and training pipelines available anywhere under a FOSS license? I'd love to take a swing at training a mid-fusion model on data other than text and images (e.g., sound, neuron spike trains, etc.)


Not yet, but you can ping our team on Discord or Twitter. They are soft like marshmallows, a couple of compliments and they will be leaking scripts left and right :)


> may still enjoy the ways we use io_uring, SIMD, and a few other less than popular technologies

Fairly standard for any perf-minded shop but good to see more people discovering them.


Man, I just looked at UKV, it looks too good to be true: 30x RocksDB, wtf! Hoping it's true.


He-hey! Yes, we are fast, but I don’t think we ever claimed 30x. We are faster in almost every workload (we lose on range scans for some reason), but at best by 7x (batch reads) and 5x (batch writes). Still, this should be plenty for all intents and purposes! I can post some updates on that tomorrow :)


If you are curious about how it works, here is a pretty good explanation: https://youtube.com/watch?v=ybWeUf_hC7o

For some reason the conference hasn’t made last year’s talks public or searchable, but you should be able to access it with the link.


Where is UDisk? The repo is just a README on configuration.


Yes, we decided to keep UDisk closed-source for now. That repo is just a tiny description of the expected configuration files. At this point UDisk powers our soon-to-be-public cloud offering and is being piloted in a few FAANG-scale companies. Our human resources are very limited for now, but we can probably run a couple more such pilots concurrently. Reach out to info [at] unum.cloud or join our Discord if you are from one of those large companies and want to battle-test our secret sauce on a few petabytes of your data :)


I could not find a license in the Hugging Face repo, but it seems like the codebase is Apache 2.0. Are the pretrained weights / checkpoints also covered under this (or another permissive) license?

In other words, can we use it for commercial purposes for free?


Hi! Just added Apache 2.0 to the HF model card. Thanks!


Are weights even copyrightable under US law? It seems like they'd be the output of an automatic process (the training program) the same way the art/text produced by AI models is, which to my understanding makes them not copyrightable material.


Compression, even lossy compression, doesn't remove copyright. Whether this is more like compression or a more "transformative use" is something the courts will have to decide someday.

It might be a good time to reread What Color Are Your Bits:

https://ansuz.sooke.bc.ca/entry/23


There is a lot of manual process involved, such as writing training scripts, scraping and processing training data, choosing the best weights among several runs, and spending a lot on costly computation. Maybe these should make it copyrightable.


Good question, was about to ask the same!


They seem to be testing only the image retrieval task, but I don’t think CLIP is actually used for image retrieval. In most cases, I see CLIP being used for semantic segmentation, detection, etc. Do these guys have similar results on those tasks?


Hi! I am one of the contributors! We were focused on image retrieval only. Almost all semantic search engines for images are based on CLIP today. We are also building a semantic multimodal search engine as a DBMS component. That is why image retrieval is so crucial for us, as is inference performance. Also, for semantic segmentation and detection, you probably use only the image encoder part of CLIP.
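
For context, the serving side of such a semantic search engine boils down to a nearest-neighbor lookup over precomputed embeddings. Here is a minimal sketch of that lookup, assuming the image vectors were already embedded and L2-normalized offline (the function and argument names are illustrative, not our actual code):

    import numpy as np

    def search(image_index, query_vec, k=10):
        # image_index: [N, dim] matrix of L2-normalized image embeddings, built offline
        # query_vec:   [dim] L2-normalized embedding of the text query
        scores = image_index @ query_vec    # cosine similarity, since vectors are unit length
        return np.argsort(-scores)[:k]      # indices of the best-matching images, best first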


I think it’s fine that you’re focused on retrieval, but you should add that as a caveat to your results: 100 times better at retrieval. As an ML researcher in grad school, here’s the >80% use case of CLIP I’ve seen: take a random image and a random set of text (which can just be categories separated by commas), and CLIP will find the text that best matches your image. CLIP is also incredibly robust at this; you can literally take an image with your phone and it will give you reasonable results. If you speed such a model up by 100x in inference or training, that would be a huge deal to the entire ML research community, and you can expect some best paper awards (maybe even VC capital, looking at Stable Diffusion) to come your way.
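
For concreteness, that workflow looks roughly like the sketch below, written against the public Hugging Face CLIP checkpoint (the labels and the file name are made up for illustration):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # any free-form set of "categories separated by commas" works as labels
    labels = ["a photo of a dog", "a photo of a pizza", "a photo of a laptop"]
    image = Image.open("phone_photo.jpg")  # placeholder file name

    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)
    print(labels[probs.argmax().item()])   # the label CLIP considers the best match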


Hi! You are right that we should clarify that it is "100 times better at retrieval". Btw, we have plans to tune the models, evaluate them, and publish results on other tasks (zero-shot ImageNet classification, etc.).


In practice CLIP can be used for many things. Originally, however, the primary focus was, and is, indeed retrieval. This is obvious from the contrastive loss used, where they minimize errors with regard to a single positive from a batch of thousands of negatives.

This is also informed by existing computer science objectives surrounding indexing, clustering of data and efficient search over data features.
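
Concretely, the objective in question is the symmetric contrastive (InfoNCE) loss: each image is pulled toward its own caption and pushed away from the other captions in the batch, and vice versa. A rough PyTorch sketch (not the exact implementation from any of the papers):

    import torch
    import torch.nn.functional as F

    def clip_style_loss(image_embs, text_embs, temperature=0.07):
        # image_embs, text_embs: [batch, dim], L2-normalized; row i of each is a matching pair
        logits = image_embs @ text_embs.T / temperature   # [batch, batch] similarity matrix
        targets = torch.arange(logits.shape[0])           # true pairs sit on the diagonal
        loss_i2t = F.cross_entropy(logits, targets)       # image -> its caption, rest are negatives
        loss_t2i = F.cross_entropy(logits.T, targets)     # caption -> its image
        return (loss_i2t + loss_t2i) / 2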


You seem to be grossly mistaken: https://openai.com/research/clip mentions image retrieval zero times; the original CLIP was built for robust classification. You provide a thousand words, and contrastively it shows you the best match for your image. You can extend this by splitting the image into patches and classifying each patch for detection, or by fine-tuning a network on top of it for semantic segmentation.


Apologies, I appreciate the correction, and indeed I was mistaken. There is a mention of retrieval at the end of the paper, but the focus is indeed on classification tasks. Here’s the relevant portion in any case.

> Our studies of CLIP in a zero-shot setting show that the model displays significant promise for widely-applicable tasks like image retrieval or search. For example, it can find relevant images in a database given text, or relevant text given an image. Further, the relative ease of steering CLIP toward bespoke applications with little or no additional data or training could unlock a variety of novel applications that are hard for us to envision today, as has occurred with large language models over the past few years.


> The original CLIP was trained on 500x A100 Nvidia GPUs. The latest Open_CLIP trained on 1024x GPUs.

> We trained on the setup of 3x workstations, with 4x RTX 3090 consumer-grade GPUs in each, connected over 200 GBit InfiniBand HDR.

OK, so an 85x improvement on the GPU count (I suspect even better once you account for the difference between data-center and consumer-grade GPUs), but I must still be missing something: where does it say it uses 100x less data?


Look at the “Dataset” column: CLIP was trained on 400M images, UForm on 4M, i.e. 100x less data.


There are also dataset sizes for ALBEF and ViCHA.


This may be a dumb question, but would it be possible to apply these techniques to something like text completion and/or visual question answering, if you went ahead and used the optimizations but still scaled the model up?


Yes, it is possible. The approaches on which our model is based are capable of solving VQA and other similar tasks, showing SOTA results.


Do you know anyone working on a large text completion model based on it?


Do you have/plan to have a text embeddings model?


Yes, we are training text embedding models right now, and we also have plans to open-source some of them! In addition, we train encoders for other modalities for retrieval purposes, for example video data.


It is exciting that you could train a CLIP-style model from scratch with only 4M datapoints. But if you’ve got that data, why not fine-tune a pretrained model with your 4M points? It seems likely to outperform the from-scratch method.


There is a difference not only in the data source but in the pre-training tasks as well. But you are right: models fine-tuned on human-annotated data are way better at image retrieval than zero-shot (just pre-trained) ones. And that holds for CLIP, ALBEF, ViCHA, and UForm.


Any plans to document how to fine-tune your models, then?


It will take some time, but yes, we have this in our plans.


Perhaps this approach can lead to better training of foundation models?


More efficient - for sure!


I read a lot about training models and so on, but very little about inference.

Let's say you came up with a custom model that gives good results. How do you transfer that model so it can be used in an API?


There's no one answer to that, since different models are... different. Beyond just modalities (text input and image output? image input and video output?), there are different common underlying tools used to build them. And then, of course, what do you mean by API? How do you want to interact with it?

As a general thing, you'd take a request that would require an inference step, which would then invoke the model with some parameters and input, and return the output. Beyond that, you'd need more detail.
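
As a hypothetical sketch of that general pattern, a minimal embedding endpoint could look like this with FastAPI (the model class, route, and field names are placeholders, not any particular library's API):

    from fastapi import FastAPI
    from pydantic import BaseModel

    class SomeEncoder:                    # stand-in for the real model wrapper
        def encode(self, text):
            return [0.0] * 256            # pretend embedding; real inference goes here

    app = FastAPI()
    model = SomeEncoder()                 # loaded once at startup, reused across requests

    class EmbedRequest(BaseModel):
        text: str

    @app.post("/embed")
    def embed(req: EmbedRequest):
        return {"embedding": model.encode(req.text)}

    # run with: uvicorn server:app --port 8000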


I specialize in this area and build a product for self-hosted inference.

The challenge in supporting a new model architecture is coding the preprocessing of the inputs (like tokenization, or image resizing and color feature extraction) and the post-processing of the outputs (for example, entity recognition needs to look up the entities and align them with the text).

Once an architecture is coded for the pre/post-processing, serving a new model for inference with that architecture is easy!
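
In skeleton form, that pipeline is just the following (hypothetical helper names; the details of each step depend on the architecture):

    def run_inference(raw_input, model, preprocess, postprocess):
        tensors = preprocess(raw_input)   # e.g. tokenize text, or resize + normalize an image
        outputs = model(tensors)          # the generic forward pass
        return postprocess(outputs)       # e.g. map logits to labels, or align NER spans with the text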


For me, the biggest thing I am looking for is a serverless vector data store. Competitors like Pinecone work just fine, but they go from 0 to 70 as soon as you upgrade to a pod.

If you can figure out pricing based primarily on usage, you can capture a whole segment of this market.


Great point! I would be happy to get more input and brainstorm a good pricing model together, one that is fair both for developers and for users.

We have an open-source project, UKV, that partly overlaps with vector search: https://github.com/unum-cloud/ukv

Another one, UNSW, is a placeholder for now: https://github.com/unum-cloud/unsw

Both will soon be available on cloud marketplaces, but serverless options are a bit harder to cook. Our Discord is the best place to continue the conversation: https://discord.gg/Bbh2bjNhvz

Thank you for the advice!


UKV looks really cool. At Marqo.ai we are achieving end-to-end semantic search by building on top of existing embedding search functionality, and we are keen to take a closer look at what you are doing. Seeing UKV come out on the cloud as serverless will be super interesting; it's something that's taken us a while to work out.


This looks interesting for image retrieval.

I don't love the way their tables[1] report performance though. My understanding is that the "Dataset" column in the table represents the size of the training dataset, not the size of the dataset they are evaluating on. Note that this undersells their performance though, so it isn't like they are trying to hide something here!

Also, I'd love to see someone do a similar benchmark for the OpenAI GPT-3 embeddings. I'm pretty unclear on how well they compare to something like FLAN-T5, because they don't seem to be evaluated anywhere in the retrieval setting (unless I've missed it?).

[1] See "Zero-Shot Image Retrieval, English-only" in https://www.unum.cloud/blog/2023-02-20-efficient-multimodali...


Hi! The MSCOCO and Flickr datasets are the main datasets for image retrieval. The results published in most papers (including CLIP) are based on them, so we used exactly these datasets for evaluation.
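
For reference, the Recall@K numbers reported on these benchmarks are typically computed as follows (a sketch, not our exact evaluation script): for each text query, check whether its paired image appears among the top-K retrieved results.

    import numpy as np

    def recall_at_k(similarity, k):
        # similarity[i, j] = score between query i (a caption) and item j (an image);
        # by convention, the ground-truth match for query i is item i.
        ranked = (-similarity).argsort(axis=1)                 # best items first, per query
        correct = np.arange(similarity.shape[0])[:, None]
        return (ranked[:, :k] == correct).any(axis=1).mean()

    # reported as Recall@1, Recall@5, Recall@10:
    # recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10)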


Not sure if I'm blind, but what is the number of parameters?


143M for the English model, 206M for the Multilingual one.


How did you deal with data contamination?


The datasets we used are pretty clean by themselves if we compare them with LAION. But we also filtered out images with captions printed on them, and filtered by CLIP scores. Btw, huge thanks to the LAION and Open_CLIP projects! They inspire us a lot.
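
To illustrate what a CLIP-score filter looks like, here is a sketch using the public OpenAI checkpoint (the checkpoint and the threshold are examples, not the exact ones we used):

    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def keep_pair(image, caption, threshold=0.25):
        inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            img = model.get_image_features(pixel_values=inputs["pixel_values"])
            txt = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
        score = torch.cosine_similarity(img, txt).item()
        return score >= threshold   # drop weakly matched image-caption pairs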


Did the author report metrics of the unimodal model or of the multimodal model with re-ranking?


The results are reported with the multimodal model.


The sample code has an error in it: it uses `model` before initializing it.


Thanks! Seems like a typo. It will be fixed soon.


Am I the only one who is very confused what this is?


This is a good introduction to OpenAI CLIP, which should help provide context: https://openai.com/research/clip


Thank you for this primer!



