The Illustrated Retrieval Transformer (jalammar.github.io)
75 points by jayalammar on Jan 3, 2022 | 13 comments



Using external memory instead of encoding all of the knowledge in the model's weights will take over every branch of applied ML.

A recognition model should use a similar mechanism: a memory buffer that stores short-term context from previous frames, plus a large external database of long-term key-value pairs that retain relevant semantic information for given embeddings.

Doing so would make it possible to update and expand models without retraining, and would enable much better zero/few-shot learning.

We already have a hacky version of this in our production app for food recognition. For new users we use a standard CNN to predict the items present in the image. Once a user logs a few meals, we use nearest-neighbor search to match new images against previously submitted entries, which works extremely well.


Yes! I have long thought that GPT-type models are huge because they are forced to encode a lot of raw knowledge. Giving them the ability to search for knowledge in a database solves that problem, which should help make them smaller while scaling to larger datasets.

The cherry on top is that you could get not only information from the model but also the sources it used to make up its mind!


Do you apply the NN search on the raw images themselves or on the latent vector from the CNN?


Image embeddings. NN search on raw images would scale horribly and not return anything relevant.
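
A minimal sketch of that kind of embedding lookup (all names made up, not our actual production code; assumes embeddings from the CNN's penultimate layer):

    import numpy as np

    class MealIndex:
        # Brute-force nearest-neighbor search over L2-normalized image
        # embeddings (e.g. the CNN's penultimate-layer activations).
        def __init__(self, dim=512):
            self.vectors = np.empty((0, dim), dtype=np.float32)
            self.labels = []

        def add(self, vec, label):
            vec = vec / np.linalg.norm(vec)       # normalize once at insert time
            self.vectors = np.vstack([self.vectors, vec.astype(np.float32)])
            self.labels.append(label)

        def query(self, vec, k=5):
            vec = vec / np.linalg.norm(vec)
            sims = self.vectors @ vec             # cosine similarity on unit vectors
            top = np.argsort(-sims)[:k]
            return [(self.labels[i], float(sims[i])) for i in top]

At the scale of one user's meal history the brute-force matmul is plenty; a big shared index would want an ANN library like FAISS instead.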


Hi HN,

Summary: The latest batch of language models can be much smaller yet achieve GPT-3-like performance by being able to query a database or search the web for information. A key takeaway is that building larger and larger models is not the only way to improve performance.

Hope you find it useful. All feedback is welcome!


How would you respond to the argument that the size of the database should be accounted for when computing the total size of the model? And how does the latency of the database lookup compare to the extra latency of running the full-size GPT-3?


The size of the database, the training set, the details of the architecture, as well as results on benchmark tasks should all be considered in the comparison. I'm also a fan of Behavioral Testing [1].

Parameter count is not a very accurate measure of model performance. Mixture-of-Experts models like the Switch Transformer [2] can be 1 trillion parameters in size, but they are not 5X the performance of comparable dense models, for example, since each token only activates a small fraction of those parameters.
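
For intuition, a rough numpy sketch of top-1 expert routing, the core idea in Switch (illustrative only; the real implementation adds expert capacity limits, a load-balancing loss, and model parallelism):

    import numpy as np

    def switch_layer(x, gate_w, experts):
        # x: (tokens, d). Each token is routed to exactly ONE expert FFN,
        # so total parameters grow with len(experts) while per-token
        # compute stays roughly constant, which is why raw parameter
        # count overstates these models.
        logits = x @ gate_w                               # (tokens, n_experts)
        z = logits - logits.max(-1, keepdims=True)        # stable softmax
        gate = np.exp(z) / np.exp(z).sum(-1, keepdims=True)
        choice = logits.argmax(-1)                        # top-1 routing
        out = np.zeros_like(x)
        for e, (w1, w2) in enumerate(experts):
            m = choice == e
            h = np.maximum(x[m] @ w1, 0.0)                # expert FFN (ReLU MLP)
            out[m] = (h @ w2) * gate[m, e][:, None]       # scale by router prob
        return out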

They clock the retrieval at 10 ms; it's unclear whether that includes the BERT inference, however. My assumption is that it does not.

[1] https://arxiv.org/abs/2005.04118

[2] https://arxiv.org/abs/2101.03961


I've been thinking about this too. The real driver is just that it's hard to scale models to that size. With the MoE work happening, like GLaM, it will hopefully become easier to scale models in a distributed fashion.


Thanks for all that you do, Jay.

Any plans to do one on MoE for LMs?


I'm intrigued by them and want to dig deeper into the Switch Transformer. So hopefully, yeah, if they continue to show promise.

Thank you!


/s AI finally reached the pinnacle of human intelligence: when asked a question it will now roll eyes and loudly declare: "Let me Google that for you".


Nice

What is the size of the database, and how does it compare to the size of the GPT-3 model? It's quoted as "2 trillion multi-lingual tokens", but how much memory is that?

How much compute does each method need?

Can you run it on a laptop?


Back of the envelope: assuming 3 characters per token and 3 bytes per character (UTF-8 is 1-4 bytes), that's 18 trillion bytes, so about 18 TB. Reasonable as disk size, unreasonable to load into GPU memory.
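
The same arithmetic in code, for anyone who wants to tweak the assumptions:

    tokens = 2e12          # "2 trillion multi-lingual tokens"
    chars_per_token = 3    # assumed rough average
    bytes_per_char = 3     # UTF-8 is 1-4 bytes per character
    total_bytes = tokens * chars_per_token * bytes_per_char
    print(total_bytes / 1e12, "TB")   # -> 18.0 TB of raw text

And that's just the text; the precomputed BERT key vectors for retrieval come on top of that.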

Compute: Building the database requires lots of BERT pre-computation. But at inference time, RETRO needs at least one BERT forward pass (a batch of all the chunks it broke the input prompt into). Then the retrieved neighbors are encoded via the RETRO encoder (which seems small at 2 Transformer layers). Then the input prompt is processed similarly to a 32-layer GPT (with attention to the neighbors).
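
In rough pseudocode (every name below is a hypothetical stand-in, not the paper's actual API):

    def retro_generate(prompt_tokens, bert, retro_encoder, retro_decoder, db):
        # Sketch of the inference path described above.
        # 1. Split the prompt into fixed-size chunks (RETRO uses 64 tokens).
        chunks = [prompt_tokens[i:i + 64]
                  for i in range(0, len(prompt_tokens), 64)]
        # 2. One batched BERT forward pass to embed the chunks.
        queries = bert(chunks)
        # 3. Nearest-neighbor lookup in the 2T-token database
        #    (the ~10 ms step, which may or may not include step 2).
        neighbors = [db.nearest(q, k=2) for q in queries]
        # 4. Encode the neighbors with the small 2-layer RETRO encoder.
        encoded = retro_encoder(neighbors)
        # 5. Run the 32-layer GPT-style decoder, attending to the neighbors.
        return retro_decoder(prompt_tokens, encoded)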

Run on a laptop? Perhaps on CPU. Try running T0 [1], which is larger at 11B parameters.

[1] https://huggingface.co/bigscience/T0pp



