
More like – it won't be useful to small-time developers (since they won't have the capability to host and run it themselves) and so all the benefits will be reaped by AWS and other large players.



This is what I understood as well. They want to either democratize adoption or not release it. The last thing they/anyone wants is for another BigCo or Govt to take undue advantage of the model (through fine-tuning?) when others cannot.

That said, I can imagine a GPTQ/4-bit quantized model being smaller and easier to run on somewhat-commodity clusters?
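For concreteness, here's a minimal sketch of what that quantization step could look like using the GPTQ integration in Hugging Face transformers, assuming the weights shipped as a standard Hub checkpoint; the model id is hypothetical:

    # 4-bit GPTQ quantization via transformers.
    # Requires the optimum and auto-gptq packages installed.
    # "some-org/open-170b" is a hypothetical checkpoint id.
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    model_id = "some-org/open-170b"  # hypothetical
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # GPTQ needs a small calibration set; "c4" is a built-in option.
    config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=config,
        device_map="auto",  # spread layers across available devices
    )
    model.save_pretrained("open-170b-4bit-gptq")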

Or it could run with GGML/llama.cpp on a cloud instance with a TB of RAM?
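If the weights were converted to a GGML-family format, serving them from CPU RAM could look roughly like this with llama-cpp-python; the file name and thread count are illustrative:

    # Sketch: CPU inference on a high-memory instance via llama-cpp-python.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./open-170b-q4_0.gguf",  # hypothetical quantized file
        n_ctx=2048,
        n_threads=32,  # match the instance's physical core count
    )

    out = llm("Q: What is 4-bit quantization? A:", max_tokens=64)
    print(out["choices"][0]["text"])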

After seeing what people were able to do with LLaMA, I am positive that the community will find a way to run it - albeit with some loss in performance.

It would be truly amazing if they used their compute to release quantized versions as well.


A big chunk of the development based on Facebook's LLaMA model has come from small-time developers and individuals, not large players. Facebook has already shown a viable way to release models in the way you described.


If you really need it, a 170B-parameter model can generate a few tokens per minute on commodity hardware.
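A rough back-of-envelope supports that: dense decoding has to read every weight once per token, so throughput is bounded by memory or storage bandwidth. Under that assumption (all numbers illustrative):

    # Back-of-envelope: tokens/min if inference is bandwidth-bound.
    params = 170e9
    bytes_per_param = 0.5                    # 4-bit quantization
    weight_bytes = params * bytes_per_param  # ~85 GB read per token

    for name, bw in [("NVMe SSD", 3e9), ("DDR RAM", 100e9)]:  # bytes/s
        sec_per_token = weight_bytes / bw
        print(f"{name}: {sec_per_token:.2f} s/token "
              f"({60 / sec_per_token:.2f} tokens/min)")

    # NVMe SSD: 28.33 s/token (2.12 tokens/min)  <- "a few per minute"
    # DDR RAM: 0.85 s/token (70.59 tokens/min)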



