> Reasonable to assume that in 1-2 years it will also come down in cost.
Definitely. I'm guessing they used something like quantization to get the VRAM usage down to 4-bit weights. The thing is, if you can't fit the weights in memory then you have to chunk them, and that's slow = more GPU time = more cost. And even when everything does fit in GPU memory, less memory = fewer GPUs needed.
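As a rough sketch of why the bit-width matters so much (weights only, ignoring activations and KV cache, so real usage is somewhat higher):

```python
# Back-of-envelope weight memory: params * bits_per_param / 8 bytes.
# Ignores activations, KV cache and framework overhead.

def weight_gb(n_params_billion: float, bits_per_param: int) -> float:
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

for bits in (16, 8, 4):
    print(f"175B @ {bits}-bit ≈ {weight_gb(175, bits):.0f} GB")
# 175B @ 16-bit ≈ 350 GB
# 175B @ 8-bit ≈ 175 GB
# 175B @ 4-bit ≈ 88 GB
```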
But we know you _can_ use fewer parameters, and that the training data + RLHF make a massive difference in quality. And model size scales roughly linearly with the VRAM requirements/cost.
So if you can get a 60B model to run at 175B quality, you've cut your memory requirements to roughly a third, and (with 4-bit quantization) it can now run on a single 80GB A100, which is 1/8th of the 8x A100s that GPT-3.5 reportedly ran on (and still half of what GPT-3.5 would need even with 4-bit quantization).
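Back-of-envelope GPU-count math for those numbers, assuming 80GB A100s and weights-only memory (real deployments need extra headroom for activations/KV cache and parallelism, which is presumably part of where the 8x figure comes from):

```python
import math

A100_GB = 80  # memory per GPU

def gpus_needed(params_billion: float, bits: int) -> int:
    weight_gb = params_billion * bits / 8  # (params_billion * 1e9 * bits / 8) / 1e9
    return math.ceil(weight_gb / A100_GB)

print(gpus_needed(175, 16))  # 5 for weights alone; 8x in practice with headroom
print(gpus_needed(175, 4))   # 2
print(gpus_needed(60, 4))    # 1
```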
Also, while OpenAI likely doesn't want this, we really want these models to run on our own devices, and LLaMA + finetuning has shown promising improvements (not there just yet) at the 7B size, which can run on consumer devices.
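Same weights-only estimate at 7B, which is what makes the consumer-device case plausible:

```python
# 7B weights at different precisions (weights only, no activations/KV cache).
for bits in (16, 8, 4):
    gb = 7e9 * bits / 8 / 1e9
    print(f"7B @ {bits}-bit ≈ {gb:.1f} GB")
# 7B @ 16-bit ≈ 14.0 GB
# 7B @ 8-bit ≈ 7.0 GB
# 7B @ 4-bit ≈ 3.5 GB  -> fits alongside everything else on a typical consumer GPU/laptop
```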
GPT-3 on release was more expensive ($0.06 per 1K tokens, vs $0.03 per 1K input and $0.06 per 1K output for GPT-4).
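Quick comparison at those quoted prices, for a hypothetical request of 1K prompt + 1K completion tokens:

```python
# Per-request cost using the per-1K-token prices quoted above.
prompt_tokens, completion_tokens = 1000, 1000

gpt3_launch = 0.06 * (prompt_tokens + completion_tokens) / 1000       # flat rate
gpt4 = (0.03 * prompt_tokens + 0.06 * completion_tokens) / 1000       # split input/output

print(f"GPT-3 at launch: ${gpt3_launch:.2f}")  # $0.12
print(f"GPT-4 now:       ${gpt4:.2f}")         # $0.09
```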
Reasonable to assume that in 1-2 years it will also come down in cost.