I agree with your premise: I have used the 65B variants and of course they're not as good as OpenAI's models. GPT-3 has 175B parameters, and OpenAI has done more RLHF than anyone else. Why would we expect comparable performance from models a fraction of the size with a pittance of the fine-tuning?
That said, it’s clear that replicating GPT-4+ performance is within the resources of a number of large tech orgs.
And the smaller models can definitely still be useful for plenty of tasks.
I'd agree the secret sauce behind how well the newest services perform is probably in the fine-tuning. We're seeing almost daily releases of fine-tuning datasets, training methods, and models (at lower and lower costs), so I'm personally pretty optimistic that we'll see big improvements in self-hosted LLM performance pretty quickly.