This is where the terminology being used to discuss LLMs today is a touch awkward and imprecise.
There's a key distinction between smaller models trained with transfer learning and smaller LLMs that are merely fine-tuned but still used via in-context learning.
Transfer learning means you're training an output network specifically for the task you're doing. So like, if you're doing classification, you output a vector with one element per class, apply a softmax transformation, and train on a negative log likelihood objective. This is direct and effective.
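To make that concrete, here's a minimal PyTorch sketch of that kind of task-specific output network and objective. The names (`ClassificationHead`, `hidden_dim`, the random `pooled` tensor standing in for a pretrained encoder's output) are placeholders for illustration, not anyone's actual setup:

```python
import torch
import torch.nn as nn

# Hypothetical setup: assume some pretrained encoder maps a batch of inputs
# to a pooled hidden vector of size `hidden_dim`; we only train the head here.
hidden_dim, num_classes = 768, 5

class ClassificationHead(nn.Module):
    """Task-specific output network: one logit per class."""
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, num_classes)

    def forward(self, pooled):
        return self.linear(pooled)  # raw logits, shape (batch, num_classes)

head = ClassificationHead(hidden_dim, num_classes)

# CrossEntropyLoss applies log-softmax and the negative log likelihood
# objective in one step, which is exactly the setup described above.
loss_fn = nn.CrossEntropyLoss()

pooled = torch.randn(8, hidden_dim)            # stand-in for encoder output
labels = torch.randint(0, num_classes, (8,))   # one gold class per example

logits = head(pooled)
loss = loss_fn(logits, labels)
loss.backward()
```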
Fine-tuning a smaller LLM so that it's still doing text generation, just better at the kinds of tasks you want, is a much more mixed experience. Text generation is still really difficult, and so is learning to follow instructions, so all of this still really favours size.
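For contrast, here's a rough sketch of the objective you're optimising in that case, again in PyTorch, with a hypothetical `lm` standing in for the smaller model. The point is that it's still generic next-token prediction over the whole vocabulary, not a task-specific output network:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(lm, token_ids):
    """Standard causal LM fine-tuning loss on a batch of token IDs.

    `lm` is assumed to return next-token logits of shape
    (batch, seq_len, vocab_size); it is a placeholder, not a real API.
    """
    logits = lm(token_ids)                                  # (B, T, V)
    # Shift so each position predicts the *next* token.
    preds = logits[:, :-1, :].reshape(-1, logits.size(-1))  # (B*(T-1), V)
    targets = token_ids[:, 1:].reshape(-1)                  # (B*(T-1),)
    # Same negative log likelihood as before, but over the entire vocabulary
    # at every position rather than a handful of task classes.
    return F.cross_entropy(preds, targets)

# Tiny stand-in so the sketch runs end to end: a random "language model".
vocab_size = 1000
dummy_lm = lambda ids: torch.randn(ids.size(0), ids.size(1), vocab_size)
tokens = torch.randint(0, vocab_size, (4, 32))
print(causal_lm_loss(dummy_lm, tokens))
```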
Right, that is a good distinction. Fair enough. I still stand by the point that you could end up training a worse model, depending on the task. Translation and nuanced classification are both cases where I've not seen bespoke models outright beat GPT-4. Although, like I said, a bespoke model could still be good enough given speed and compute requirements.