That’s backwards. New research and ideas are proven on small models. Lots and lots of ideas are tested that way. Good ideas get scaled up to show they still work on medium-sized models. The very best ideas make their way into the code for the next huge training runs, which can cost tens or hundreds of millions of dollars.
Not to nitpick words, but ablation is the practice of stripping out features of an algorithm or technique to see which parts matter and by how much. This is standard (good) practice for any innovation, regardless of size.
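In code form, an ablation study is basically a loop over variants with one feature switched off at a time. A minimal Python sketch, where `train_and_eval` and the feature names are hypothetical stand-ins for an actual training-and-scoring pipeline:

```python
import random

def train_and_eval(config):
    # Hypothetical stand-in: really this would train a model variant
    # with the given features enabled and return a held-out metric.
    return random.uniform(0.5, 0.9)

full_config = {
    "rotary_embeddings": True,
    "gated_mlp": True,
    "dropout": True,
}

baseline = train_and_eval(full_config)
print(f"baseline: {baseline:.3f}")

# Ablate one feature at a time and measure the drop vs. the baseline.
for feature in full_config:
    ablated = {**full_config, feature: False}
    score = train_and_eval(ablated)
    print(f"without {feature}: {score:.3f} ({score - baseline:+.3f})")
```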
Distillation is taking the power / capability / knowledge of a big model and trying to preserve it in something smaller. This also happens all the time, and we see very clearly that small models aren’t as clever as big ones. Small models distilled from big ones might be somewhat smarter than small models trained on their own. But not by much. Mostly people like distillation because it’s easier than carefully optimizing the training for a small model. And you’ll never break new ground on absolute capabilities this way.
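For reference, the core of distillation is just an extra loss term that pushes the small model toward the big model’s output distribution. A minimal sketch of one training step, assuming PyTorch and pre-built `teacher` / `student` models, an `optimizer`, and a `(inputs, labels)` batch:

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, optimizer, batch, T=2.0, alpha=0.5):
    inputs, labels = batch
    with torch.no_grad():
        teacher_logits = teacher(inputs)  # the frozen big model
    student_logits = student(inputs)

    # Soft targets: match the teacher's temperature-smoothed distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Hard targets: still learn from the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    loss = alpha * soft_loss + (1 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The temperature `T` softens both distributions so the student can learn from the teacher’s relative preferences over wrong answers, not just its top pick.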
> Not to nitpick words, but ablation is the practice of stripping out features of an algorithm ...
Ablation generally refers to removing parts of a system to see how it performs without them. In the context of an LLM it can refer to training data as well as the model itself. I'm not saying it'd be the most cost-effective method, but one could certainly try to create a small coding model by starting with a large one that performs well, and seeing what can be stripped out of the training data (obviously a lot!) without hurting performance.
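Roughly like this, as a sketch (the corpus, sources, and `train_and_eval` are all hypothetical, and a real run would be enormously more expensive):

```python
import random

def train_and_eval(docs):
    # Hypothetical stand-in for training on `docs` and scoring on a
    # held-out coding benchmark.
    return random.uniform(0.4, 0.8)

corpus = [
    {"source": "github", "text": "..."},
    {"source": "stackoverflow", "text": "..."},
    {"source": "web_crawl", "text": "..."},
]

# Data ablation: drop one data source at a time and retrain, to see
# which slices of the corpus the model actually needs.
baseline = train_and_eval(corpus)
for source in {doc["source"] for doc in corpus}:
    subset = [doc for doc in corpus if doc["source"] != source]
    score = train_and_eval(subset)
    print(f"without {source}: {score:.3f} vs baseline {baseline:.3f}")
```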
ML researchers will sometimes vary the size of the training data set to see what happens. It’s not common, except in scaling-law research. But it’s never called “ablation”.