I'm not saying that smaller-model task specialization is a bad thing. If anything, the line of research Orca kicked off, using more complex models to jumpstart the fine-tuning of much smaller ones, is probably my pick for the most important ML research trend of 2023.

But even in the example you bring up and your comments on it, I'd strongly recommend considering Goodhart's Law: once a handful of measurements becomes the target, and other capabilities are thrown away to improve model scores on those measurements, that doesn't necessarily represent a path to best-in-class production feasibility.

I can imagine a number of edge cases where a customer service bot's lack of basic math capabilities could lead to issues ("Did the package come with at least 4 screws?" "It only had three" "Ok, great - I don't see any issues with the shipment and am denying the return request").

Further, many qualities that probably do matter for applications like customer service, such as patience, empathy, or de-escalation, don't happen to be part of the measurements any LLMs are being optimized to hit (even though they are almost certainly represented, at least in part, in the pretrained models, given their presence in the training data).

We've become a bit too focused on optimizing LLMs around measurements that reflect our anchoring biases about what AI should look like as imagined decades ago, rather than evaluating the starting point and the use cases as they actually occur, as we might for a tool by any other name.

Though this is all an entirely different issue from whether different model sizes require their own best practices.



