Several recently trained smaller models have repeatedly been shown to perform comparably to much larger ones. The Mixture of Experts architecture, meanwhile, makes it possible to train large models that selectively activate only the experts relevant to the current input, which drastically reduces compute demand at inference time. Smaller models can also level the playing field by processing content retrieved via RAG more quickly, and through a similar mechanism they could call on larger, more powerful models for tasks that exceed their own capabilities.
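As a rough illustration of that selective activation, here is a minimal sketch of a top-k MoE layer in PyTorch. The class name `TopKMoE`, the expert count, and the hidden sizes are illustrative assumptions rather than any particular model's implementation; the point is simply that only `k` of the experts run for each token, so per-token compute scales with `k` instead of the total number of experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of a sparse Mixture-of-Experts layer: only k experts run per token."""
    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        # Each expert is an ordinary feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)  # router that scores experts per token
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim) -- batch dimensions flattened for simplicity.
        scores = self.gate(x)                           # (tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)      # keep only the k best experts per token
        weights = F.softmax(weights, dim=-1)            # normalize the selected scores
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

# Usage sketch: 16 tokens of width 512, with only 2 of 8 experts active per token.
layer = TopKMoE(dim=512, num_experts=8, k=2)
y = layer(torch.randn(16, 512))
```

Production MoE implementations batch this routing rather than looping over experts, but the effect is the same: each token touches only a small fraction of the model's total parameters.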