All large language models (GPT-2/3, GPT-Neo, Turing, Gopher) use essentially the same architecture with some light variations, and the same datasets, again with some light variations in how filtering is done, etc.
As such there is no reason to expect them to be very different in terms of efficiency, and it has been shown and well researched that scaling the number of parameters directly correlates with improved model quality.
So as long as you are comparing GPT-style models to other GPT-style models, parameter count is definitely not a vanity metric.
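To make the "more parameters, better quality" point concrete, here is a rough sketch of the power-law form reported in the scaling-law literature (Kaplan et al. 2020), where loss falls smoothly as parameter count grows. The constants are the approximate fitted values from that paper and the example model sizes are only illustrative, not exact figures for any specific model.

```python
# Sketch of the dense-model scaling law L(N) ~ (N_c / N) ** alpha_N
# from Kaplan et al. (2020). Constants are approximate fitted values;
# treat the outputs as illustrative trends, not precise predictions.

ALPHA_N = 0.076   # fitted exponent (approximate)
N_C = 8.8e13      # fitted constant, in non-embedding parameters (approximate)

def predicted_loss(n_params: float) -> float:
    """Predicted test loss for a dense GPT-style model with n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

# Roughly GPT-2, GPT-3, and Gopher sized dense models.
for n in (1.5e9, 175e9, 280e9):
    print(f"{n:.1e} params -> predicted loss ~ {predicted_loss(n):.2f}")
```

The point is simply that, for dense GPT-style models, the curve is monotone: more parameters means lower loss, so comparing parameter counts is comparing something real.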
This doesn't hold once you start comparing to, e.g., mixture-of-experts models, which were making headlines recently with trillion-parameter claims. In MoE models, parameter count is pretty much a useless metric.
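The reason is that only the routed experts run for each token, so per-token compute tracks the *active* parameters, not the total. A minimal sketch of that gap for a hypothetical top-k MoE (all sizes below are made up for illustration, not taken from any published model):

```python
# Total vs. per-token-active FFN parameters in a simple top-k MoE.
# Hypothetical dimensions chosen only to illustrate the gap.

def moe_param_counts(n_layers: int, d_model: int, d_ff: int,
                     n_experts: int, top_k: int) -> tuple[float, float]:
    """Return (total, active-per-token) FFN parameter counts."""
    ffn_params = 2 * d_model * d_ff            # one expert's FFN weights
    total = n_layers * n_experts * ffn_params  # every expert counted in the headline number
    active = n_layers * top_k * ffn_params     # experts a single token actually passes through
    return total, active

total, active = moe_param_counts(n_layers=24, d_model=4096, d_ff=16384,
                                 n_experts=128, top_k=2)
print(f"total FFN params:       {total / 1e9:.1f}B")
print(f"active FFN per token:   {active / 1e9:.1f}B")
```

With these made-up numbers the headline count is in the hundreds of billions while each token only touches a few billion, which is why a trillion-parameter MoE is not comparable to a trillion-parameter dense model.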