
What's interesting about minimizing combined training + (model lifetime) inference cost is that the optimum looks different for different companies, depending on their inference volume...

Meta has a massive user base, and if they use these models to run their own business, that implies massive inference volume. It may therefore make more economic sense for them to put extra money into training (to make smaller/cheaper models more powerful) than it would for companies with lower inference volume.

To put it another way: if their internal use of these models is very high, it wouldn't be surprising to see Meta continue to release models that, size for size, beat the competition, since they are incentivized to pump more tokens through them during training.
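The tradeoff above can be sketched with a toy cost model. All figures below are invented for illustration (not real Meta numbers): the "overtrained small" model costs more to train to the same quality but is cheaper per inference token, so it wins once lifetime inference volume is high enough.

```python
def lifetime_cost(training_cost, cost_per_million_tokens, inference_million_tokens):
    """One-off training cost plus inference cost over the model's lifetime."""
    return training_cost + cost_per_million_tokens * inference_million_tokens

# Hypothetical figures: a large model that is cheap to train but expensive
# to serve, vs. a small overtrained model that is expensive to train but
# cheap to serve.
big   = {"training_cost": 50e6,  "cost_per_million_tokens": 2.0}
small = {"training_cost": 150e6, "cost_per_million_tokens": 0.5}

for volume in (10e6, 100e6):  # lifetime inference volume, in millions of tokens
    c_big   = lifetime_cost(big["training_cost"],   big["cost_per_million_tokens"],   volume)
    c_small = lifetime_cost(small["training_cost"], small["cost_per_million_tokens"], volume)
    winner = "small/overtrained" if c_small < c_big else "large"
    print(f"{volume:.0e} M tokens: big=${c_big:.3g}, small=${c_small:.3g} -> {winner}")

# Break-even volume: extra training cost / per-token savings
break_even = (small["training_cost"] - big["training_cost"]) / \
             (big["cost_per_million_tokens"] - small["cost_per_million_tokens"])
print(f"break-even at {break_even:.3g} M tokens")
```

At low volume the cheap-to-train large model wins; past the break-even point the extra training spend on the small model pays for itself, which is the economics the comment is describing.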




Huge resources are being spent on these models at Meta. Some very interesting software will come out of there in the next decade.



