As LLMs are productionised and commodified, the changes they pick up tend to be enthusiast-unfriendly. Small dense models are great for enthusiasts running inference locally, but MoE models are much more efficient for parallel batched inference.
Of course people don't say it, but there are many cases where reported algorithmic improvements are attributable to poor baseline tuning or shoddy statistical treatment. Tao is exhibiting a lot more epistemic humility than most researchers, who probably have stronger incentives to market their work and publish.
Inferior in what sense? Genie 3 is addressing a fundamentally different problem to a physics sim or procgen: building a good-enough (and broad-enough) model of the real world to train agents that act in the real world. Sims are insufficient for that purpose, hence the "sim2real" gap that has stymied robotics development for years.
Genie 3 is inferior in the sense you just described: the sim2real gap would be greater, because it's a less accurate model of the aspects of the world that are relevant to robotics.
The reports are definitely bland, but I find them very helpful for discovering sources. For example, if I'm trying to answer an academic question like "has X been done before," sending something to scour the internet and find me examples to dig into is really helpful - especially since LLMs have some base knowledge which can help with finding the right search terms. It's not doing all the thinking, but those kinds of broad overviews are quite helpful, especially since they can just run in the background.
I do that too. I wonder how much of it is the LLM being helpful, and how much is the RAG pipeline somehow surfacing better references for the LLM than a Google search would?
Generally you train all the experts simultaneously. The benefit of MoEs is cheap inference: you only use the active expert parameters, which constitute a small fraction of the total parameter count. For example, DeepSeek R1 (which is especially sparse) only activates about 1/18th of its total parameters per token.
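To make the routing concrete, here's a minimal top-k gating sketch in PyTorch (a toy illustration, not DeepSeek's actual architecture; names like `num_experts` and `top_k` are just for the example). Each token only passes through `top_k` of the experts' FFN weights, which is where the "small fraction of parameters per token" comes from:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-k MoE layer: each token runs through only k of n experts."""
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                              # x: (tokens, d_model)
        logits = self.router(x)                        # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1) # choose k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

x = torch.randn(16, 64)   # 16 tokens
y = TinyMoE()(x)          # only 2 of 8 experts' weights touch each token
```

During training, gradients only flow into the experts each token was routed to (plus the router), so all experts are trained "simultaneously" across a batch even though each token only sees a few of them.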
That's an interesting idea - it sounds similar to the principles behind low-precision models like BitNet (where each weight is +1, -1, or 0).
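For a sense of what that looks like, here's a rough sketch of absmean-style ternary quantization in the spirit of BitNet b1.58 (my approximation, not the exact published recipe): scale by the mean absolute weight, then round every entry to one of {-1, 0, +1}.

```python
import torch

def ternarize(w: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Absmean-style ternary quantization (BitNet-flavoured sketch):
    scale by the mean |weight|, then round each entry to -1, 0, or +1."""
    gamma = w.abs().mean()
    return (w / (gamma + eps)).round().clamp_(-1, 1)

w = torch.randn(4, 4)
print(ternarize(w))   # every entry is now in {-1, 0, +1}
```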
That said, I know DeepSeek accumulate their gradient updates in fp32 even though they use fp8 for inference, and a recent paper shows that RL + LLM training is shakier in bf16 than fp16. Both suggest that numerical precision in the gradients still matters.
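The bf16 vs fp16 point comes down to mantissa bits (bf16 has 7, fp16 has 10), which you can see directly with a small PyTorch check on a gradient-sized update (illustrative numbers, not from any particular training run):

```python
import torch

w, delta = 1.0, 1e-3   # weight and a small gradient-sized update

for dtype in (torch.float16, torch.bfloat16, torch.float32):
    updated = (torch.tensor(w, dtype=dtype) + torch.tensor(delta, dtype=dtype)).item()
    print(dtype, updated)

# fp16 (10 mantissa bits) resolves 1.0 + 1e-3 reasonably well;
# bf16 (7 mantissa bits) rounds the update away entirely near 1.0;
# fp32 has no trouble, which is why mixed-precision training usually
# keeps fp32 master weights for the optimizer update.
```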
Cutting people at FAIR is a real shame though. Great models like DINO and SAM have had a massive positive impact; hopefully that work doesn't slow in favour of LLM-only development at MSL.