Unfortunately, around half of his arguments (the "LLMs can't do that" and "LLMs only output text" half) are already obsolete thanks to PaLM-E (which is an LLM, augmented to accept vision input, and given direct control of a robot):
PaLM-E doesn't have much in the way of agency, but it is certainly an agent with a world model and the ability to act within it.
LLMs are proving broadly capable and flexible, and we're only scratching the surface so far. Something else (or several somethings) will likely supersede them on the way to AGI, but not because LLMs aren't actually getting us closer (even though we can't tell how much closer). AlexNet got us closer, GANs got us closer, etc. Now transformer-based LLMs are too.
https://arstechnica.com/information-technology/2023/03/embod...