It's pretty much looking like anything can be extracted from language. Some things are harder than others for sure, but with enough scale it does look like eventually everything falls. Text-only GPT-4 has a pretty solid understanding of space that 3.5 definitely lacks. You can see more thorough experiments in the Microsoft AGI paper, where they test its ability to track the visual space of a maze.
There is such a thing as a text-only GPT-4 lol. It wasn't trained to be multimodal from scratch: first a text-only version was trained, and then it was made multimodal somehow (the details are unknown, but making a text-only LLM multimodal isn't new, e.g. PaLM, Flamingo, BLIP-2, FROMAGe). The text-only version exists and is what the Microsoft researchers had access to.
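To make the "bolt vision onto a frozen text LLM" idea concrete, here's a minimal sketch of the adapter approach used (in far more sophisticated forms) by systems like Flamingo and BLIP-2: a frozen image encoder's features are projected into the LLM's embedding space and prepended as "soft" visual tokens. All dimensions, names, and the use of a plain linear projection here are illustrative assumptions, not the actual GPT-4 or Flamingo/BLIP-2 architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not real model sizes)
VISION_DIM = 512      # output size of a frozen image encoder
LLM_DIM = 1024        # embedding size of a frozen text-only LLM
N_VISUAL_TOKENS = 4   # number of "soft" visual tokens fed to the LLM

# The simplest adapter is a learned linear projection that maps
# frozen vision features into the LLM's token-embedding space.
W = rng.normal(scale=0.02, size=(VISION_DIM, N_VISUAL_TOKENS * LLM_DIM))

def project_image(image_features: np.ndarray) -> np.ndarray:
    """Map a (VISION_DIM,) feature vector to N_VISUAL_TOKENS LLM embeddings."""
    return (image_features @ W).reshape(N_VISUAL_TOKENS, LLM_DIM)

# Stand-ins for a frozen encoder's output and the LLM's text-token embeddings
image_features = rng.normal(size=(VISION_DIM,))
text_embeddings = rng.normal(size=(7, LLM_DIM))  # e.g. 7 text tokens

# The multimodal "prompt" is just visual tokens prepended to text tokens;
# the frozen LLM then attends over both. Only W would be trained.
visual_tokens = project_image(image_features)
sequence = np.concatenate([visual_tokens, text_embeddings], axis=0)
print(sequence.shape)  # (11, 1024)
```

The point is that both model components can stay frozen; only the small projection is trained, which is why retrofitting multimodality onto an existing text-only model is feasible.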
It has been; it was in the Microsoft research paper "Sparks of AGI". You can watch the lead author of the paper, Sebastien Bubeck, present it here: https://youtu.be/qbIk7-JPB2c
It's a good video for understanding GPT-4 as a "what are we sure LLMs are technically capable of?" exercise. As he notes right at the start of the video, the model was made safe before the public release, which significantly lowered its performance, so the examples he shows aren't replicable in the different model the public has access to.
I see that you are probably referring to the claim at 4:30... but I'm not sure whether he is actually saying that the early model had no image capability, or whether that merely was not something they were given access to.