... and perhaps an LLM that shares a latent space with the visual model, since those are apparently (and surprisingly) easy to map onto each other (at least per that one paper that popped up the other day).
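To make the "easily mapped" claim concrete: if two models embed the same concepts in a roughly linear relationship, a plain least-squares fit on a handful of paired anchor embeddings is enough to recover the map. This is a toy sketch with synthetic embeddings, not the method from any particular paper; the dimensions and noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# Pretend "model B" embeds concepts as a (noisy) linear image of "model A".
true_map = rng.standard_normal((d, d))
A = rng.standard_normal((64, d))                           # anchor embeddings, model A
B = A @ true_map.T + 0.01 * rng.standard_normal((64, d))   # same concepts, model B

# Fit B ~= A @ M from the anchors alone.
M, *_ = np.linalg.lstsq(A, B, rcond=None)

# The fitted map generalizes to an embedding it never saw.
held_out = rng.standard_normal(d)
err = np.linalg.norm(held_out @ M - true_map @ held_out)
print(err < 0.5)
```

The interesting part is how few anchor pairs you need when the relationship really is close to linear.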
The bit that's missing is online learning. There's only so much you can keep bouncing around in working memory (the context windows of all the component models) - eventually you want to "fix" some of that context by altering the models' weights (a kind of gradual fine-tuning?).
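A minimal sketch of that consolidation idea, with a toy linear model standing in for the real thing: "working memory" is a buffer of recent (input, target) pairs, and a few gradient steps bake them into the weights so the buffer can be cleared. Everything here (the buffer, the model, the learning rate) is an invented placeholder, not a real fine-tuning recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Working memory": recent context held as explicit (input, target) pairs.
buffer = [(rng.standard_normal(8), rng.standard_normal(8)) for _ in range(16)]

# The "model" is just a weight matrix for this sketch.
W = rng.standard_normal((8, 8)) * 0.1

def loss(W):
    return sum(float(np.mean((W @ x - y) ** 2)) for x, y in buffer) / len(buffer)

def consolidate(W, lr=0.01, steps=200):
    """Gradual fine-tune: shift the buffered context into the weights."""
    for _ in range(steps):
        for x, y in buffer:
            grad = 2 * np.outer(W @ x - y, x) / len(x)  # dMSE/dW for one pair
            W = W - lr * grad
    return W

before = loss(W)
W = consolidate(W)
after = loss(W)
print(after < before)  # True: the "memories" now live in the weights
```

The real version would presumably use something gentler than raw SGD on everything (adapters, replay buffers, etc.) to avoid clobbering what the model already knows.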
Our brains have different areas with different functions… so why wouldn't a good AI have them too?
Maybe an LLM for an internal monologue, maybe two or three that debate each other realistically, then a computer vision model to process visual input…
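The debate part is really just an orchestration loop over a shared transcript. Here's a skeleton of that pattern, where `agent_a` and `agent_b` are hypothetical stand-ins for actual LLM calls (in a real system each would send the transcript to a model and return its reply):

```python
def agent_a(transcript):
    # placeholder: a real agent would call an LLM with the transcript as context
    return f"A (turn {len(transcript)}): counterpoint to the last message"

def agent_b(transcript):
    return f"B (turn {len(transcript)}): rebuttal to the last message"

def debate(opening, rounds=3):
    """Alternate two agents over a shared transcript for a fixed number of rounds."""
    transcript = [opening]
    agents = [agent_a, agent_b]
    for turn in range(rounds * 2):
        transcript.append(agents[turn % 2](transcript))
    return transcript

for line in debate("Moderator: should the model consolidate context into weights?"):
    print(line)
```

The vision model and the internal monologue would plug into the same loop by appending their outputs to the shared transcript.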