It's more like a 60/40 problem right now. Generating reasonable-sounding text was a huge problem for decades, but it now looks like a relatively minor part of the overall challenge compared to understanding what those words mean in any deeper sense than "the likely sequence that follows a prompt".
That said, I am impressed with how well the models perform when trained purely in a Chinese Room format. I think they have gleaned some understanding of some systems, beyond just being a super-powered Markov chain.
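For concreteness, this is roughly the "Markov chain" baseline I have in mind: a toy bigram sampler that only tracks which word tends to follow which. A minimal sketch in Python, where the corpus and names are just placeholders I made up:

    import random
    from collections import defaultdict

    # Tiny placeholder corpus; a real chain would be trained on far more text.
    corpus = "the cat sat on the mat and the dog sat on the rug".split()

    # Record, for each word, the words observed to follow it.
    transitions = defaultdict(list)
    for current, nxt in zip(corpus, corpus[1:]):
        transitions[current].append(nxt)

    def generate(start, length=8):
        """Sample a sequence by repeatedly picking an observed successor."""
        word = start
        out = [word]
        for _ in range(length):
            followers = transitions.get(word)
            if not followers:
                break
            word = random.choice(followers)
            out.append(word)
        return " ".join(out)

    print(generate("the"))  # e.g. "the cat sat on the rug"

Something like that can produce locally plausible strings with no understanding anywhere in it; the open question is how much more than this is going on inside an LLM.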
I'll admit that's possible. But if that understanding is encoded in a set of weights that happen to generate a good answer, how can we tell? I honestly think explainability is the most important problem in AI right now. And not explainability in the sense of the model generating a plausible series of words about its own answer, but genuine introspective explainability. If the models get to superhuman levels, maybe we won't need that, but until we have one or the other I don't know how we can demonstrate anything beyond a super-powered Markov chain.