I work in AI and would roughly agree with it to first order.
For me the key breakthrough has been seeing how large transformers trained with big datasets have shown incredible performance in completely different data modalities (text, image, and probably soon others too).
This was absolutely not expected by most researchers 5 years ago.
The problem here is that transformers aren't always the best tool. They're powerful because they can capture large receptive fields, but ConvNeXt shows that convolutions can still outperform transformers on some tasks. There are plenty of modern CNNs with both small and large receptive fields that perform better than transformers. But of course there is no one-size-fits-all architecture, and it's fairly naive to think there is. I think what was more surprising about transformers is that you could scale them well and perform complex tasks with them alone.
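To make the receptive-field contrast concrete, here's a toy sketch (my own illustration, numbers not from this thread): the receptive field of stacked stride-1 convolutions grows linearly with depth, while a single self-attention layer already attends over every position.

```python
# Toy illustration: receptive field (RF) of n stacked stride-1 convolutions
# with kernel size k is RF(n) = n * (k - 1) + 1, i.e. it grows linearly,
# while one self-attention layer sees all tokens at once.

def conv_receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """RF of `num_layers` stacked stride-1 convs with square kernels."""
    return num_layers * (kernel_size - 1) + 1

# How many 3x3 conv layers until a 224-pixel-wide input is fully covered?
layers_needed = next(n for n in range(1, 1000)
                     if conv_receptive_field(n) >= 224)
print(layers_needed)  # 112 stacked 3x3 convs to reach global context

# A self-attention layer, by contrast, is "global" at depth 1, which is
# one reason transformers capture long-range structure so readily.
```

This is also why modern CNNs mix small kernels with large-kernel or dilated stages: they buy a big receptive field without needing attention everywhere.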
But I think a big part you're ignoring here is how training has changed in the last 5 years. Compare something like VGG, where the advance was essentially going deeper with CNNs, to something like Swin, and there's a ton more complexity in the latter (and difficulties in reproducing its results, likely because of that). There's vastly more use of data augmentation, different types of normalization (see ConvNeXt), better learning-rate schedules, and massive compute allows a much better hyperparameter search (I also want to note that there are problems in this area, as many are HP-searching over the test set rather than a validation set).
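On the test-set-leakage point, here's a minimal sketch of the correct protocol (the data and the one-parameter "model" are hypothetical, purely to show where the splits are used): hyperparameters are chosen on a validation split, and the test split is touched exactly once afterwards.

```python
import random

random.seed(0)

# Hypothetical dataset of (x, label) pairs; the "model" is a single
# threshold hyperparameter, just to illustrate the split protocol.
data = [(random.random(), i % 2) for i in range(300)]
train, val, test = data[:200], data[200:250], data[250:]

def accuracy(threshold: float, split) -> float:
    return sum((x > threshold) == bool(y) for x, y in split) / len(split)

# Correct protocol: search hyperparameters on the VALIDATION split only...
candidates = [i / 10 for i in range(1, 10)]
best = max(candidates, key=lambda t: accuracy(t, val))

# ...then evaluate on the TEST split once, after selection is frozen.
test_acc = accuracy(best, test)

# HP-searching directly on `test` (the failure mode the comment above
# describes) would make `test_acc` an optimistic, unreproducible number.
```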
Yeah, the field has come a long way and I fully expect it to keep growing like crazy, but AGI/HLI is a vastly more complex problem, and we have a lot of evidence that the current methods won't get us there.
I wonder if combining different approaches works in current AI research. I suppose it should.
I suppose something approaching AGI, an intelligent agent which can act on a _wide_ variety of input channels and synthesize these inputs into a general model of the world, could use channel-specific approaches / stages before the huge "integrating" transformer step.
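A minimal sketch of that idea (every name, dimension, and the mean-pool "integrator" here is my own hypothetical stand-in, not a real system): per-channel encoders project each modality into a shared token space, and one integrating step consumes the combined sequence.

```python
# Sketch of channel-specific stages feeding one integrating step.
# The encoders and the pooling "integrator" are hypothetical stand-ins
# for real modality encoders and a real transformer.

EMBED_DIM = 8

def encode_text(tokens):
    # Stand-in text encoder: hash each token into a fixed-size embedding.
    return [[float((hash(t) >> s) % 7) for s in range(EMBED_DIM)] for t in tokens]

def encode_image(pixels):
    # Stand-in image encoder: one "patch token" per pixel value.
    return [[p] * EMBED_DIM for p in pixels]

def integrate(token_seqs):
    # Stand-in for the big transformer: concatenate all modality tokens
    # and pool them into one joint representation of the world state.
    tokens = [tok for seq in token_seqs for tok in seq]
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(EMBED_DIM)]

world_state = integrate([encode_text(["a", "cat"]), encode_image([0.1, 0.9])])
print(len(world_state))  # prints 8: one shared EMBED_DIM-wide vector
```

The design point is that each encoder can use whatever architecture suits its channel (conv stages for pixels, tokenizers for text), as long as they all emit tokens the integrating step understands.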
The problem is, the POI applies to literally everything, including events which seem to have well-understood and well-established causes. That makes it impossible to know when it's going to be applicable in the real world and when it's not.