A computer vision system can have multiple ways of processing an image. So at the limit, it could interpret a scene in terms of what a human sees and also have a separate, better understanding of the scene.
The OP shows that computers DO NOT have a better understanding. It's evident they have no understanding at all; they are simply doing math on pixels and latching on to coincidental patterns of color or shading.
People recognize things by building a 3D model in their head, then comparing that to billions of experiential models, finding a match and then using cognition to test that match. "Is that a bird? No, it's just a pattern of dog droppings smeared on a bench. Ha ha!"
How could I have better put "so at the limit, it could"?
I meant to talk about what some hypothetical future system could do (which I think was a reasonable context given the comment I replied to), not to characterize current systems.
To get there, computers will clearly have to change their approach utterly. A cascade of quick math followed by a more 'cognitive' pass over the possible matches could definitely improve on the current state of affairs.
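Purely as a sketch of the kind of cascade I mean (every name here is invented for illustration, not a real library; assume fast_model and verifier are whatever recognizer objects you have):

    # Hypothetical cascaded recognizer: a fast, cheap model proposes candidate
    # labels, then a slower "cognitive" stage re-scores each candidate against
    # the whole scene before anything is accepted.
    def cascaded_recognize(image, fast_model, verifier, k=5, threshold=0.9):
        # Stage 1: quick math -- one cheap forward pass, keep the top-k guesses.
        candidates = fast_model.top_k(image, k)           # [(label, score), ...]

        # Stage 2: for each candidate, ask a costlier model whether the label
        # actually makes sense given context, geometry, surroundings, etc.
        verified = []
        for label, score in candidates:
            plausibility = verifier.check(image, label)   # 0.0 .. 1.0
            verified.append((label, score * plausibility))

        verified.sort(key=lambda pair: pair[1], reverse=True)
        best_label, best_score = verified[0]

        # Refuse to answer rather than latch onto a coincidental pixel pattern.
        return best_label if best_score >= threshold else None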
>People recognize things by building a 3D model in their head, then comparing that to billions of experiential models, finding a match and then using cognition to test that match. "Is that a bird? No, it's just a pattern of dog droppings smeared on a bench. Ha ha!"
So you're saying people are generative reasoners with very general hypothesis classes rather than discriminative learners with tiny hypothesis classes.
To which the obvious response is: yes, we know that. The question is how to make a computational version of general, generative learning work fast and well.
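To unpack the jargon with a toy example: below, Gaussian naive Bayes plays the generative learner (it models how each class produces its features) and logistic regression plays the discriminative one (it only learns a boundary). The data is made up and this is just the textbook distinction, not a claim about how any vision system is actually built:

    import numpy as np
    from scipy.stats import norm
    from sklearn.naive_bayes import GaussianNB            # generative
    from sklearn.linear_model import LogisticRegression   # discriminative

    rng = np.random.default_rng(0)
    # Two made-up classes: one clustered near (0, 0), the other near (3, 3).
    X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
    y = np.array([0] * 200 + [1] * 200)

    gen = GaussianNB().fit(X, y)
    disc = LogisticRegression().fit(X, y)

    weird = np.array([[40.0, -40.0]])   # nothing like either class

    # Discriminative: just picks a side of the boundary, usually confidently.
    print("discriminative p(class | x):", disc.predict_proba(weird))

    # Generative: we can also ask how likely the input is under each class
    # model (naive Bayes assumes independent features, so it factorises).
    for i, c in enumerate(gen.classes_):
        log_lik = norm.logpdf(weird, gen.theta_[i], np.sqrt(gen.var_[i])).sum()
        print(f"log p(x | class {c}) =", log_lik)
    # Both log-likelihoods come out astronomically low, i.e. "this isn't
    # either thing" -- a judgment a purely discriminative model never makes.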
People are far more than that. Lots of our brain is dedicated to visual modeling. Those 'hypothesis classes' are just the tip of the iceberg. For computers, they're the whole enchilada. To mix metaphors.
We can't help but build real models of what we see - our retina/optic nerve are already doing this before our brain even receives the 'image'!
I can't help but believe some of the image recognition mentioned in your article, especially of icons, is built through previous experience with similar iconic images. Symbols for things become associated with the real things. It's a modern adaptation of a much older processing mechanism.
OK... but how is that pattern-matching different from what the computer is doing? Why is human pattern-matching "understanding" while computer pattern-matching is not?
It's the second stage of cognitive engagement that makes humans different. Of course a field of static isn't a panda. The computer has no capacity to recognize the context.
I think I get your point now. It's OK if a human momentarily mistakes a random blob for a panda, but they should be able to figure out from other visual cues and context that it's not a panda. And it's that second part that's missing from the computer models?
That would be interesting; it could flag inputs that are ambiguous to humans but not to machines (or vice versa, or wherever there's a discrepancy at all), since a discrepancy could suggest that something shady is happening.
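Something like this, very roughly; here disagreement between two independent models stands in for the human/machine comparison, and all the names and thresholds are made up:

    import numpy as np

    def flag_suspicious(probs_a, probs_b, confidence=0.9, margin=0.5):
        """probs_a, probs_b: per-class probability vectors from two recognizers."""
        top_a, top_b = int(np.argmax(probs_a)), int(np.argmax(probs_b))
        conf_a, conf_b = probs_a[top_a], probs_b[top_b]

        if top_a != top_b and max(conf_a, conf_b) >= confidence:
            return "disagreement"        # confident models, different answers
        if abs(conf_a - conf_b) >= margin:
            return "ambiguity mismatch"  # one finds it obvious, the other doesn't
        return None                      # nothing shady detected

    # Example: model A is sure about class 0, model B finds the input ambiguous.
    print(flag_suspicious(np.array([0.98, 0.01, 0.01]),
                          np.array([0.40, 0.35, 0.25])))   # "ambiguity mismatch"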