That’s impressive work. Still don’t think we have reached human level for all the categories of things we see in images. But you are correct that my comment about 1k categories is not true for many production systems.
Whilst there are plenty of things in CV that computers aren't super-human at yet, object classification (given 100+ examples) is not one of them. In datasets with tens of thousands of categories, humans are much worse than computers - e.g. humans are really not good at knowing the difference between every type of mushroom, algae, and model of airplane.
Further, nearly every time a computer recently has been trained to do some very nuanced classification, such as in radiology, they exceed human expert performance.
(Outside of classification, computers are rapidly making progress - for instance they are getting surprisingly good at predicting the next few frames of a video, which requires a lot of "world knowledge" to do correctly.)
Definitely not close to having things work for all categories. As you scale up to more categories ambiguity and specificity becomes an issue. Clarifai has a nice demo of their model which has >10K classes, https://clarifai.com/demo , the top predictions are usually correct but not always the most relevant.
I only linked to the xception paper because it mentions JFT. It's not state of the art for large scale recognition.
It's not just a matter of detecting objects and locating them. The deeper computer vision problem is to identify object attributes, relations between objects and actions in video. It's much harder to do that because many relations appear in very diverse situations, with objects of different categories, so it's hard to have 1000's of examples for each class of relation.
For example, humans can identify a monkey riding a Segway on the airport runway, but there probably is no such thing in the training set, even if it is quite large. The neural net might not know if that constitutes a "riding" action because it has never seen such a combination. Maybe the monkey is jumping over the thing and the picture shows it in proximity to it, not riding it - a human would know that a slight gap means there is no riding taking place.
Then, the even harder problem is to predict the consequences of actions on objects and just to physically simulate the scene. Such knowledge is useful in robot action planning. Beyond computer vision, there is also a need to create a "mental simulator" that has theory of mind and can simulate other agents (what humans intend), and we need simulators, both physical and mental to create the next level of AI.
The link only works if you are logged in, otherwise it says the page not found, which is wrong message because it makes you think it doesn't exists even if you login.
https://arxiv.org/abs/1610.02357