Then you would know that visual question answering performs quite poorly. It makes for a nice demo, but it's easy to find entire areas of questions where the models completely fail. The questions have to be worded very carefully. The models are quite finicky and dumb. For example, dynamic memory nets aren't the state of the art - the state of the art for these datasets doesn't use anything nearly as complicated as memory, etc.