This is not quite human-level question-answering in the everyday sense of those words. The ZDNet headline is too clickbaity for my taste.
The answer to every question in the test is a preexisting snippet of text, or "span," from a corresponding reading passage shown to the model. The model has only to select which span in the reading passage gives the best answer -- i.e., which sequence of words already in the text best answers the question.[a]
[a] If this explanation isn't entirely clear to you, it might help to think of the problem as a challenging classification task in which the number of possible classes for each question is equal to the number of possible spans in the corresponding reading passage.
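To make the classification framing concrete, here's a toy sketch (the function name and example passage are my own, not from the paper) that enumerates every contiguous token span in a passage; that set of spans is the candidate "class" set a span-selection model scores over:

```python
def candidate_spans(passage):
    # Every contiguous sequence of tokens is a candidate answer "class".
    tokens = passage.split()
    return [" ".join(tokens[i:j])
            for i in range(len(tokens))
            for j in range(i + 1, len(tokens) + 1)]

passage = "ABC launched as a radio network"
spans = candidate_spans(passage)
# A 6-token passage yields 6*7/2 = 21 candidate spans,
# one of which ("radio network") is the answer.
print(len(spans))                # 21
print("radio network" in spans)  # True
```

A real passage of a few hundred tokens yields tens of thousands of candidate spans, which is why the task is hard as a classification problem but still much easier than generating an answer from scratch.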
Before you blink an eye there will be some MBA-types working on PowerPoint proposals with detailed cost-benefit analyses for using those new AI machines they heard about that can read better than human beings. Needless to say, the technology will fall far short of expectations.
This is why there have been two AI winters already.
I think it's incumbent on people like you to get the word out that ML isn't going to put everyone's jobs at risk. Between this and self-driving cars, local governments are beginning to weigh spending tax dollars on these boondoggles instead of on proven modes like public transit.
The futurist writers peddling this stuff need to take a moment to chill and learn about the actual state of the underlying technology.
I'm currently working on improving an existing QA-style dataset and exploring how to move machine reading comprehension away from span selection and toward some kind of reasoning. It turns out that writing questions that require reasoning, in a repeatable way, is quite hard.
> The model has only to select which span in the reading passage gives the best answer -- i.e., which sequence of words already in the text best answers the question.[a]
Sounds like they've reinvented Jeopardy Watson's ability to excel at Q&A, but 12 years later.
Great result. I manage a machine learning team at my job, so I'm pretty much all-in on deep learning for solving practical problems.
That said, I think the path to 'real' AGI lies in some combination of DL, probabilistic graph models, symbolic systems, and something we have not even imagined yet. BTW, Judea Pearl just released a good paper on the limitations of DL: https://arxiv.org/abs/1801.04016
> That said, I think the path to 'real' AGI lies in some combination of DL, probabilistic graph models, symbolic systems, and something we have not even imagined yet.
He just means all the things that we already know don't work + something that we don't know about yet that will. (emphasis on the part we have no fucking clue about).
It would be interesting to know how well some of the entries on the SQuAD leaderboard do on the Winograd Schema Challenge (https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS....). Does anyone know if any of the systems have been tested on that as well?
I am always annoyed at claims in supervised learning that a machine predictor is better than humans. Humans obviously are the ones that scored the dataset to begin with. If you read the paper, it goes on to say, in regards to human evaluation:
> Mismatch occurs mostly due to inclusion/exclusion of non-essential phrases (e.g., monsoon trough versus movement of the monsoon trough) rather than fundamental disagreements about the answer.
I don't think I would call that "error" so much as ambiguity. In other words, there's more than one possible answer to the questions under these criteria -- English isn't a formal grammar where there's always one and only one answer. For instance, here's one of the questions from the ABC Wikipedia page:
> What kind of network was ABC when it first began?
> Ground Truth Answers: "radio network", "radio", "radio network"
> Prediction: October 12, 1943
Because the second human said "radio" instead of "radio network," I believe this would count as a human miss. But the answer is factually correct. Meanwhile, the prediction from the Stanford logistic regression (not the more sophisticated Alibaba model in the article, where I don't think results are published at this detail) is completely wrong. No human could make that mistake. And yet these are treated as equally flawed answers by the EM metric.
And yet this gets headlined as "defeats humans," not "learns to mimic human responses well."
> I am always annoyed at claims in supervised learning that a machine predictor is better than humans. Humans obviously are the ones that scored the dataset to begin with
For some problems, sure. For prediction tasks, on the other hand, you have an actual ground truth that can be compared to a human's a priori prediction.
Neural-net NLP results are rarely about actual intelligence or clever use of latent variables the model figured out; they're more about "pattern matching," which explains why the errors are so different from human errors. The model doesn't actually understand the problem; it finds tricks that exploit regularities in the dataset that we humans can't really see.
Thinking about this further: if the computer answers "radio," it scores a correct answer on the EM metric, even though that same answer counts as a miss for the humans, assuming I am reading the paper correctly. That seems like a bad way to evaluate this.
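For reference, here's a rough sketch of how SQuAD-style exact-match scoring is commonly computed; the function names are mine, and the normalization (lowercasing, stripping punctuation and articles, collapsing whitespace) is an approximation of what the official evaluation script does. The key point is that the prediction gets credit if it matches ANY annotator's answer:

```python
import re
import string

def normalize(text):
    # Lowercase, drop punctuation and articles, collapse whitespace --
    # roughly what the official SQuAD evaluation script does.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truths):
    # Credit is given if the prediction matches ANY ground-truth answer.
    return max(int(normalize(prediction) == normalize(gt))
               for gt in ground_truths)

truths = ["radio network", "radio", "radio network"]
print(exact_match("radio", truths))             # 1: matches the second annotator
print(exact_match("October 12, 1943", truths))  # 0: completely wrong
```

So the machine answering "radio" is scored against the union of human answers, while each individual human is scored against the other humans' answers only, which is the asymmetry being complained about here.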
This is clickbait. Unless models are robust to adversarial examples in SQuAD, such as those described here: https://arxiv.org/abs/1707.07328, a model doing really well on SQuAD doesn't mean much.
At NIPS 2017 there was a system which beat humans in a college QuizBowl competition. In many ways I think that was more impressive than excellent performance on SQuAD.
Yeah, I don't think it is. Year 2017 brought out some really bad humans, and it turns out, you don't really need better algorithms to beat them in linguistics. See for yourself in this video: https://www.youtube.com/watch?v=L0eY5TGEK2I
People who think AMP is a solution and not just another problem need to take this effect into account. Google isn’t just acting like a CDN here, they’re being intentionally misleading.
Despite all the problems with AMP that have been discussed elsewhere, the real zdnet page hijacked my browser, redirecting me to an ad page with no way to get back. In cases like this, I think AMP would be preferable.
I think the parent is concerned that it has the word 'sexy' in the title. In reality I suspect that this is just absentminded SEO tactics, or the result of an automated system. It's even easier to do if English isn't your native language and it just becomes "put these magic words in the title to rank higher."
Huh? That's a very normal looking swimsuit. It just has the word "sexy" in the description. Sometimes children even swim naked! I think you're being too puritanical. It's not child abuse to sell children's clothes.
I should add that pedophilia is a sexuality, not a crime. Not all pedophiles abuse children, and we should stop painting them all with the same broad brush as child abusers.
Yes, not all pedophiles act on their urges, yes, it may be stupid SEO tactics, yes, OP is a prude, but that whole page is just begging to be labeled as pedophilic in nature.
Actual current results:
https://rajpurkar.github.io/SQuAD-explorer/
Paper describing the dataset and test:
https://arxiv.org/abs/1606.05250