Alibaba neural network defeats human in global reading test (zdnet.com)
170 points by ClintEhrlich on Jan 15, 2018 | 46 comments



This is not quite human-level question-answering in the everyday sense of those words. The ZDNet headline is too clickbaity for my taste.

The answer to every question in the test is a preexisting snippet of text, or "span," from a corresponding reading passage shown to the model. The model has only to select which span in the reading passage gives the best answer -- i.e., which sequence of words already in the text best answers the question.[a]

Actual current results:

https://rajpurkar.github.io/SQuAD-explorer/

Paper describing the dataset and test:

https://arxiv.org/abs/1606.05250

[a] If this explanation isn't entirely clear to you, it might help to think of the problem as a challenging classification task in which the number of possible classes for each question is equal to the number of possible spans in the corresponding reading passage.
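
To make [a] concrete, here is a minimal Python sketch of span selection as classification. The scoring function is a toy stand-in of my own invention; real SQuAD systems learn it with a neural network, and nothing here reflects any team's actual model:

    def candidate_spans(tokens, max_len=10):
        # Enumerate every contiguous span of up to max_len tokens;
        # these spans are the "classes" the model chooses among.
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
                yield (i, j)

    def answer(passage, question, score_fn):
        tokens = passage.split()
        # Classification step: pick the highest-scoring span.
        i, j = max(candidate_spans(tokens),
                   key=lambda s: score_fn(tokens[s[0]:s[1]], question))
        return " ".join(tokens[i:j])

    def overlap_score(span_tokens, question):
        # Toy scorer: question-word overlap with a length penalty.
        q_words = set(question.lower().split())
        hits = sum(t.lower().strip(".,?") in q_words for t in span_tokens)
        return hits - 0.1 * len(span_tokens)

    passage = "ABC launched as a radio network on October 12, 1943."
    print(answer(passage, "What kind of network was ABC?", overlap_score))
    # -> "network": a crude heuristic can pick a plausible span
    #    without anything resembling understanding.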


Agreed, but it was the least clickbaity headline I saw about this result.

Compare: "ROBOTS CAN NOW READ BETTER THAN HUMANS, PUTTING MILLIONS OF JOBS AT RISK" http://www.newsweek.com/robots-can-now-read-better-humans-pu...


Jeez...

Before you can blink, there will be MBA types working on PowerPoint proposals with detailed cost-benefit analyses for using those new AI machines they heard about that can read better than human beings. Needless to say, the technology will fall far short of expectations.

This is why there have been two AI winters already.


I think it's incumbent on people like you to get the word out that ML isn't going to put everyone's jobs at risk. Between this and self-driving cars, local governments are beginning to weigh spending tax dollars on these boondoggles instead of on proven modes of transit like public transport.

The futurist writers peddling this stuff need to take a moment to chill and learn about the actual state of the underlying technology.


It's not in their interest to chill and learn. It's in their interest to hype and sell.


There are very few people in the whole ecosystem who take home a bigger paycheck if they chill and learn. Earth gets what Earth pays for.


Fortune news: "Computer AI Can Now Read Better Than You Do" "Alibaba has developed an artificial intelligence model that scored better than humans"

Bloomberg news: "Alibaba says it’s the first time a machine outperformed people" "China’s Plan for World Domination in AI"


Man. When you said that I was really hoping you were being hyperbolic. Then you pasted a link. The world is a sad sad place.


Go ahead and read https://seekingalpha.com/news/3322771-ai-beats-humans-stanfo... The comments many investors make on this achievement are worrying when you consider that many may buy in for this (or this kind of) reason.


I'm currently working on improving an existing QA-style dataset and exploring the best way to move Machine Reading Comprehension beyond span selection and toward some kind of reasoning. It turns out that writing questions which require reasoning, in a repeatable way, is quite hard.


> The model has only to select which span in the reading passage gives the best answer -- i.e., which sequence of words already in the text best answers the question.[a]

Sounds like they've reinvented Jeopardy Watson's ability to excel at Q&A, but 12 years later.


Wish I had one of these for the SAT


Great result. At my job I manage a machine learning team, so I am pretty much all-in on deep learning to solve practical problems.

That said, I think the path to 'real' AGI lies in some combination of DL, probabilistic graph models, symbolic systems, and something we have not even imagined yet. BTW, Judea Pearl just released a good paper on the limitations of DL: https://arxiv.org/abs/1801.04016


> That said, I think the path to 'real' AGI lies in some combination of DL, probabilistic graph models, symbolic systems, and something we have not even imagined yet.

Well, that really made it much clearer to me ;)


He just means all the things that we already know don't work + something that we don't know about yet that will. (emphasis on the part we have no fucking clue about).


I definitely hope we invent something better than vaguely defined PGMs that are a nightmare to get right and usable.


Wow thanks for that Pearl link! I know how I'll be spending my afternoon......


It would be interesting to know how well some of the entries on the SQuAD page do on the Winograd Schema Challenge (https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS....). Does anyone know if any of the systems have been tested on that as well?


I am always annoyed by claims in supervised learning that a machine predictor is better than humans. Humans are obviously the ones who scored the dataset to begin with. If you read the paper, it goes on to say, regarding human evaluation:

> Mismatch occurs mostly due to inclusion/exclusion of non-essential phrases (e.g., monsoon trough versus movement of the monsoon trough) rather than fundamental disagreements about the answer.

I don't think I would call that "error" so much as ambiguity. In other words, there's more than one possible answer to the questions under these criteria -- English isn't a formal grammar where there's always one and only one answer. For instance, here's one of the questions from the ABC Wikipedia page:

> What kind of network was ABC when it first began?

> Ground Truth Answers: "radio network", "radio", "radio network"

> Prediction: October 12, 1943

Because the second human said "radio" instead of "radio network," I believe this counts as a human miss. But the answer is factually correct. Meanwhile, the prediction from the Stanford logistic regression (not the more sophisticated Alibaba model in the article, whose results I don't think are published at this level of detail) is completely wrong. No human could make that mistake. And yet these are treated as equally flawed answers by the EM metric.

And yet this gets headlined as "defeats humans," not "learns to mimic human responses well."


> I am always annoyed by claims in supervised learning that a machine predictor is better than humans. Humans are obviously the ones who scored the dataset to begin with

For some problems, sure. For prediction tasks, on the other hand, you have an actual ground truth that can be compared to humans' a priori predictions.

Neural-net NLP results are rarely about actual intelligence or clever use of latent variables the model figured out; they're more about "pattern matching," which explains why the errors are so different from human errors. The model doesn't actually understand the problem; it finds tricks that happen to solve the questions, exploiting regularities in the dataset that we humans can't really see.


Thinking about this further: if the computer answers "radio," it scores a correct answer on the EM metric, even though that same answer counts as a miss for the humans, assuming I am reading the paper correctly. That seems like a bad way to evaluate this.
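
For reference, here's a minimal sketch of how SQuAD-style exact match works, loosely following the official evaluation script (I'm paraphrasing the normalization from memory; the toy data is from the ABC example above):

    import re
    import string

    def normalize_answer(s):
        # Lowercase, drop punctuation and articles, collapse whitespace,
        # roughly as in the official SQuAD evaluation script.
        s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
        s = re.sub(r"\b(a|an|the)\b", " ", s)
        return " ".join(s.split())

    def exact_match(prediction, ground_truths):
        # EM gives credit if the prediction matches ANY ground truth.
        return any(normalize_answer(prediction) == normalize_answer(g)
                   for g in ground_truths)

    # The model is scored against all three human answers:
    print(exact_match("radio", ["radio network", "radio", "radio network"]))  # True

    # Human performance holds out one annotator's answer as the
    # "prediction", so "radio" vs. the other two answers is a miss:
    print(exact_match("radio", ["radio network", "radio network"]))  # False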


How well do these do on Winograd challenges?

https://aaai.org/Conferences/AAAI-18/aaai18winograd/


This is extractive question answering rather than reasoning, so Winograd schemas will be challenging for it.

Nevertheless, most extractive systems learn some degree of co-reference resolution.

I have a less advanced system than the Alibaba one, and it got both example questions correct:

The trophy would not fit in the brown suitcase because it was too big. What was too big?

and

The town councilors refused to give the demonstrators a permit because they feared violence. Who feared violence?


This is clickbait. Unless models are robust to adversarial examples in SQuAD, such as those described here: https://arxiv.org/abs/1707.07328, doing really well on SQuAD doesn't mean a ton.
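
For anyone who hasn't read it: that paper (Jia & Liang, 2017) appends a distractor sentence that mimics the question's surface form without answering it, and SQuAD models' accuracy drops sharply. Here's a toy Python sketch of the idea; the strings are made up for illustration, and the real attack perturbs the question automatically and crowdsources grammaticality checks:

    # Toy illustration of an AddSent-style attack: append a sentence
    # that looks like the question but is irrelevant to the answer.
    passage = ("Peyton Manning became the first quarterback to lead two "
               "different teams to a Super Bowl.")
    question = ("Who was the first quarterback to lead two different "
                "teams to a Super Bowl?")

    # The distractor mirrors the question's wording with swapped entities:
    distractor = ("Jeff Dean was the first programmer to lead two "
                  "different teams to a Code Bowl.")

    adversarial_passage = passage + " " + distractor
    print(adversarial_passage)

    # A model that merely pattern-matches question words against the
    # passage may now extract "Jeff Dean" instead of "Peyton Manning",
    # even though no human would be fooled.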


Can't they simply include adversarial examples in the unreleased test set?


At NIPS 2017 there was a system which beat humans in a college QuizBowl competition. In many ways I think that was more impressive than excellent performance on SQuAD.


Kudos to my colleagues. The iDST team is based in Bellevue, WA and hiring more people. Let me know if you're interested.

Also, Alibaba Cloud is looking for engineers. Please check https://careers.alibaba.com/positionDetail.htm?positionId=b7...


@syllogism, have you thought about a demo combining spaCy + ____ to tackle SQuAD (https://rajpurkar.github.io/SQuAD-explorer/)?


DrQA can use spaCy as a tokenizer and scores about 2 points lower on SQuAD. https://github.com/facebookresearch/DrQA#tokenizers


A counterpoint from Yoav Goldberg:

http://u.cs.biu.ac.il/~yogo/squad-vs-human.pdf


Is this still impressive in 2018? I honestly don't know.


Yeah, I don't think it is. The year 2017 brought out some really bad humans, and it turns out you don't really need better algorithms to beat them at linguistics. See for yourself in this video: https://www.youtube.com/watch?v=L0eY5TGEK2I


Cool. An AMP page. Makes it look like Google published this article.



Seeing (google.com) in the title definitely influenced my click.


People who think AMP is a solution and not just another problem need to take this effect into account. Google isn’t just acting like a CDN here, they’re being intentionally misleading.


Normally I'd agree, but I just got a redirect to the ZDNet page.

Did someone change the link?


Maybe not intentional but certainly misguided.



Flagged and voted this up. HN should scan for AMP links and ask the submitter to fix the link.


I realized my mistake after posting, but couldn't edit the URL. Sorry about that.


Despite all the problems with AMP that have been discussed elsewhere, the real ZDNet page hijacked my browser, redirecting me to an ad page with no way to get back. In cases like this, I think AMP would be preferable.


[flagged]


What? It’s a swimsuit for kids. Am I missing something here?


I think the parent is concerned that it has the word 'sexy' in the title. In reality I suspect this is just absentminded SEO tactics, or the result of an automated system. It's even easier to do if English isn't your native language and it just becomes "put these magic words in the title to rank higher."


Huh? That's a very normal-looking swimsuit. It just has the word "sexy" in the description. Sometimes children even swim naked! I think you're being too puritanical. It's not child abuse to sell children's clothes.

I should add that pedophilia is a sexuality, not a crime. Not all pedophiles abuse children, and we should stop painting them all with the same broad brush as child abusers.


Yes, not all pedophiles act on their urges; yes, it may be stupid SEO tactics; yes, OP is a prude; but that whole page is just begging to be labeled as pedophilic in nature.



