This is not quite human-level question-answering in the everyday sense of those words. The ZDNet headline is too clickbaity for my taste.
The answer to every question in the test is a preexisting snippet of text, or "span," from a corresponding reading passage shown to the model. The model has only to select which span in the reading passage gives the best answer -- i.e., which sequence of words already in the text best answers the question.[a]
[a] If this explanation isn't entirely clear to you, it might help to think of the problem as a challenging classification task in which the number of possible classes for each question is equal to the number of possible spans in the corresponding reading passage.
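To make the classification framing concrete, here's a toy sketch (the function name and example passage are my own, not from the paper) that enumerates every contiguous token span in a passage; that set of spans is the candidate "class" set a span-selection model scores over:

```python
def candidate_spans(passage):
    # Every contiguous sequence of tokens is a candidate answer "class".
    tokens = passage.split()
    return [" ".join(tokens[i:j])
            for i in range(len(tokens))
            for j in range(i + 1, len(tokens) + 1)]

passage = "ABC launched as a radio network"
spans = candidate_spans(passage)
# A 6-token passage yields 6*7/2 = 21 candidate spans,
# one of which ("radio network") is the answer.
print(len(spans))                # 21
print("radio network" in spans)  # True
```

A real passage of a few hundred tokens yields tens of thousands of candidate spans, which is why the task is hard as a classification problem but still much easier than generating an answer from scratch.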
Before you blink an eye there will be some MBA-types working on PowerPoint proposals with detailed cost-benefit analyses for using those new AI machines they heard about that can read better than human beings. Needless to say, the technology will fall far short of expectations.
This is why there have been two AI winters already.
I think it's incumbent on people like you to get the word out that ML isn't going to put everyone's jobs at risk. Between this and self-driving cars, local governments are beginning to weigh spending tax dollars on these boondoggles instead of on proven modes like public transit.
The futurist writers peddling this stuff need to take a moment to chill and learn about the actual state of the underlying technology.
I'm currently working on improving an existing QA-style dataset and exploring how to move machine reading comprehension away from span selection and toward some kind of reasoning. It turns out that writing questions that require reasoning, in a repeatable way, is quite hard.
> The model has only to select which span in the reading passage gives the best answer -- i.e., which sequence of words already in the text best answers the question.[a]
Sounds like they've reinvented Jeopardy Watson's ability to excel at Q&A, but 12 years later.
Great result. I manage a machine learning team at my job, so I'm pretty much all-in on deep learning for solving practical problems.
That said, I think the path to 'real' AGI lies in some combination of DL, probabilistic graph models, symbolic systems, and something we have not even imagined yet. BTW, Judea Pearl just released a good paper on the limitations of DL: https://arxiv.org/abs/1801.04016
> That said, I think the path to 'real' AGI lies in some combination of DL, probabilistic graph models, symbolic systems, and something we have not even imagined yet.
He just means all the things that we already know don't work + something that we don't know about yet that will. (emphasis on the part we have no fucking clue about).
It would be interesting to know how well some of the entries on the SQuAD leaderboard do on the Winograd Schema Challenge (https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS....). Does anyone know if any of the systems have been tested on that as well?
I am always annoyed at claims in supervised learning that a machine predictor is better than humans. Humans obviously are the ones that scored the dataset to begin with. If you read the paper, it goes on to say, in regards to human evaluation:
> Mismatch occurs mostly due to inclusion/exclusion of non-essential phrases (e.g., monsoon trough versus movement of the monsoon trough) rather than fundamental disagreements about the answer.
I don't think I would call that "error" so much as ambiguity. In other words, there's more than one possible answer to the questions under these criteria -- English isn't a formal grammar where there's always one and only one answer. For instance, here's one of the questions from the ABC Wikipedia page:
> What kind of network was ABC when it first began?
> Ground Truth Answers: "radio network", "radio", "radio network"
> Prediction: October 12, 1943
Because the second human said "radio" instead of "radio network," I believe this would count as a human miss. But the answer is factually correct. Meanwhile, the prediction from the Stanford logistic regression (not the more sophisticated Alibaba model in the article, where I don't think results are published at this detail) is completely wrong. No human could make that mistake. And yet these are treated as equally flawed answers by the EM metric.
And yet this gets headlined as "defeats humans," not "learns to mimic human responses well."
> I am always annoyed at claims in supervised learning that a machine predictor is better than humans. Humans obviously are the ones that scored the dataset to begin with
For some problems, sure. For prediction tasks, on the other hand, you have an actual ground truth that can be compared to a human's a priori prediction.
Neural-net NLP results are rarely about actual intelligence or clever use of latent variables the model figured out; they're more about "pattern matching," which explains why the errors are so different from human errors. The model doesn't actually understand the problem; it finds tricks that exploit regularities in the dataset that we humans can't really see.
Thinking about this further: if the computer answers "radio," it scores a correct answer on the EM metric, even though that same answer counts as a miss for the humans, assuming I am reading the paper correctly. That seems like a bad way to evaluate this.
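For reference, here's a rough sketch of how SQuAD-style exact-match scoring is commonly computed; the function names are mine, and the normalization (lowercasing, stripping punctuation and articles, collapsing whitespace) is an approximation of what the official evaluation script does. The key point is that the prediction gets credit if it matches ANY annotator's answer:

```python
import re
import string

def normalize(text):
    # Lowercase, drop punctuation and articles, collapse whitespace --
    # roughly what the official SQuAD evaluation script does.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truths):
    # Credit is given if the prediction matches ANY ground-truth answer.
    return max(int(normalize(prediction) == normalize(gt))
               for gt in ground_truths)

truths = ["radio network", "radio", "radio network"]
print(exact_match("radio", truths))             # 1: matches the second annotator
print(exact_match("October 12, 1943", truths))  # 0: completely wrong
```

So the machine answering "radio" is scored against the union of human answers, while each individual human is scored against the other humans' answers only, which is the asymmetry being complained about here.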
This is clickbait. Unless models are robust to adversarial examples in SQuAD, such as those described here: https://arxiv.org/abs/1707.07328, a model doing really well on SQuAD doesn't mean much.
At NIPS 2017 there was a system which beat humans in a college QuizBowl competition. In many ways I think that was more impressive than excellent performance on SQuAD.
Yeah, I don't think it is. Year 2017 brought out some really bad humans, and it turns out, you don't really need better algorithms to beat them in linguistics. See for yourself in this video: https://www.youtube.com/watch?v=L0eY5TGEK2I
People who think AMP is a solution and not just another problem need to take this effect into account. Google isn’t just acting like a CDN here, they’re being intentionally misleading.
Despite all the problems with AMP that have been discussed elsewhere, the real zdnet page hijacked my browser, redirecting me to an ad page with no way to get back. In cases like this, I think AMP would be preferable.
I think the parent is concerned that it has the word 'sexy' in the title. In reality I suspect that this is just absentminded SEO tactics, or the result of an automated system. It's even easier to do if English isn't your native language and it just becomes "put these magic words in the title to rank higher."
Huh? That's a very normal looking swimsuit. It just has the word "sexy" in the description. Sometimes children even swim naked! I think you're being too puritanical. It's not child abuse to sell children's clothes.
I should add that pedophilia is a sexuality, not a crime. Not all pedophiles abuse children, and we should stop painting them all with the same broad brush as child abusers.
Yes, not all pedophiles act on their urges, yes, it may be stupid SEO tactics, yes, OP is a prude, but that whole page is just begging to be labeled as pedophilic in nature.
Actual current results:
https://rajpurkar.github.io/SQuAD-explorer/
Paper describing the dataset and test:
https://arxiv.org/abs/1606.05250