Outrageously Large Neural Nets: Sparsely-Gated Mixture-of-Experts Layer (2017) (arxiv.org)
65 points by msoad on June 8, 2019 | 33 comments



Discussion from two years ago, at the time of publication:

https://news.ycombinator.com/item?id=13518039


N00b here.

I can’t tell if this is primarily intended to provide effective sharding of a largely homogeneous network, or if it’s intended to allow incorporating diverse networks and using the gating to classify and route inputs to the appropriate network.


Heh. I'm not sure you're wrong to say it either way.


This work uses conditional computation to allow for "Outrageously large" networks which may work better in practice but which will be even harder to understand.

I'm very interested in working on using this same sort of conditional computation to make reasonably sized neural networks easier to understand. Has anyone seen papers on this sort of work?
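
For reference, the gating in this paper keeps only the top-k experts per example, so most of the network is never evaluated for any given input. A rough numpy sketch of that noisy top-k gating idea (the shapes and weight names here are illustrative, not the authors' code):

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def noisy_top_k_gate(x, W_g, W_noise, k=4):
        """Sparse mixture weights: only k experts per example are nonzero."""
        clean = x @ W_g                                 # raw gate logits
        noise = np.random.randn(*clean.shape) * np.log1p(np.exp(x @ W_noise))  # softplus-scaled noise
        h = clean + noise                               # noisy logits (helps load balancing)
        top_k = np.argsort(h, axis=-1)[:, -k:]          # indices of the k largest logits
        masked = np.full_like(h, -np.inf)
        np.put_along_axis(masked, top_k, np.take_along_axis(h, top_k, axis=-1), axis=-1)
        return softmax(masked)                          # zero outside the top k

    # Toy usage: 8 inputs of width 16 routed over 32 experts, 4 active per input.
    x = np.random.randn(8, 16)
    gates = noisy_top_k_gate(x, np.random.randn(16, 32), np.random.randn(16, 32), k=4)
    # Each expert's output is weighted by gates[:, i]; experts whose gate weight
    # is zero are simply never run -- that is the conditional computation.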


What is it with needing to understand NNs? It's the same thing as not trusting a human to drive a car because you cannot understand every stage of computation happening in their brain. If a neural network learns a task, test it well enough to know that it performs well enough in the target domain before using it. Don't expect to understand how it works and, from there, claim to know that it will work well and have more confidence in it; that approach barely works in standard software, let alone a neural network. Stop worrying and learn to love the NN; after all, it is a mirror of your own ineffable nature.


"Those are scary things, those gels. You know one suffocated a bunch of people in London a while back?"

Yes, Joel's about to say, but Jarvis is back in spew mode. "No shit. It was running the subway system over there, perfect operational record, and then one day it just forgets to crank up the ventilators when it's supposed to. Train slides into station fifteen meters underground, everybody gets out, no air, boom."

"These things teach themselves from experience, right?," Jarvis continues. "So everyone just assumed it had learned to cue the ventilators on something obvious. Body heat, motion, CO2 levels, you know. Turns out instead it was watching a clock on the wall. Train arrival correlated with a predictable subset of patterns on the digital display, so it started the fans whenever it saw one of those patterns."

"Yeah. That's right." Joel shakes his head. "And vandals had smashed the clock, or something."


To summarise: if a system is not understood there exists the possibility of sudden, unexpected harm. The system is unsafe. This is (probably) ok if the system is putting icing on donuts (you might get a bad batch) but is definitely not ok if the system is deciding on dosing levels for drugs or controlling machines that could suddenly smash into queues of school children.


Moreover, if (as is customary in all technology) you build systems upon these systems, even low malfunction probabilities will multiply into nearly assured failures. With a system you understand, you can find the cause and fix the issue for all: this is what allows us to build ever more complex systems over decades of engineering R&D.

But with a system you don't understand, you are at the whim of cascading errors in the underlying behavior.
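
To put rough numbers on the compounding argument (an illustrative back-of-the-envelope calculation, not a claim about any particular system):

    # Illustrative only: if a stack has n components that each misbehave
    # independently with probability p, the whole stack behaves with
    # probability (1 - p)**n.
    p = 0.001                      # a "99.9% reliable" component
    for n in (1, 10, 100, 1000):
        print(n, round((1 - p) ** n, 3))
    # 1 -> 0.999, 10 -> 0.99, 100 -> 0.905, 1000 -> 0.368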


Two counterexamples:

1. A CPU you rely on is a well-understood system, yet unexpected failures do happen (e.g. the Pentium bug, Spectre exploits, etc.).

2. A human you rely on might fail unexpectedly (tired, drunk, heart attack, going crazy, embracing terrorism, etc).

After a reasonable amount of testing, if NNs perform more reliably than the other systems we currently rely on, it will be increasingly hard to justify using those other systems, especially when that means more people accidentally dying every year.


For example 2, we hold the human accountable and we have a system of social and administrative controls that attempt to mitigate these risks. For example, if you are a pilot and you show up drunk the flight crew will report you... because they don't want to be killed in the crash you may precipitate.

For 1 - well, actually CPUs are not as well understood as they might be. There are verified CPUs and stacks, but the pace of Moore's law and the cost of verification have meant that the performance gap between them and commodity chips has been huge. In 20-30 years' time I expect that gap will be small and we will see investment in fully verified stacks for many applications.

If the argument about accidental death reduction were true, then wouldn't we have abandoned personal cars and insisted on public transit only? I believe that deaths per distance travelled on rail are far lower than for automobiles (in the US, which appears to have the worst record for rail: one death per 3.4 billion passenger-km versus one per 222 million passenger-km on roads, more than an order of magnitude).

[https://pedestrianobservations.com/2011/06/02/comparative-ra...]


I believe "With a system you understand you can find the cause and fix the issue for all" covers the first case. The second case is applicable to both NN and traditional control systems.


So which system would you personally prefer to rely on in a life-or-death situation: the one that is well understood (accident rate 0.0001%), or the one that is poorly understood (accident rate 0.000001%)?


Source: Peter Watts, Starfish


Peter Watts, +1. :)

To paraphrase some reviewer, "whenever I feel my will to live growing too strong I read some of this guy."

Edit: Buy the Firefall compendium. Just do.


Because "computer says so" isn't good enough when your decisions affect someone else's life. The standards are set in other areas of statistical modelling, not other areas of software development.

There are countless examples in the public domain of AI systems improperly discriminating on race and gender when processing job applications, aiding medical diagnoses or targeting ads for housing and career opportunities. These outcomes can vary from annoying to morally questionable to explicitly illegal, and are all the more common when you don't really understand what your model is doing.

Statistical inference is important, and it's hard to get right. Centuries of thought have been put into methods to explain the effect of X on Y, accounting for Z. It's not the same goal as maximising the AUC or minimising the MSE. In many cases too, it's far more important.


If “computer says so” and it’s been shown to be right more often than my doctor, I will trust the computer more. I already have very little trust in what doctors say, because their error rates are pretty high.


You're talking about prediction and classification, not inference.

Why was I denied car insurance? How come I didn't get past the initial algorithmic screening phase of that job application? If you can't answer those questions properly you're in legal trouble in many jurisdictions.

So going back to the original comment, that's why people are trying to understand neural networks, and why statistical inference and not just raw predictive power is important.


Would you prefer the model which is more correct, or more explainable? Let's assume for a second it's one or another.


Because human interaction (whether human to human or human to machine) relies heavily on trust. Humans implicitly trust other humans more because we've evolved to have empathy. It makes it easier to predict what will happen with another human because we can model their behavior by putting ourselves in their shoes. That's also why we are so uneasy around those whom we cannot empathize with (e.g., sociopaths).

While validation is a proxy for this trust, it's not the same because the modeling may be incomplete or we don't have the same evolutionary heuristics to understand aberrant behavior. So we tend to want to interpret the model explicitly.



Can you elaborate on what you think might make it easier?


I’m imagining that by having conditional computation you’re basically partitioning the state space, with each part of the space getting its own smaller network. This might allow you the generality of a larger network, but with the ability to home in on and analyze the smaller networks for certain parts of your state space.
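
A hedged sketch of what that analysis could look like: once a gate routes each input to a small set of experts, you can at least tally which expert handles which region of the input space (the gate weights and the two input regimes below are stand-ins, not anything from the paper):

    import numpy as np

    # With k=1 and no noise, the gate just picks the expert with the largest
    # logit; count how often each expert is chosen for two different regimes.
    rng = np.random.default_rng(0)
    W_g = rng.normal(size=(16, 32))                   # stand-in gate weights
    region_a = rng.normal(loc=+2.0, size=(500, 16))   # stand-in for one input regime
    region_b = rng.normal(loc=-2.0, size=(500, 16))   # stand-in for another

    for name, batch in [("region A", region_a), ("region B", region_b)]:
        chosen = np.argmax(batch @ W_g, axis=-1)      # winning expert per input
        top = np.bincount(chosen, minlength=32).argsort()[-3:][::-1]
        print(name, "is mostly routed to experts", top)
    # If the counts concentrate on a few experts per region, each small expert
    # can be analyzed on its own slice of the state space.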


Neural networks are NOT hard to understand.

Tools like LIME[1] have been around for a few years, and work really well for inspecting and understanding the decisions made by NNs.

It's true that complex decisions are hard to explain, but LIME does a better job than (say) any way of trying to explain an SVM with human-coded features.

LIME has been out since 2016. It and similar techniques are widely used in production systems.

It's time this "NNs are hard to understand" idea died.

[1] https://github.com/marcotcr/lime
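
For anyone who hasn't tried it, a minimal sketch along the lines of the LIME README for a tabular classifier (the data, model and feature names are placeholders):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from lime.lime_tabular import LimeTabularExplainer

    X = np.random.rand(200, 4)                      # placeholder training data
    y = (X[:, 0] + X[:, 1] > 1).astype(int)         # placeholder labels
    model = RandomForestClassifier().fit(X, y)

    explainer = LimeTabularExplainer(
        X, feature_names=["f0", "f1", "f2", "f3"],
        class_names=["no", "yes"], mode="classification")

    # Explain one prediction: LIME fits a simple local surrogate around this
    # point and reports which features pushed the prediction up or down.
    exp = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
    print(exp.as_list())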


How does it explain adversarial examples, where changing a single pixel leads to a change in the predicted class?



The problem is we have very little understanding of how decision boundaries are formed, and very little intuition about what the input manifold “looks like”. That’s why it’s so difficult to make sure the boundaries have a large enough margin (i.e., are robust for all reasonable samples).

After reading papers like this it feels like, while we make some progress towards understanding some aspects of NN operation, we uncover deeper aspects that we don’t understand. This process itself looks like gradient descent (saddles everywhere!), and we are still pretty far from the global minimum. :)


Sure.

But have you ever tried explaining an SVM decision to an actual human before?

I'm not convinced that an NN is any worse than that.


This is from 2017, so probably obsolete by now.


It is. For example, it uses LSTM, which is obsolete now.


People keep saying this with extreme confidence; I’m not sure I buy it.

Certainly recurrent networks in general are not obsolete, even if attention/convolution works better for some applications.

Perhaps one ought to try GRU before LSTM but there’s no reason to suppose that it would dominate in all cases.
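
For what it's worth, trying both is cheap; in a typical framework the two are nearly drop-in replacements for each other (a PyTorch sketch with made-up sizes):

    import torch
    import torch.nn as nn

    x = torch.randn(8, 50, 32)                       # (batch, time, features)

    lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
    gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)

    out_lstm, (h_n, c_n) = lstm(x)                   # LSTM carries a separate cell state
    out_gru, h_gru = gru(x)                          # GRU keeps only the hidden state
    print(out_lstm.shape, out_gru.shape)             # both torch.Size([8, 50, 64])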


Indeed. Here is a very recent paper finding that attention is certainly not all you need, as recurrence is sometimes necessary.

https://arxiv.org/abs/1906.01603

This is also obvious: Without recurrence you cannot remember information that is not externally visible, but it may be computationally very convenient and often necessary to maintain information that is hidden.

The hard part is learning representations for hidden information, as recurrences are plagued by vanishing and shattering gradients.


What is the modern choice now, instead of LSTM?


Transformer.



