Outrageously Large Neural Networks: The Sparsely-Gated Mixture-Of-Experts Layer

2bitencryption · on Jan 30, 2017

Designing a neural network is a thousand times harder than I imagined.

After AlphaGo, I tasked myself with creating a neural network that would use Q-Learning to play Reversi (aka Othello).

At that point, I had already utilized Q-Learning (the tabular version, not using a neural network) for some very simple and mostly proof-of-concept projects, so I understood how it worked. I read up only perceptrons, relu, the benefits/disadvantages of having more/fewer layers, etc.

Then I actually started on the project, thinking "I know about Q-Learning, I know about neural networks, now I just need to use Keras and I'll have a network ready to learn in about twenty lines of python."

Boy was that naive. Regardless of how much you understand the CONCEPTS of neural networks, actually putting together an effective one that matches the problem state perfectly is so, so difficult (especially if there are no examples to build off of). How many layers? Dropout or no, and if so, how much? Do you flatten this layer, do you use relu, do you need a SECOND neural network to approximate one part of the q-function and another to approximate a different part?

I spent MONTHS messing with the hyperparameters, and got nowhere because I'm doing this on a desktop pc without CUDA, so it takes days to train a new configuration only to find out it hardly "learned" anything.

At one point after days of training, my agent actually had a 90% LOSE rate against an opponent that played totally randomly. To this day I am baffled by this.

I went into the project thinking "I have this working with a table, the q-learning part is in place -- just need to drop in a neural net in place of the table and I'm good to go!" It's been almost a year and I still haven't figured this thing out.

karpathy · on Jan 30, 2017

If it makes you feel any better, I've been doing this for a while and it took me last ~6 weeks to get a from-scratch policy gradients implementation to work 50% of the time on a bunch of RL problems. And I also have a GPU cluster available to me, and a number of friends I get lunch with every day who've been in the area for the last few years.

Also, what we know about good CNN design from supervised learning land doesn't seem to apply to reinforcement learning land, because you're mostly bottlenecked by credit assignment / supervision bitrate, not by a lack of a powerful representation. Your ResNets, batchnorms, or very deep networks have no power here.

SL wants to work. Even if you screw something up you'll usually get something non-random back. RL must be forced to work. If you screw something up or don't tune something well enough you're exceedingly likely to get a policy that is even worse than random. And even if it's all well tuned you'll get a bad policy 30% of the time, just because.

Long story short your failure is more due to the difficulty of deep RL, and much less due to the difficulty of "designing neural networks".

akhilcacharya · on Jan 30, 2017

If we're not researchers, when is it still even neccessary to develop new architectures? Seems like these days most DL applications in Computer Vision just use pretrained models with transfer learning.

anigbrowl · on Jan 30, 2017

If you're not a researcher, why are you doing it? If you have any other motivation besides discovery, expect that to get automated away in short order and whatever you do to rapidly become obsolete. Curiosity is your best asset.

akhilcacharya · on Jan 31, 2017

Many image problems can still be solved by machine learning solutions - if I don't need to spend time re-inventing the wheel, why do so?

option_greek · on Jan 30, 2017

Pardon my naive question: Is there any point to RL apart from automatically generating labels to a SL network ?

karpathy · on Jan 30, 2017

Not sure if I understand "automatically generating labels to a SL network". I don't believe RL is used in this setting.

RL is about learning expected-reward-maximizing policies for environments that you get to interact with. Common benchmarks currently mostly include games (e.g. ATARI, AlphaGo, VizDoom), physics-based animation (e.g. https://www.cs.ubc.ca/~van/papers/2016-TOG-deepRL/index.html), or (simulated) robotics-like tasks (e.g. MuJoCo). But the core algorithms (such as policy gradients) can be used more generally in settings that don't necessarily look like environments as usual, but where you want to train a network with stochastic nodes, such as in hard attention, etc.

RL is a funny area; A lot of AI researchers get excited about it (mostly motivated by its promise as the formalism that leads to AGI), and yet despite the hype it has so far had very little impact in the industry so far (the Google data center application possibly being an exception, though it was more "RL" than RL, with quotes). It has some promise for Robotics in the real world, but not applied directly and naively. The way that will play out is likely through behavior cloning on human demonstrations or on outputs of trajectory optimizers from simulation, or possibly RL fine-tuning in simulation transferred to real world. But it's still quite early to tell.

On this topic, fun story, the most impressive robots I'm aware of right now are from Boston Dynamics and as they mentioned at this year's NIPS they use ZERO machine learning. Forget deep learning or even deep reinforcement learning. Zero Machine Learning.

I gave a talk last week about some of our RL experiments @ OpenAI and someone came to me after the talk, described their (straight forward) supervised learning problem and asked me how they can apply RL to it. This, to me, is an alarming sign of damaging hype to the community. You don't use RL for your SL problems. You can if you really want to (e.g. reward = 1.0 if you guess the correct label or -1.0 otherwise), but you really don't want to. You're lucky, use your labels, business as usual.

option_greek · on Jan 30, 2017

Thank you for the explanation. It somehow seems to me that most of the RL problems can be converted to some form of SL. For example can the pong RL solution using PG (thank you so much for that article btw) not be converted to SL by recording a human player actions for a while and labeling them based on rewards achieved ?

gugagore · on Jan 30, 2017

Learning a human's actions in most cases is probably a supervised learning problem. But in that case, you actually don't even want to look at the rewards. You just want to know what a human did given a specific scenario.

However, any time you have a reward signal (like the score of the game) in a multi-step decision problem, like a game where you take actions sequentially (e.g. once per turn), you need RL machinery to make sense of the data. Maybe you take an action now, and you might only reap the reward of that action in the future. So how do you "label" the action right now? You label it with some measure that takes into account the future of the reward signal.

So some human plays a game and gets a super high score. You only see that they got a high score at the end of the game. How do you go back and label the 150 actions that led you to the score? That's the part that is RL.

option_greek · on Jan 31, 2017

Thank you. It makes more sense now.

spangry · on Jan 30, 2017

Honestly I'm still struggling to grasp with RL is. Can it roughly be conceptualised as: a neural network wrapped in a genetic algorithm that re-writes the structure/parameters of the network at the end of each 'pass' (according to some target you're optimising for to determine model 'fitness')?

If this is roughly what it is, it sounds like a logical place to apply some kind of automated meta-programming. I recently stumbled across a Python ML framework that I think used Jinja2 templates for this purpose...

mattkrause · on Jan 31, 2017

Focusing on the "neural network" part might be confusing you.

Classification/supervised learning is essentially about learning labels. We have some examples that have already been labeled: cats vs. dogs, suspicious transactions vs legitimate ones, As vs Bs vs...Zs. From those examples, we want to learn some way to assign new, unlabeled instances, to one of those classes.

Reinforcement learning, in contrast, is fundamentally about learning how to behave. Agents learn by interacting with their environment: some combinations of states and actions eventually lead to a reward (which the agent "likes") and others do not. The reward might even be disconnected from the most recent state or action and instead depend on decisions made earlier. The goal is to learn a "policy" that describes what should be done in each state and balances learning more about the environment ("exploration", which may pay off by letting us collect more rewards later) and using what we know about the environment to maximize our current reward intake (exploitation). Games are a particularly good test-bed for reinforcement learning because they have fairly clear states (I have these cards in my hand, or that many lives, etc), actions ("hit me!", "Jump up") and rewards (winnings, scores, levels completed). There's also an obvious parallel with animal behavior, which is where the name originated.

In both cases, neural networks are useful because they are universal function approximators. There's presumably some very complex function that maps data onto labels (e.g., pixels onto {"DOG", "CAT"}) for supervised learning, and states onto action sequences for reinforcement learning. However, we usually don't know what that is, and can't fit it directly, so we let neural networks learn it instead. However, you can do both supervised learning and reinforcement learning without them (in fact, until recently, nearly everyone did).

However, the network typically doesn't get "rewritten" on the fly. Instead, it does something like estimate the value of a state or state-action pair.

spangry · on Feb 1, 2017

Thanks for taking the time to explain this to me, it's definitely helped my understanding. It sounds somewhat 'static', assuming I haven't misinterpreted. Where does the 'learning' part come in? Again, this is almost certainly due to my lack of knowledge, but it sounds like RL essentially brute-forces the optimal inputs for each statically defined 'action function'. Meaning the usefulness of the model depends entirely on how well you've initially specified it, meaning the problem is really solved through straight-forward analysis.

(I've obviously gone wrong somewhere here... Just walking you through my thought process)

mattkrause · on Feb 1, 2017

You're welcome!

The agent "learns" by stumbling around and interacting with its environment. At the beginning, its behavior is pretty random, but as it learns more and more, it refines its "policy" to collect more rewards more quickly.

Brute force is certainly possible for some situations. For example, suppose you're playing Blackjack. You can calculate the expected return from 'hitting' (taking another card) and 'standing' (keeping what you've got), based on the cards in your hand and the card the dealer shows.

So...brute force works for simple tasks, but in a lot of situations, it's hard to enumerate all possible states (chess has something like 10^47 possible states) and state-action pairs. It's also difficult to "assign credit"--you rarely lose a chess game just because of the last move. These make it difficult to brute-force a solution or find one via analysis. However, the biggest "win" for using RL is that it's applicable to "black box" scenarios where we don't necessarily know everything about the task. The programmer just needs to give it feedback (though the reward signal) when it does something good or bad.

Furthermore, depending on how you configure the RL agent, it can react to changes in the environment, even without being explicitly reset. For example, imagine a robot vacuum that gets "rewarded" for collecting dirt. It's possible that cleaning the room changes how people use it and thus, changes the distribution of dirt. With the right discounting setup, the vacuum will adjust its behavior accordingly.

ced · on Jan 30, 2017

Do you know about expected utility? Optimal behaviour (of any kind) can be framed as "At each step, pick the action that maximizes your expected utility." So, for instance, you might study hard tonight because it'll lead you to pass your exam tomorrow and get a high-paying job later. In that scenario, studying's utility is higher than going out for a beer.

Reinforcement learning's goal is either to estimate each action's expected utility (possibly using neuron networks), or to directly learn what the best action to take is in any given situation, without bothering with utility estimation.

posterboy · on Jan 30, 2017

robotic movement is maybe not exactly an examplary domain, because biologically a good share is controlled by the parasympathic nervous system, ie reflexes. Maybe robotics needs to get that right first with a strong focus on the mechatronics before the software becomes relevant, just as bipedal movement requires from a baby to build up the muscles.

On the other hand, maybe evolution of a nervous system is analogous to some form of neural learning, but on a huge timescale, and maybe the scale is proportional to the searchspace complexity.

edit: Well, reflexes are also propagated by neurons, but on a short circuit and thus readily modeld by PDI etc. Noticing the similar (?) reliance on differentials and coefficients, the advantage is that a subset of PDI has perfect solutions.

pilooch · on Jan 30, 2017

Any link to that talk, or was it internal to openai ? Thanks :)

deepnotderp · on Jan 30, 2017

This 1000x. Drl is a totally different beast than imagenet.

Also listen to karpathy, he knows his stuff, rather than random me :D

philipov · on Jan 30, 2017

> At one point after days of training, my agent actually had a 90% LOSE rate against an opponent that played totally randomly.

Nice! You should have just added a bit at the end to invert whatever answer it got, and you would have had a winner.

But more seriously, I think that we will become more clever with designing genetic algorithms to evolve the neural networks as part of the training process rather than trying to build our own from scratch every time. I vaguely recall there is some research being done on that front already.

deepnotderp · on Jan 30, 2017

Neural architecture search with reinforcement learning. We've used an actor critic method internally with good results as well.

mrjoeblack · on Jan 30, 2017

The fact that a random opponent performs better means that simply inverting the output of a bad strategy (assuming that is even possible, in cases where the output is more complex than binary it should not be) would just give you another bad strategy.

philipov · on Jan 30, 2017

> ...would just give you another bad strategy.

That's the joke, but to be fair, inversion is not a binary concept. Negation is binary, but inversion is more general and has the 2D interpretation of reflecting something across X=Y.

deepnotderp · on Jan 30, 2017

1. For drl,you usually don't want a ResNet actually, since reward assignment is difficult.

2. Almost always,.5 is usually good, but you can tune this.

2. You usually flatten right before the final layers.

3. Yes, relu is preferred.

4. You're referring to double q networks, although this helps, optimality bounds are even better.

You'll learn eventually, don't worry. Cheers and welcome to deep learning! ;)

Aeolos · on Jan 30, 2017

> 2. You usually flatten right before the final layers.

The newer trend appears to be fully-convolutional networks even for classification, since they appear to overfit less, compared to flattening+dropout.

deepnotderp · on Jan 30, 2017

Yup, but he asked me about where to use flatten, not whether to use flatten. But you're right, fully convolutional is the way to go in classification.

bthornbury · on Jan 30, 2017

You said it. I've been working on some ConvNets for object localization in an image over the past couple weeks and it took days to figure out why my network just seemed to be randomly guessing (50% accuracy).

In the end, it was a reduction of the training rate (with SGD) that made things work in what felt like magic.

I've started reading the deep learning text book from Ian Goodfellow now (http://www.deeplearningbook.org). Hoping a solid foundation will build some intuitions for reasoning about these hyper parameters.

deepnotderp · on Jan 30, 2017

Haha, there's no intuition behind these. You should still read the book, but because of other reasons.

dweekly · on Jan 30, 2017

Let me know if you'd like an SSH login on a box with a Titan X Pascal and CUDA stack already installed. Happy to lend a hand and a few teraflops.

lateguy · on Jan 30, 2017

Hey,

Just curious is this offer for other people also? I am working on this https://github.com/deependersingla/deep_portfolio (first version open sourced) and before open-sourced this https://github.com/deependersingla/deep_trader (255 stars). I was using GTX 980-TI locally but system crashed two weeks ago. I am basically trying to find a RL agent which automatically optimize momentum strategy look_back period and generate portfolio according to optimisation one set on reward function. More can be read on project Readme. Thanks.

_pastel · on Jan 30, 2017

I wonder if some of your difficulties are related to the particular structure of Reversi.

The obvious heuristic in the endgame - maximize the number of your pieces - is actually reversed in the opening and midgame, where you want to minimize your number of pieces (other considerations like edge play and parity being equal). So it's possible your NN managed to do worse than random because it learned an endgame heuristic and generalized it improperly.

If so, you may want to consider explicitly wiring in the turn counter as an input.

thomasahle · on Jan 30, 2017

Doesn't reversi have a number of stones exactly equal to the turn? It seems strange if a network wouldn't be able to pick that up..

Al-Khwarizmi · on Jan 30, 2017

Another post of the "If it makes you feel any better" type: I'm a relatively established researcher in NLP, having worked with a variety of methods from theoretical to empirical, publishing in the top venues with decent frequency, and still I'm having a really hard time to get into the deep learning (DL) stuff.

I'm training a sequence-to-sequence model and have been tuning hyperparameters for the last 2-3 months. I'm making progress, but painfully slowly due to the large time it takes to train and test models (I have a local Titan X and some Tesla K80's in a remote cluster, to which I can send models expecting a latency of 3-4 days of queue and a throughput of around 4 models running simultaneously on average - probably more than many people can get, but still feels slow for this purpose) and the fact that hyperparameter optimization seems to be little more that blind guessing with some very rough rules of thumb. The randomness also doesn't help, as running the same model with different random seeds I have noticed that there is huge variance in accuracy. So sometimes I tweak a parameter and get improvements, but who knows if they are significant or just luck with the initialization. I would have to run every experiment with a bunch of seeds to be sure, but that would mean waiting even more for results and my research would be old before I got to state of the art accuracy.

Maybe I'm just not good at it and I'm a bit bitter, but my feeling is that this DL revolution is turning research in my area from a battle of brain power and ingenuity to a battle of GPU power and economic means (in fact my brain doesn't work much in this research project, as it spends most of the time waiting for results for some GPU - fortunately I have a lot of other non-DL research to do in parallel so the brain doesn't get bored). In the same line, I can't help but notice that most of the top DL NLP papers come from a very select few institutions with huge resources (even though there are heroic exceptions). This doesn't happen as much with non-DL papers.

Good thing that there is still plenty of non-DL research to do, and if DL takes over the whole empirical arena, I'm not bad at theoretical research...

Aeolos · on Jan 30, 2017

> The randomness also doesn't help, as running the same model with different random seeds I have noticed that there is huge variance in accuracy.

Set all your random seeds to something predefined, such as 42. Even though the exact randomness is OS-specific, this will at least rule out lucky runs from real hyperparameter improvements.

Al-Khwarizmi · on Jan 30, 2017

I alredy do that for reproducibility reasons, but I don't really think it takes luck out of the equation. 42 may be a great seed for a model with 400 cells per layer and a terrible seed for a model with 600 cells per layer, as the different layout will lead to a totally different distributions of the weights even if the seed remains the same.

Aeolos · on Jan 30, 2017

Indeed, but if performance is affected so much by the initialisation, then I would avoid random initialisation in the first place. There are various publications exploring different initialisation methods for various problems.

I'm afraid that I cannot go into more specific details right now, but you can get more stable training and faster convergence with a better initialisation strategy.

yorwba · on Jan 30, 2017

Couln't you use an initialization pattern that includes all the weights of the smaller layer in the larger layer? This would keep the behavior of a subset of units exactly the same, at least at initialization time.

cwilkes · on Jan 30, 2017

Maybe I'm missing something here but if the problem is that different starting positions return different answers when they should all converge to the same one -- doesn't that mean that there's a fundamental problem with the actions being taken?

Fixing the starting randomness seems to be like that old adage of a man with two clocks doesn't know what time it is so he throws away one of them.

Al-Khwarizmi · on Jan 31, 2017

There is a fundamental problem in general (not in my particular approach) which is that we don't know how to do non-convex optimization. There are many problems where it's just not possible with our current techniques to know if a minimum is local or global.

Of course, with better tuning one can obtain better optima in general (that's what the whole field is doing) and it's possible that I'm not applying the best techniques and I would get less randomness if I had a better model. But as far as I understand, even the best models can converge to different local optima depending on initialization.

Xcelerate · on Jan 30, 2017

You can always apply for compute time on a large research cluster. It's rare to not be associated with a university or national lab, but I don't think there's anything prohibiting an individual from getting time with a good enough proposal application.

anigbrowl · on Jan 30, 2017

This reminds me a bit of trying to make 'noodles' on modular synthesizers - a term for a piece that keeps playing and evolving by itself without further tweaking or patching, and preferably without falling into a simple periodic attractor. It's more of an intuitive skill than a science, though I've found Integrated Information Theory useful and relevant.

gambler · on Jan 30, 2017

A cynic in me says this is one of the reasons large companies like deep learning so much. Unlike some other branches of AI, deep learning is not something that can be easily used by individuals or smaller companies. It does not scale down very well if at all. So it makes a great basis for some expensive cloud service. They can develop it without fear that someone new will use it to disrupt their business.

tossaway322 · on Jan 30, 2017

This.

Why the broadspread interest in a technology so unpredictable, whose payoff is so little and that requires so much investment in hardware/developer time? Wouldn't you be better off learning tools you can understand and that can be used to build reliable and predictable programs?

Why not warn graduate students:

"You'll be working on this project for about two years. We don't know what to tell you about how to solve the problems involved. We don't have any good general guidelines; this field is changing all the time; nobody knows how these things work except in very broad terms (there is no explanatory power). Sometimes it takes years to find something useful. Luck is the key - if you're unlucky, you're screwed."

"Nothing useful will come of your work when you're done. At completion you'll be dumber than when you started, because you will not have learned anything useful except possibly 'patience', a trait not valued in our field and sometimes viewed as equivalent to 'stupidity' or 'stubbornness'. You will, as a result of working with deep learning/NN, forget that sometimes you must cut your losses and quit exploring a particular solution path. At the end, you _may_ get a Master's degree, if you can show you made a good effort or even some progress. Three months after graduation, everything you have done will be obsolete and no firm will hire you: those familiar with NN because your knowledge is obsolete; those not in NN because they see no particular value in your training."

dlss · on Jan 30, 2017

> At one point after days of training, my agent actually had a 90% LOSE rate against an opponent that played totally randomly. To this day I am baffled by this.

This is indeed baffling. My best guess: perhaps your error function was reversed? Getting to 90% loss sounds like it would require training.

raverbashing · on Jan 30, 2017

I tried the Cats/Dogs exercise on Kaggle

Getting TS to work was hard. Ended up opting for Keras/Theano, TS would be slower for the same problems CPU only

Then getting all tensor shapes to "fit together", to run, then to converge, getting the right non-linearity for each layer, then finding out overfitting is a B! with a capital B

It is hard

malandrew · on Jan 30, 2017

For all of you replying to this comment, what recommendations would you give for someone wanting to learn all this stuff?

general_ai · on Jan 30, 2017

Doing anything large on a machine without CUDA is a fool's errand these days. Get a GTX1080 or if you're not budget constrained, get a Pascal-based Titan. I work in this field, and I would not be able to do my job without GPUs -- as simple as that. You get 5-10x speedup right off the bat, sometimes more. A very good return on $600, if you ask me.

verbify · on Jan 30, 2017

If you're budget constrained, a cheaper card will still get you massive improvements. I'm on a gtx 970, and it far outstrips the cpu. Even a gtx 650 (about £65) should outperform a cpu.

kuschku · on Jan 30, 2017

For a student doing this in their free time, 600$ can be a huge sum, usually two months rent.

That’s not easily paid.

ssalazar · on Jan 30, 2017

Buddy up with a professor/grad student/research group who will let you steal some cycles.

kuschku · on Jan 30, 2017

That’s just so simple, eh? Except, if hundreds of students and research groups require cycles, and the combined computer cluster of several universities is already booked out for months (and wouldn’t even support TensorFlow in the first place, so I’d have to write my own system or port everything to the systems that support supercomputer architectures).

general_ai · on Jan 30, 2017

Spending 10 months instead of 1 or 2 and not getting anywhere is also not free.

kuschku · on Jan 30, 2017

The problem is that as a student I just can not, in any way, get the money for a 1080. The choice is spending 10 months, or not even starting.

general_ai · on Jan 30, 2017

[flagged]

grzm · on Jan 30, 2017

Please don't snipe at people. If you don't have something constructive and civil to say, please just don't comment.

general_ai · on Jan 30, 2017

I didn't snipe. This is life advice. Get a part time job. If you're in North America (Canada or US) and you can't squeeze out $400-600 from your budget over the period of a year, you're making excuses, pure and simple. Cook at home, drop cable subscription, don't go to Starbucks, do part time jobs, and so on and so forth. That's what students did back when I was one.

kuschku · on Jan 30, 2017

I’m in Germany, have a part time job, cook at home, have no cable subscription, don’t go to starbucks, buy food if possible at ALDI.

Rent has been going up every year, but wages haven’t, so by now for me, rent for a small apartment is over 60% of my monthly income.

The world has changed quite a bit since you were a student.

kevinwang · on Jan 30, 2017

Might not be applicable for you, but maybe see if you can get a grant from your school? I'm going to try that this quarter to see if I can get some gpus

general_ai · on Jan 30, 2017

My student years were in Moscow in early 90's amid chaos and hyperinflation. Even there, I would be able to scrape together a few hundred bucks for something that's critical to my academic success. Get a roommate, maybe? Find a better paying part time job? I'm having a hard time imagining someone who can program and who writes papers about machine learning not being able to find a decent paying part time job anywhere in the world.

kuschku · on Jan 30, 2017

> for something that's critical to my academic success

Well, that’s the problem. It’s not. I can pass this and get the same grade no matter if I have good results or not. With CPU training, the paper just ends up ignored.

If it was critical for my success, I’d definitely find a way to do it, but that’s the point, it’s not critical – it would just improve my results.

general_ai · on Jan 30, 2017

You're confusing your grade with academic success. Those two things are related, but not the same. If this is something you want to pursue, go after it. Grades matter, but they're not the only thing that matters.

kuschku · on Jan 31, 2017

I’m still in undergrad/studying for B.Sc., so it’s not like I’m expected to do actual science, but you’re right indeed.

gtani · on Jan 31, 2017

There was a thread recently, maybe on reddit, about a facebook group or somewhere people give away ec2 or azure credits to people who want to do HPC or deep learning or whatever. But i can't google it.

kuschku · on Jan 30, 2017

I just spent 3 months CPU-training a network for a paper I was writing, if I had any other options, you can be sure, I’d have used them.

michaelgrosner2 · on Jan 30, 2017

Or set up an EC2 GPU unit - spot prices are usually in the sub $.20/hour range.

reubenmorais · on Jan 30, 2017

EC2 GPUs are slower to train than local hardware and more expensive long term. The upside is being able to scale much more easily, but I'd definitely recommend a good consumer grade GPU over EC2 if you're planning on using it for months as opposed to days.

general_ai · on Jan 30, 2017

They can also be unceremoniously preempted in the middle of your week long training run.

solomatov · on Jan 30, 2017

And why Pascal based Titan? Is it the best investment in terms of performance per $ spent?

Also, how cost effective is it to use cloud GPUs for real world machine learning?

nl · on Jan 30, 2017

Cloud GPUs are cost effective if you need to either fine-tune a pretrained network (eg, use pretrained ResNet/VGG/AlexNet for custom classes, ie[1]) or for inference, or if you don't want the upfront costs.

A 4GB GTX1050 is ~$180. A p2 instance on Amazon is $0.9/hour. The cost effectiveness depends on if you have a PC already.

[1] https://blog.keras.io/building-powerful-image-classification...

dTal · on Jan 30, 2017

And the cost of your electricity, don't forget.

robotresearcher · on Jan 30, 2017

Current mean cost to US domestic consumers of $0.12 kW/h, says Google.

TPD Specs: GTX 1050 75W GTX 1080 180W Titan 250W

Triple these numbers for the overall machine's PSU spec.

So even the Titan costs $0.03 per hour to run, or maybe $0.10 if the rest of the machine is flat-out.

edit: $0.19 kWh for San Francisco residents. Domestic rates.

modeless · on Jan 30, 2017

Titan XP is the maximum single chip performance you can buy right now. $1200 is well worth it if ML is part of your career. The time saved will pay for it.

general_ai · on Jan 30, 2017

Cloud GPUs are not economical if you use them 24x7x365 (which for any serious deep learning researcher or engineer is usually the case). The only scenario I can think of in which they'd be more economical than something under your desk is when you need to run a massive and embarrassingly parallel workload. I.e. try training dozens of models at the same time with different hyperparameters, and run that for a few days. You could do it cheaper, but it would take a long time and it would be a massive pain in the ass, so you pay the pretty penny and get it done in a week.

For my needs I have a machine with a 2011-v3 socket, and four GTX1080 GPUs. Warms up my man cave pretty nicely in winter. I also have access to about a hundred GPUs (older Titans, Teslas, newer 1080s and Pascal Titans) at work that I share with others.

Now, regarding Titan. Titan is actually not that much faster than GTX1080, so in terms of raw speed there's no reason to pay twice as much. BUT, it has 4GB more RAM, which lets you run larger models. NVIDIA rightly decided that for a $100+/hr deep learning researcher $600 is not going to be that big of a deal, and priced the card accordingly. If your models fit into 8GB, you'll be better off buying two 1080's instead.

As to me, I'm thinking of replacing at least one of my 1080s with a Titan, to be able to train larger models.

On a purely TFLOP or even TFLOP/watt basis, it doesn't currently make sense to buy anything that doesn't run Pascal.

jamesk_au · on Jan 30, 2017

"It is our goal to train a trillion-parameter model on a trillion-word corpus. We have not scaled our systems this far as of the writing of this paper, but it should be possible by adding more hardware."

aabajian · on Jan 30, 2017

I'm excited for the applications of deep learning in radiology. I'll be starting radiology residency in a couple years (intern year first).

Can someone explain whether this solves the problem of identifying multiple pathologies within the same image? For example, if a patient has pneumonia and a pleural effusion, could this method activate two subnetworks and come up with a consensus diagnosis based on both those networks? Or does only one pathway get activated at a time?

barbolo · on Jan 30, 2017

Sure you can. The term you are looking for is "multi task learning".

sjg007 · on Jan 30, 2017

Theoretically. You can have two models Kane trained on each pathology or one model trained on examples of both in the same image. But this poses a neat mixture of experts problem instrinsic to diagnosis.

ben_mann · on Jan 30, 2017

Impressive! With this work they managed to increase quality for most language pairs while halving runtime computational cost and decreasing training time. Since it's a gerneric component, I can see it getting used on all sorts of problems with lots of data.

Especially interesting given that most Kaggle competitions are won with mixture of experts models.

meow_mix · on Jan 30, 2017

More parameters is very exciting! I wonder if anyone here has read about what the relationship between the number of parameters on the network and the size of the dataset used on the network? Wouldn't such a large number of parameters risk serious overfitting by giving the model the capacity to fit to exact samples?

nshazeer · on Jan 30, 2017

I am one of the authors of "Outrageously Large Neural Networks". Yes - overfitting is a problem. We employed dropout to combat overfitting. Even with dropout, we found that adding additional capacity provides diminishing returns once the capacity of the network exceeds the number of examples in the training data (see sec. 5.2). To demonstrate significant gains from really large networks, we had to use huge datasets, up to 100 billion words.

petra · on Jan 30, 2017

Impressive work, mate!

Does mixture-of-experts works well the other way around, as a way to minimize power and hardware in common sized problems ?

And would it work in low resolution networks, like BinaryConnect ?

jamesblonde · on Jan 30, 2017

A good rule of thumb is 1 sample per parameter. So if you have 100m parameters, you should have 100m samples.

jureso · on Jan 30, 2017

Based on our current theoretical understanding the we should expect overfitting. In practice however, a very large number of parameters does not necessarily lead to overfitting. Clearly there is a gap between out theoretical understanding and practice. Check out a recent paper that explores exactly this questions:

https://arxiv.org/abs/1611.03530