While it's useful to have this kind of info, IMHO it's still far from 'infrastructure for deep learning'. What about model versioning? What about deployment environments? We need to address the whole lifecycle, not just the 'training' bit. This is a huge and underserved part of the problem because people tend to be satisfied with having one model that's good enough to publish.
Indeed, deployment is a whole set of interesting issues. We haven't deployed any learned models in production yet at OpenAI, so it's not at the top of our list.
If the data and models were small and training was quick (on the order of compilation time), I'd just keep the training data in git and train the model from scratch every time I run make. But the data is huge, training requires clusters of machines and can take days, so you need a pipeline.
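If it helps, the staleness check a pipeline buys you is conceptually tiny; here's a rough Python sketch (the file names and the train.py entry point are made up, not any particular tool):

    import hashlib, json, os, subprocess

    def digest(path):
        # hash the training data so we can tell when it changed
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return h.hexdigest()

    def maybe_retrain(data_path, code_rev, stamp='last_train.json'):
        # skip the multi-day cluster job unless the data or code actually changed
        current = {'data': digest(data_path), 'code': code_rev}
        if os.path.exists(stamp) and json.load(open(stamp)) == current:
            return
        subprocess.check_call(['python', 'train.py', '--data', data_path])  # stand-in training entry point
        json.dump(current, open(stamp, 'w'))

The hard part is everything around this: the cluster scheduling, the days-long runs, and tracking which artifact came from which data/code pair.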
CTO of Algorithmia here. We've spent a lot of time thinking about the issues of deploying deep learning models. There are a whole set of challenges that crop up when trying to scale these kinds of deployments (not least of which is trying to manage GPU memory).
It would be interesting to compare notes since we have deployed a number of models in production, and seem to focus on a related but different set of challenges. kenny at company dot com.
Have you tried Sacred[1]? It definitely doesn't answer the "infrastructure for deep learning" challenge, but it is helpful for understanding what experiments have been run and where a given model came from (including which version of the code and parameters produced it).
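For anyone who hasn't seen it, basic usage looks roughly like this (from memory, so treat the details as approximate):

    from sacred import Experiment
    from sacred.observers import FileStorageObserver

    ex = Experiment('mnist_baseline')                         # hypothetical experiment name
    ex.observers.append(FileStorageObserver.create('runs'))   # records config, code version, results

    @ex.config
    def cfg():
        learning_rate = 0.001
        batch_size = 128

    @ex.automain
    def run(learning_rate, batch_size):
        # real training would go here; Sacred logs which config/code produced the result
        print('pretend-training with lr=%s, batch=%s' % (learning_rate, batch_size))
        return 0.42  # stand-in for final validation accuracy

Each run then gets its own directory under runs/ with the config, captured output and a source snapshot, which covers a surprising amount of the "where did this model come from" question.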
So true. I've been doodling some tools to manage all of it. So far I only have git-like approaches to models and Chef-like approaches to infrastructure. I hope to somehow bring it all together into a Docker-like package that can be deployed without much hassle.
You might want to check out Pachyderm -- that is essentially what they are trying to do (analytics infrastructure support; it isn't specific to machine learning):
In terms of deploying trained models, you can probably get away with using TensorFlow Serving and let Kubernetes handle the orchestration and scaling part of the job. I do agree that there is certainly a need to have a tool that glues all these different bits and pieces together for improving the process of taking a model from development to production.
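For the TF Serving hand-off specifically, most of the work is exporting a servable model and pointing a serving container at the directory; a rough sketch with the newer SavedModel API (written from memory, so double-check against the docs):

    import tensorflow as tf

    def export_for_serving(sess, export_dir, input_tensor, output_tensor):
        # writes a SavedModel that a TensorFlow Serving instance can load
        builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
        signature = tf.saved_model.signature_def_utils.predict_signature_def(
            inputs={'x': input_tensor}, outputs={'y': output_tensor})
        builder.add_meta_graph_and_variables(
            sess,
            tags=[tf.saved_model.tag_constants.SERVING],
            signature_def_map={'predict': signature})
        builder.save()

Kubernetes then treats the serving process like any other stateless container, which is where the scaling part gets relatively easy.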
Agreed. A very interesting and thoughtful post, but I think that you are right that OpenAI's primary use cases seem to be (unsurprisingly) academic research and rapid prototyping of new ideas. These emphasize a very different set of problems than, say, deploying something in production or as a service.
Thus, this post seems immensely useful to someone like me (a PhD student, also primarily concerned with exploring new ideas and getting my next conference paper), but I can see how others doing machine learning in the wild or in production might see a lot of questions left unanswered. I, for one, work primarily with health care data from hospital EHRs, and I spend a lot more time on data prep pipelines than folks working with, say, MNIST.
I don't know much about deep learning. Just noticed that there are 40+ upvotes and 0 comments. I propose the HN Bikeshedding effect theory. Take the number of comments and divide it by the number of upvotes.
<0.1 = Too technical even for the HN audience
0.1-1.0 = At the right level for the HN audience
>1.0 = The topic is similar to painting the bike shed.
The high number of upvotes and low comment-to-upvote ratio on deep learning/big data posts is unfortunately accurate.
It's not a problem that HN has topics which are frequently upvoted; topics such as employment and Rust are popular memes.
It is a problem, however, if the upvote-for-the-title crowd upvotes articles which are bad and would not get upvotes if they were about another topic. That's a legit hard problem to solve (what makes a good submission?), unfortunately, but one I've been looking into.
(For clarity, this submission is a good submission, but I've seen quite a few top-ranking HN submissions that are just a bar chart on a controversial topic that is poorly sourced. And linkbait about deep learning tends to get upvotes, but flagged too.)
There's a problem with upvoting on HN: the upvote is also the only saving mechanism available. If I run across some click-baity title but don't have time to read it right now, I will click upvote, but what I really mean by that is "save for later".
Maybe HN just needs to separate bookmarking and upvoting?
... HN has added favorites recently ("favorite" link, on the submission/individual comments page) and thus has both. And apparently really needs a more public changelog.
Have you considered that some posts may simply be a good read? I often don't comment because the post itself did a good job of stating a point. There is no need to reiterate what was said in a post just to get HN points.
That's already how HN works. Articles with too many comments get a huge penalty and get pushed off the front page quickly.
It has the effect of stifling a lot of interesting and important discussions and topics. I make an effort to upvote posts with too many comments to help them.
It's great to see people talking about the infrastructure they use to manage their deep learning workloads.
One area where we've had trouble with other orchestration tools (e.g. Docker Swarm) was in managing resources at anything beyond whole boxes. They are all good at managing CPU/RAM/disk, but we've had trouble with "give this task GPU 2". We had planned to try Mesos (given that we already run it for other things), but it sounds like maybe we should take a harder look at Kubernetes first.
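The crude workaround is masking devices per task with CUDA_VISIBLE_DEVICES, e.g. (train.py here is just a placeholder):

    import os, subprocess

    def launch_on_gpu(cmd, gpu_index):
        # restrict the child process to one physical GPU by hiding the others
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_index))
        return subprocess.Popen(cmd, env=env)

    # "give this task GPU 2":
    launch_on_gpu(['python', 'train.py'], gpu_index=2)

but that's exactly the bookkeeping you'd hope the scheduler would do for you.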
> Like much of the deep learning community, we use Python 2.7
It's unfortunate: so much effort has been spent on bringing tools up to speed with Python 3, yet some groups still insist on dragging their feet. I understand the motivation when we're talking about an established company with a huge legacy code base, but within the research community it's kind of embarrassing.
Python 2.7 is the present and future of scientific computing with Python. If there is one field in which a print statement is a critical feature, it is quick interactive analysis and prototyping. That won't change no matter how much Guido and co want it to change.
4 extra keystrokes. If I could add a statement, I would add 'p' and change my muscle memory to use that. For me it doesn't even need the '>>' syntax, just quick display. Rapid prototyping needs rapid feedback.
Edit: it's not just the quickness of a single statement. It's that the print statements are about 30-50% of the code when you are working this way.
I can see how you'd feel that way if 50% of your code is print statements. As a counterpoint, I write Matlab and Python code day in and day out, and print statements are relatively rare in my code. Same for most of my peers. I'm guessing we work in different research domains, but I can't imagine a good scenario where my research productivity becomes limited by how fast I can type print statements.
To be fair, we've ingested a fair amount of Python 2.7 research code into our Python 3 codebase, and the print statements are the quickest of fixes. There are rarer actual gotchas, but 2to3 catches a fair number of them. We only switched for the machine learning project I'm working on right now as an experiment, but it's surprising how smoothly it's gone.
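For anyone doing the same migration, the print part really is a one-line fix per file:

    from __future__ import print_function  # makes print a function on 2.7 as well

    print('epoch', 3, 'loss', 0.42)  # now valid under both Python 2.7 and 3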
I'm not talking about it being a bug that needs a fix. That is easy enough in existing code. I'm talking about when you are using Python like Matlab or Mathematica: analyzing data and quickly viewing the results or subsets of the results.
You should probably use a Jupyter notebook if you don't already. It's great for exploratory coding like data analysis. And since the last evaluated expression of a code block is automatically displayed, there's no need for a print statement.
> Probably because that's not what it was designed for.
TF was explicitly designed with distributed training in mind (their initial whitepaper and the DistBelief paper that came before it make this clear) -- I don't know how you came to this conclusion.
Usually when people say TF is slow, it turns out they've introduced a serious bottleneck somewhere.
Both points are true: TF is obviously designed for scalable, distributed training, but it is also heavily tied to Google's compute infrastructure (less so all the time, of course, but now it's also being closely tied to Google Cloud). So while I disagree with my colleague* that TF is "slow" or "not designed for distributed training," I support the slightly different (and implicit) argument that there are some settings (often in enterprise, I am learning) where it might not be as good a fit as other frameworks (e.g., DL4J, Caffe, whatever).
* Disclosure: I work with Skymind and contribute to DL4J, and I also use TensorFlow/Theano/keras heavily in my PhD research. I am an equal opportunity framework guy. ;)
The environment for your large clusters is almost certainly different from that of everyone using TF outside of Google. I'm speaking about the problems they'll run into.
Also no - we use data parallelism. I'd maybe check out our new Spark page we put up that explains how parameter averaging works:
http://deeplearning4j.org/spark
Basically we train on raw Spark partitions. Because of the data parallelism we can also use hardware acceleration for our Spark jobs, which means we'll train on GPUs when present.
We use a Spark master as the "parameter server" to handle communication of the parameter updates, with multiple ways of controlling the averaging threshold per iteration, among other things, to minimize communication.
This allows us to do neat things like using cuDNN on a Spark cluster. I'll assume you're not interested in too many details there, but happy to expand if needed.
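To give a flavor anyway, the per-round logic is conceptually just this (a pseudo-Python sketch of parameter averaging in general, not our actual code):

    import numpy as np

    def averaging_round(params, shards, local_sgd):
        # each worker copies the current parameters and runs SGD on its own shard
        # (the part Spark farms out to executors); the driver then averages them
        updated = [local_sgd(np.copy(params), shard) for shard in shards]
        return np.mean(updated, axis=0)

Spark's contribution is doing that map over partitions on the executors, plus the serialization, scheduling and fault tolerance you don't want to write yourself.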
Edit: When you "played" with us - I'm assuming you just cloned our examples and ran us on the desktop? Likely more than a year ago, before our C++ rewrite? I'd be curious to see if you ran us on Spark as well.
Running JVM-based stuff is completely different from running a Python job. I know JVM-based stuff is a bit harder to get up and running because of the dependency clashes (with Spark being JVM-based, etc.).
If you're ever in Scala land, I encourage you to take a look at our Keras port to Scala when it's out. We'll also be running reinforcement learning workloads on Spark as well.
> Also no - we use data parallelism. I'd maybe check out our new Spark page we put up that explains how parameter averaging works: http://deeplearning4j.org/spark
To quote your page:
> Data parallelism shards large datasets and hands those pieces to separate neural networks, say, each on its own core. Deeplearning4j relies on Spark for this, training models in parallel
This isn't the same thing as the TensorFlow distributed training model at all.
No..? I'm not sure how training on several shards at once and averaging the results asynchronously is hyperparameter tuning o_0. We have a whole dedicated library for that called Arbiter.
You are thinking of grid search and the like. We implement grid search and Bayesian optimization on Spark (the latter being closed source).
I didn't say it was exactly the same as TF either. We have been doing this for close to 2 years now. Actually... I'm not sure why it has to be? It's closer to Hogwild and co.
You also never answered my question ;). Not sure what to assume here.
Sorry, I edited. I agree it's not hyperparameter optimisation, more some kind of regularisation thing.
But it isn't the same as what TensorFlow does, and I'd argue it is much closer to my initial characterisation ("can use Spark to coordinate training multiple models in parallel")
Not entirely sure about which question I missed, or what you want to assume. If it is this:
> When you "played" with us - I'm assuming you just cloned our examples and ran us on the desktop? Likely more than a year ago, before our C++ rewrite? I'd be curious to see if you ran us on Spark as well.
Then no. It was around November, and I got the demos working, and attempted to build a custom network.
Edit: Was it the pyspark question? Then yes, but we also use Scala, Java and R (and SQL of course).
Are you comparing 2015 DL4J to 2016 TensorFlow? Your opinion of our framework seems outdated. You're welcome to try us now -- that would actually be a fair comparison.
Curious: When did you play with DL4J and what kind of cluster are you running TF on? You're right to say those are two separate things, so let's compare apples to apples.
For data parallelism, we have a simple wrapper that covers as many GPUs as you want in one box. For more than one box, we use Spark, and the integration is explained here:
We're almost done testing on DGXs now, like the one you have at OpenAI, and once we work a bit more with RDMA-enabled hardware, we'll go Sparkless for that.
That's how we're hoping to get academic HW acceleration into production environments.
In September, we plan to come out with a Scala API inspired by Keras/Torch, which will also share a neural net config file with Keras and allow model imports into DL4J.
We actually are going to be doing some stuff with IBM/NVLink as well as some other neat things I can't announce. We'll be able to benchmark on this front similar to you guys though :).
We'll be doing RDMA for this and plan on writing the code to match that, using Spark for orchestration and data storage/loading.
Right now the main thing we do is data parallelism: training on partitions of the data spread across Spark, with intermittent parameter averaging.
I don't know, but I am curious how many people, in percentage terms, outside Google and Facebook and the likes, need to scale their models to more than 8 GPUs?
If you're paying for CPU/GPU hours, the more you parallelize, the faster you get your result for the same money. (Of course, only until you hit the parallelization limit of your network, but NNs are very parallelizable in general.) And the larger your training dataset, the better your results.
I am aware of that. The question was not whether more than 8 GPUs would be useful in ideal circumstances; it was how many people actually use that functionality in frameworks other than DL4J.
Many of the models people are building here, such as generative image models, take a few days to train (say, 100 hours) on our 4-GPU boxes. Research would be faster if we could train on 400 GPUs in one hour, but the communication bandwidth required makes it hard to scale.
Training data is easy to duplicate and keep share-nothing.
Large models with shared weights get tricky, but less frequent asynchronous updates with schemes like Hogwild seem to work with SGD. I believe TF has support for this too. It won't scale linearly, but might be good enough.
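The Hogwild idea is simple enough to sketch with threads and a shared numpy array (a toy only: the GIL means this isn't actually parallel, but it shows the lock-free update pattern):

    import numpy as np
    import threading

    def hogwild_sgd(X, y, n_workers=4, epochs=5, lr=0.01):
        w = np.zeros(X.shape[1])  # shared parameter vector, updated without locks

        def worker(rows):
            for _ in range(epochs):
                for i in rows:
                    grad = (X[i].dot(w) - y[i]) * X[i]  # squared-loss gradient, one sample
                    w[:] -= lr * grad                   # racy on purpose; sparse updates rarely collide

        chunks = np.array_split(np.random.permutation(len(y)), n_workers)
        threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
        for t in threads: t.start()
        for t in threads: t.join()
        return w

Real implementations spread those racy updates across machines instead of threads, which is where the communication budget starts to bite.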
There's some excitement about synthetic gradients to allow less communication and further parallelism.
The HPC community has certainly leveraged 400-GPU clusters.
It seems like a fun problem, and if you want to focus the resources, there isn't anything insurmountable about utilizing 400 GPUs :)
The DeepMind paper on synthetic gradients is super interesting for training in clusters. I hope it gets built into TF soon.
Also really hope there will be more options for GPU hardware on public clouds.
Eventually the pieces will come together and it will be trivial to deploy 400 cloud GPUs for an hour to run the load your local 4-GPU workstation would spend 100 hours on, but we are definitely not there yet.
We are working on a packaging format called Snappy (http://www.snapcraft.io) and starting to talk to the TF guys about packaging TF with it (Kubernetes is already packaged as a snap) - hopefully this will take some pain away once it's working.
I am aware of that. The question was not whether more than 8 GPUs would be useful in ideal circumstances; it was how many people actually use that functionality in frameworks other than DL4J.
BTW I do not have any connection with DL4J, nor do I use it.
So, a single GTX 1080 deep learning box would come to around $1,500. If you pay $0.70/hr for your cloud server, you should buy once you'd use more than 1500/0.7 ≈ 2142 hours. In other words, if you need more than about 90 days of GPU time, you should probably buy your own box. Of course, if the cloud server is slower than a GTX 1080, then the benefit is multiplied.
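The break-even arithmetic is easy to redo with your own numbers (these defaults are just the figures above):

    def breakeven_hours(box_cost=1500.0, cloud_rate_per_hr=0.70):
        # hours of GPU time at which buying the box beats renting
        return box_cost / cloud_rate_per_hr

    hours = breakeven_hours()
    print('%.0f hours, about %.0f days of continuous use' % (hours, hours / 24))  # ~2143 h, ~89 days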
But ... your own box would not be scalable. You'd still need AWS to speed up training.
AFAIK, AWS is stuck several architecture revisions behind Pascal (most things I see say it's still a Kepler GK104). At a best guess, that 1080 is probably 2x faster than any single GPU AWS instance.