While it's useful to have this kind of info, IMHO it's still far from 'infrastructure for deep learning'. What about model versioning? What about deployment environments? We need to address the whole lifecycle, not just the 'training' bit. This is a huge and underserved part of the problem because people tend to be satisfied with having one model that's good enough to publish.
Indeed, deployment is a whole set of interesting issues. We haven't deployed any learned models in production yet at OpenAI, so it's not at the top of our list.
If the data and models were small and training was quick (on the order of compilation time), I'd just keep the training data in git and train the model from scratch every time I run make. But the data is huge, training requires clusters of machines and can take days, so you need a pipeline.
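If it helps, the staleness check a pipeline buys you is conceptually tiny; here's a rough Python sketch (the file names and the train.py entry point are made up, not any particular tool):

    import hashlib, json, os, subprocess

    def digest(path):
        # hash the training data so we can tell when it changed
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return h.hexdigest()

    def maybe_retrain(data_path, code_rev, stamp='last_train.json'):
        # skip the multi-day cluster job unless the data or code actually changed
        current = {'data': digest(data_path), 'code': code_rev}
        if os.path.exists(stamp) and json.load(open(stamp)) == current:
            return
        subprocess.check_call(['python', 'train.py', '--data', data_path])  # stand-in training entry point
        json.dump(current, open(stamp, 'w'))

The hard part is everything around this: the cluster scheduling, the days-long runs, and tracking which artifact came from which data/code pair.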
CTO of Algorithmia here. We've spent a lot of time thinking about the issues of deploying deep learning models. There are a whole set of challenges that crop up when trying to scale these kinds of deployments (not least of which is trying to manage GPU memory).
It would be interesting to compare notes since we have deployed a number of models in production, and seem to focus on a related but different set of challenges. kenny at company dot com.
Have you tried Sacred[1]? It definitely doesn't answer the "infrastructure for deep learning" challenge, but it is helpful for understanding what experiments have been run and where a given model came from (including which version of the code and parameters produced it).
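For anyone who hasn't seen it, basic usage looks roughly like this (from memory, so treat the details as approximate):

    from sacred import Experiment
    from sacred.observers import FileStorageObserver

    ex = Experiment('mnist_baseline')                         # hypothetical experiment name
    ex.observers.append(FileStorageObserver.create('runs'))   # records config, code version, results

    @ex.config
    def cfg():
        learning_rate = 0.001
        batch_size = 128

    @ex.automain
    def run(learning_rate, batch_size):
        # real training would go here; Sacred logs which config/code produced the result
        print('pretend-training with lr=%s, batch=%s' % (learning_rate, batch_size))
        return 0.42  # stand-in for final validation accuracy

Each run then gets its own directory under runs/ with the config, captured output and a source snapshot, which covers a surprising amount of the "where did this model come from" question.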
So true. I've been doodling some tools to manage all of it. So far I only have git-like approaches to models and Chef-like approaches to infrastructure. I hope to somehow bring it all together into a Docker-like package that can be deployed without much hassle.
You might want to check out Pachyderm -- that is essentially what they are trying to do (analytics infrastructure support; it isn't specific to machine learning):
In terms of deploying trained models, you can probably get away with using TensorFlow Serving and let Kubernetes handle the orchestration and scaling part of the job. I do agree that there is certainly a need to have a tool that glues all these different bits and pieces together for improving the process of taking a model from development to production.
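For the TF Serving hand-off specifically, most of the work is exporting a servable model and pointing a serving container at the directory; a rough sketch with the newer SavedModel API (written from memory, so double-check against the docs):

    import tensorflow as tf

    def export_for_serving(sess, export_dir, input_tensor, output_tensor):
        # writes a SavedModel that a TensorFlow Serving instance can load
        builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
        signature = tf.saved_model.signature_def_utils.predict_signature_def(
            inputs={'x': input_tensor}, outputs={'y': output_tensor})
        builder.add_meta_graph_and_variables(
            sess,
            tags=[tf.saved_model.tag_constants.SERVING],
            signature_def_map={'predict': signature})
        builder.save()

Kubernetes then treats the serving process like any other stateless container, which is where the scaling part gets relatively easy.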
Agreed. A very interesting and thoughtful post, but I think that you are right that OpenAI's primary use cases seem to be (unsurprisingly) academic research and rapid prototyping of new ideas. These emphasize a very different set of problems than, say, deploying something in production or as a service.
Thus, this post seems immensely useful to someone like me (a PhD student, also primarily concerned with exploring new ideas and getting my next conference paper), but I can see how others doing machine learning in the wild or in production might see a lot of questions left unanswered. I, for one, work primarily with health care data from hospital EHRs, and I spend a lot more time on data prep pipelines than folks working with, say, MNIST.
I don't know much about deep learning. Just noticed that there are 40+ upvotes and 0 comments. I propose the HN Bikeshedding effect theory. Take the number of comments and divide it by the number of upvotes.
<0.1 = Too technical even for the HN audience
0.1-1.0 = At the right level for the HN audience
>1.0 = The topic is similar to painting the bike shed.
The high number of upvotes and low comment-to-upvote ratio on deep learning/big data posts is unfortunately accurate.
It's not a problem that HN has topics which are frequently upvoted; topics such as employment and Rust are popular memes.
It is a problem, however, if the upvote-for-the-title crowd upvotes articles which are bad and would not get upvotes if they were about another topic. That's a legit hard problem to solve (what makes a good submission?), unfortunately, but one I've been looking into.
(For clarity, this submission is a good submission, but I've seen quite a few top-ranking HN submissions that are just a bar chart on a controversial topic that is poorly sourced. And linkbait about deep learning tends to get upvotes, but flagged too.)
There's a problem with upvoting on HN: the upvote is also the only saving mechanism available. If I run across some click-baity title but don't have time to read it right now, I will click upvote, but what I really mean by that is "save for later".
Maybe HN just needs to separate bookmarking and upvoting?
... HN has added favorites recently ("favorite" link, on the submission/individual comments page) and thus has both. And apparently really needs a more public changelog.
Have you considered that some posts may simply be a good read? I often don't comment because the post itself did a good job of stating a point. There is no need to reiterate what was said in a post just to get HN points.
That's already how HN works. Articles with too many comments get a huge penalty and get pushed off the front page quickly.
It has the effect of stifling a lot of interesting and important discussions and topics. I make an effort to upvote posts with too many comments to help them.
It's great to see people talking about the infrastructure they use to manage their deep learning workloads.
One area where we've had trouble with other orchestration tools (e.g. Docker Swarm) was in managing resources at anything beyond whole boxes. They are all good at managing CPU/RAM/disk, but we've had trouble with "give this task GPU 2". We had planned to try Mesos (given that we already run it for other things), but it sounds like maybe we should take a harder look at Kubernetes first.
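The crude workaround is masking devices per task with CUDA_VISIBLE_DEVICES, e.g. (train.py here is just a placeholder):

    import os, subprocess

    def launch_on_gpu(cmd, gpu_index):
        # restrict the child process to one physical GPU by hiding the others
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_index))
        return subprocess.Popen(cmd, env=env)

    # "give this task GPU 2":
    launch_on_gpu(['python', 'train.py'], gpu_index=2)

but that's exactly the bookkeeping you'd hope the scheduler would do for you.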
> Like much of the deep learning community, we use Python 2.7
It's unfortunate: so much effort has been spent on bringing tools up to speed with Python 3, yet some groups still insist on dragging their feet. I understand the motivation when we're talking about an established company with a huge legacy code base, but within the research community it's kind of embarrassing.
Python 2.7 is the present and future of scientific computing with Python. If there is one field in which a print statement is a critical feature, it is quick interactive analysis and prototyping. That won't change no matter how much Guido and co want it to change.
4 extra keystrokes. If I could add a statement, I would add 'p' and change my muscle memory to use that. For me it doesn't even need the '>>' syntax, just quick display. Rapid prototyping needs rapid feedback.
Edit: it's not just the quickness of a single statement. It's that the print statements are about 30-50% of the code when you are working this way.
I can see how you'd feel that way if 50% of your code is print statements. As a counterpoint, I write Matlab and Python code day in and day out, and print statements are relatively rare in my code. Same for most of my peers. I'm guessing we work in different research domains, but I can't imagine a good scenario where my research productivity becomes limited by how fast I can type print statements.
To be fair, we've ingested a fair amount of Python 2.7 research code into our Python 3 codebase, and the print statements are the quickest of fixes. There are rarer actual gotchas, but 2to3 catches a fair number of them. We only switched for the machine learning project I'm working on right now as an experiment, but it's surprising how smoothly it's gone.
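For anyone doing the same migration, the print part really is a one-line fix per file:

    from __future__ import print_function  # makes print a function on 2.7 as well

    print('epoch', 3, 'loss', 0.42)  # now valid under both Python 2.7 and 3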
I'm not talking about it being a bug that needs a fix. That is easy enough in existing code. I'm talking about when you are using Python like Matlab or Mathematica: analyzing data and quickly viewing the results or subsets of the results.
You should probably use a Jupyter notebook if you don't already. It's great for exploratory coding like data analysis. And since the last evaluated expression of a code block is automatically displayed, there's no need for a print statement.
> Probably because that's not what it was designed for.
TF was explicitly designed with distributed training in mind (their initial whitepaper and the DistBelief paper that came before it make this clear) -- I don't know how you came to this conclusion.
Usually when people say TF is slow, it turns out they've introduced a serious bottleneck somewhere.
Both points are true: TF is obviously designed for scalable, distributed training, but it is also heavily tied to Google's compute infrastructure (less so all the time, of course, but now it's also being closely tied to Google Cloud). So while I disagree with my colleague* that TF is "slow" or "not designed for distributed training," I support the slightly different (and implicit) argument that there are some settings (often in enterprise, I am learning) where it might not be as good a fit as other frameworks (e.g., DL4J, Caffe, whatever).
* Disclosure: I work with Skymind and contribute to DL4J, and I also use TensorFlow/Theano/keras heavily in my PhD research. I am an equal opportunity framework guy. ;)
The environment for your large clusters is almost certainly different from that of everyone using TF outside of Google. I'm speaking about the problems they'll run into.
Also no - we use data parallelism. I'd maybe check out our new Spark page we put up that explains how parameter averaging works:
http://deeplearning4j.org/spark
Basically we train on raw Spark partitions. Because of the data parallelism we can also use hardware acceleration for our Spark jobs, which means we'll train on GPUs when present.
We use a Spark master as the "parameter server" to handle communication of the parameter updates, with multiple ways of controlling the averaging threshold per iteration, among other things, to minimize communication.
This allows us to do neat things like using cuDNN on a Spark cluster. I'll assume you're not interested in too many details there, but happy to expand if needed.
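To give a flavor anyway, the per-round logic is conceptually just this (a pseudo-Python sketch of parameter averaging in general, not our actual code):

    import numpy as np

    def averaging_round(params, shards, local_sgd):
        # each worker copies the current parameters and runs SGD on its own shard
        # (the part Spark farms out to executors); the driver then averages them
        updated = [local_sgd(np.copy(params), shard) for shard in shards]
        return np.mean(updated, axis=0)

Spark's contribution is doing that map over partitions on the executors, plus the serialization, scheduling and fault tolerance you don't want to write yourself.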
Edit: When you "played" with us - I'm assuming you just cloned our examples and ran us on the desktop? Likely more than a year ago, before our C++ rewrite? I'd be curious to see if you ran us on Spark as well.
Running JVM-based stuff is completely different from running a Python job. I know JVM-based stuff is a bit harder to get up and running because of the dependency clashes (with Spark being JVM-based, etc.).
If you're ever in Scala land, I encourage you to take a look at our Keras port to Scala when it's out. We'll also be running reinforcement learning workloads on Spark as well.
> Also no - we use data parallelism. I'd maybe check out our new Spark page we put up that explains how parameter averaging works: http://deeplearning4j.org/spark
To quote your page:
> Data parallelism shards large datasets and hands those pieces to separate neural networks, say, each on its own core. Deeplearning4j relies on Spark for this, training models in parallel
This isn't the same thing as the TensorFlow distributed training model at all.
No..? I'm not sure how training on several shards at once and averaging the results asynchronously is hyperparameter tuning o_0. We have a whole dedicated library for that called Arbiter.
You are thinking of grid search and the like. We implement grid search and Bayesian optimization on Spark (the latter being closed source).
I didn't say it was exactly the same as TF either. We have been doing this for close to 2 years now. Actually... I'm not sure why it has to be? It's closer to Hogwild and co.
You also never answered my question ;). Not sure what to assume here.
Sorry, I edited. I agree it's not hyperparameter optimisation, more some kind of regularisation thing.
But it isn't the same as what TensorFlow does, and I'd argue it is much closer to my initial characterisation ("can use Spark to coordinate training multiple models in parallel")
Not entirely sure about which question I missed, or what you want to assume. If it is this:
> When you "played" with us - I'm assuming you just cloned our examples and ran us on the desktop? Likely more than a year ago, before our C++ rewrite? I'd be curious to see if you ran us on Spark as well.
Then no. It was around November, and I got the demos working, and attempted to build a custom network.
Edit: Was it the pyspark question? Then yes, but we also use Scala, Java and R (and SQL of course).
Are you comparing 2015 DL4J to 2016 TensorFlow? Your opinion of our framework seems outdated. You're welcome to try us now -- that would actually be a fair comparison.
Curious: When did you play with DL4J and what kind of cluster are you running TF on? You're right to say those are two separate things, so let's compare apples to apples.
For data parallelism, we have a simple wrapper that covers as many GPUs as you want in one box. For more than one box, we use Spark, and the integration is explained here:
We're almost done testing on DGXs now, like the one you have at OpenAI, and once we work a bit more with RDMA-enabled hardware, we'll go Sparkless for that.
That's how we're hoping to get academic HW acceleration into production environments.
In September, we plan to come out with a Scala API inspired by Keras/Torch, which will also share a neural net config file with Keras and allow model imports into DL4J.
We actually are going to be doing some stuff with IBM/NVLink as well as some other neat things I can't announce. We'll be able to benchmark on this front similar to you guys though :).
We'll be doing RDMA for this and plan on writing the code to match that, using Spark for orchestration and data storage/loading.
Right now the main thing we do is data parallelism: training on partitions of the data spread across Spark, with intermittent parameter averaging.
I don't know, but I am curious how many people, in percentage terms, outside Google and Facebook and the likes, need to scale their models to more than 8 GPUs?
If you're paying for CPU/GPU hours, the more you parallelize, the faster you get your result for the same money. (Of course, only until you hit the parallelization limit of your network, but NNs are very parallelizable in general.) And the larger your training dataset, the better your results.
I am aware of that. The question was not whether more than 8 GPUs would be useful in ideal circumstances; it was how many people actually use that functionality in frameworks other than DL4J.
Many of the models people are building here, such as generative image models, take a few days to train (say, 100 hours) on our 4-GPU boxes. Research would be faster if we could train on 400 GPUs in one hour, but the communication bandwidth required makes it hard to scale.
Training data is easy to duplicate and keep share-nothing.
Large models with shared weights get tricky, but less frequent asynchronous updates with schemes like Hogwild seem to work with SGD. I believe TF has support for this too. It won't scale linearly, but might be good enough.
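The Hogwild idea is simple enough to sketch with threads and a shared numpy array (a toy only: the GIL means this isn't actually parallel, but it shows the lock-free update pattern):

    import numpy as np
    import threading

    def hogwild_sgd(X, y, n_workers=4, epochs=5, lr=0.01):
        w = np.zeros(X.shape[1])  # shared parameter vector, updated without locks

        def worker(rows):
            for _ in range(epochs):
                for i in rows:
                    grad = (X[i].dot(w) - y[i]) * X[i]  # squared-loss gradient, one sample
                    w[:] -= lr * grad                   # racy on purpose; sparse updates rarely collide

        chunks = np.array_split(np.random.permutation(len(y)), n_workers)
        threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
        for t in threads: t.start()
        for t in threads: t.join()
        return w

Real implementations spread those racy updates across machines instead of threads, which is where the communication budget starts to bite.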
There's some excitement about synthetic gradients to allow less communication and further parallelism.
The HPC community has certainly leveraged 400-GPU clusters.
It seems like a fun problem, and if you want to focus the resources, there isn't anything insurmountable about utilizing 400 GPUs :)
The DeepMind paper on synthetic gradients is super interesting for training in clusters. I hope it gets built into TF soon.
Also really hope there will be more options for GPU hardware on public clouds.
Eventually the pieces will come together and it will be trivial to deploy 400 cloud GPUs for an hour to run the load your local 4-GPU workstation would spend 100 hours on, but we are definitely not there yet.
We are working on a packaging format called Snappy (http://www.snapcraft.io) and starting to talk to the TF guys about packaging TF with it (Kubernetes is already packaged as a snap) - hopefully this will take some pain away once it's working.
I am aware of that. The question was not whether more than 8 GPUs would be useful in ideal circumstances; it was how many people actually use that functionality in frameworks other than DL4J.
BTW I do not have any connection with DL4J, nor do I use it.
So, a single GTX 1080 deep learning box would come to around $1,500. If you pay $0.70/hr for your cloud server, you should buy once you'd use more than 1500/0.7 ≈ 2142 hours. In other words, if you need more than about 90 days of GPU time, you should probably buy your own box. Of course, if the cloud server is slower than a GTX 1080, then the benefit is multiplied.
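The break-even arithmetic is easy to redo with your own numbers (these defaults are just the figures above):

    def breakeven_hours(box_cost=1500.0, cloud_rate_per_hr=0.70):
        # hours of GPU time at which buying the box beats renting
        return box_cost / cloud_rate_per_hr

    hours = breakeven_hours()
    print('%.0f hours, about %.0f days of continuous use' % (hours, hours / 24))  # ~2143 h, ~89 days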
But ... your own box would not be scalable. You'd still need AWS to speed up training.
AFAIK, AWS is stuck several architecture revisions behind Pascal (most things I see say it's still a Kepler GK104). At a best guess, that 1080 is probably 2x faster than any single GPU AWS instance.