
Tensorflow is actually pretty slow and problematic on large clusters outside the Google Cloud. Probably because that's not what it was designed for.

For Java/Scala people, Deeplearning4j has a pretty sophisticated Spark + GPUs setup:

http://deeplearning4j.org/gpu

http://deeplearning4j.org/spark

http://deeplearning4j.org/spark-gpus

[Disclosure: I help create DL4J, and it's supported by my startup, Skymind.]




> Probably because that's not what it was designed for.

TF was explicitly designed with distributed training in mind (their initial whitepaper and the DistBelief paper that came before it make this clear) -- I don't know how you came to this conclusion.

Usually when people say TF is slow, it turns out they've introduced a serious bottleneck somewhere.


Both points are true: TF is obviously designed for scalable, distributed training, but it is also heavily tied to Google's compute infrastructure (less so all the time, of course, but now it's also being closely tied to Google Cloud). So while I disagree with my colleague* that TF is "slow" or "not designed for distributed training," I support the slightly different (and implicit) argument that there are some settings (often in enterprise, I am learning) where it might not be as good a fit as other frameworks (e.g., DL4J, Caffe, whatever).

* Disclosure: I work with Skymind and contribute to DL4J, and I also use TensorFlow/Theano/keras heavily in my PhD research. I am an equal opportunity framework guy. ;)


To all these Skymind kool-aid drinkers, I won't bother arguing with you. I'll let the tensorflow vs DL4J usage numbers tell the story.

Spoiler: Tensorflow wins.


Spoiler: 95% of them are Udacity students without experience or budgets.


The environment for your large clusters is almost certainly different from the one everyone using TF outside of Google is working in. I'm speaking about the problems they'll run into.


> I'm speaking about the problems they'll run into.

What are these mythical problems you speak of? I'd love to hear some specifics, because I haven't hit them yet.


As someone who runs a production Spark cluster, has a multi-node TF setup, and has played with DL4J, I'd say this is more untrue than true.

Training TF on a cluster is hard because you need to write your model to be cluster-aware. Apart from that it works pretty well.

From my understanding, DL4J can use Spark to coordinate training multiple models in parallel. This is interesting, but not the same thing at all.

Also TF on a cluster is much simpler to get running than Spark+DL4J.


I'm assuming you're using pyspark?

Also no - we use data parallelism. I'd maybe check out our new Spark page we put up that explains how parameter averaging works: http://deeplearning4j.org/spark

We train on raw Spark partitions. Because of the data parallelism, we can also use hardware acceleration for our Spark jobs, which means we'll train on GPUs when they're present.

We use the Spark master as the "parameter server" to handle communication of the parameter updates, with multiple ways of controlling the averaging threshold per iteration, among other things, to minimize communication.

This allows us to do neat things like using cuDNN on a Spark cluster. I'll assume you're not interested in too many details there, but I'm happy to expand if needed.
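
For readers who want to see what that looks like in code, here is a rough Scala sketch against the DL4J Spark API as described on that page. The method name trainOnSpark and the specific numbers (32, 5, 2) are illustrative, and the exact builder options and constructor signatures should be treated as assumptions that can shift between releases - the examples repo is the authoritative reference.

    import org.apache.spark.SparkContext
    import org.apache.spark.api.java.JavaRDD
    import org.deeplearning4j.nn.conf.MultiLayerConfiguration
    import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer
    import org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingMaster
    import org.nd4j.linalg.dataset.DataSet

    // conf: an ordinary MultiLayerConfiguration built with NeuralNetConfiguration.Builder
    // trainingData: a JavaRDD[DataSet], one DataSet per (features, labels) chunk
    def trainOnSpark(sc: SparkContext,
                     conf: MultiLayerConfiguration,
                     trainingData: JavaRDD[DataSet]): SparkDl4jMultiLayer = {
      // The TrainingMaster controls the parameter averaging: how many examples
      // each worker consumes per step and how often worker parameters are
      // averaged back on the master.
      val tm = new ParameterAveragingTrainingMaster.Builder(32) // examples per DataSet object in the RDD
        .batchSizePerWorker(32)      // minibatch size on each executor
        .averagingFrequency(5)       // average parameters every 5 minibatches
        .workerPrefetchNumBatches(2) // async prefetch on the workers
        .build()

      val sparkNet = new SparkDl4jMultiLayer(sc, conf, tm)
      sparkNet.fit(trainingData)     // each partition trains locally; the master averages
      sparkNet
    }

The averagingFrequency knob is the "averaging threshold per iteration" mentioned above: averaging more often means more network traffic, averaging less often means the replicas drift further apart between syncs.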

Edit: When you "played" with us - I'm assuming you just cloned our examples and ran us on the desktop? Likely more than a year ago before our c++ rewrite? I'd be curious to see if you ran us on spark as well.

Running JVM-based stuff is completely different from running a Python job. I know JVM-based stuff is a bit harder to get up and running because of dependency clashes (with Spark being JVM-based, etc.).

If you're ever in Scala land, I encourage you to take a look at our Keras port to Scala when it's out. We'll also be running reinforcement learning workloads on Spark.


Not sure where the downvote came from - maybe I can explain a bit.

If it was for my last part: I'm actually asking for clarification, because we've often discovered this when people say they've tried us.

A lot of folks don't spend too much time on a framework, and I don't blame them. They're looking for a tool to get the job done and move on 99% of the time. That's more than fair. The bulk of these folks also tend to be people who are mildly curious, coming from Python, where most deep learning practitioners are.

That being said: I was serious about each node being independent for Spark. Our linear algebra library runs as a Spark job on each executor/slave node. We've built in our own resource management via JavaCPP that handles memory allocation for the CUDA buffers as well as cuDNN. The Spark driver then tracks the coefficients across RDD partitions, which allows us to synchronize updates.

Look out for a blog post from us on Parallel Forall (NVIDIA's blog) soon that explains how this works.
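
For anyone curious what "tracking the coefficients across RDD partitions" means mechanically, here is a deliberately tiny illustration in plain Spark (Scala). It is not DL4J's actual implementation - the function name averagingRound and the toy linear model are mine - it just strips one averaging round down to its essentials: broadcast the current weights, let each partition take local SGD steps, and average the resulting weight vectors on the driver.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // data: an RDD of (features, label) rows; params: current weights on the driver
    def averagingRound(sc: SparkContext,
                       data: RDD[(Array[Double], Double)],
                       params: Array[Double],
                       lr: Double): Array[Double] = {
      val bcast = sc.broadcast(params)                 // ship current weights to every executor

      // Each partition takes local SGD steps on its own shard of the data...
      val perPartition: RDD[Array[Double]] = data.mapPartitions { rows =>
        val w = bcast.value.clone()
        rows.foreach { case (x, y) =>
          val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y  // prediction error
          for (i <- w.indices) w(i) -= lr * err * x(i)                 // least-squares gradient step
        }
        Iterator.single(w)
      }

      // ...and the driver averages the per-partition weight vectors before the next round.
      perPartition.cache()
      val n = perPartition.count().toDouble
      val avg = perPartition
        .reduce((a, b) => a.zip(b).map { case (s, t) => s + t })
        .map(_ / n)
      perPartition.unpersist()
      avg
    }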


> Also no - we use data parallelism. I'd maybe check out our new Spark page we put up that explains how parameter averaging works: http://deeplearning4j.org/spark

To quote your page:

> Data parallelism shards large datasets and hands those pieces to separate neural networks, say, each on its own core. Deeplearning4j relies on Spark for this, training models in parallel

This isn't the same thing as the TensorFlow distributed training model at all.


No..? I'm not sure how training on several shards at once and averaging the results asynchronously is hyperparameter tuning o_0. We have a whole dedicated library for that called Arbiter.

You are thinking of grid search and the like. We implement grid search and Bayesian optimization on Spark (the latter being closed source).

I didn't say it was exactly the same as TF either. We have been doing this for close to two years now. Actually.. I'm not sure why it has to be? It's closer to Hogwild and co.

You also never answered my question ;). Not sure what to assume here.


Sorry, I edited. I agree it's not hyperparameter optimisation, more some kind of regularisation thing.

But it isn't the same as what TensorFlow does, and I'd argue it is much closer to my initial characterisation ("can use Spark to coordinate training multiple models in parallel")

Not entirely sure about which question I missed, or what you want to assume. If it is this:

> When you "played" with us - I'm assuming you just cloned our examples and ran us on the desktop? Likely more than a year ago before our c++ rewrite? I'd be curious to see if you ran us on spark as well.

Then no. It was around November, and I got the demos working, and attempted to build a custom network.

Edit: Was it the pyspark question? Then yes, but we also use Scala, Java and R (and SQL of course).


Are you comparing 2015 DL4J to 2016 TensorFlow? Your opinion of our framework seems outdated. You're welcome to try us now -- that would actually be a fair comparison.


Curious: When did you play with DL4J and what kind of cluster are you running TF on? You're right to say those are two separate things, so let's compare apples to apples.


You should disclose that Deeplearning4j is your startup.


Done!


How does DL4J training scale across >8 GPUs?


Hi Trevor - Nice to see you here. :)

For data parallelism, we have a simple wrapper that covers as many GPUs as you want in one box. For more than one box, we use Spark, and the integration is explained here:

http://deeplearning4j.org/spark

The description of multi-GPU support is at http://deeplearning4j.org/gpu under the subhead "Multi-GPU data parallelism".

The code is here: https://github.com/deeplearning4j/deeplearning4j/blob/77b836...
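
For reference, the single-box multi-GPU path looks roughly like this from the user's side. This is a Scala sketch against the ParallelWrapper API described on the GPU page; the method name trainMultiGpu and the specific numbers are illustrative, and the builder option names are from memory and may vary by release.

    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
    import org.deeplearning4j.parallelism.ParallelWrapper
    import org.nd4j.linalg.dataset.api.iterator.DataSetIterator

    // model: an already-initialized MultiLayerNetwork; trainData: any DataSetIterator
    def trainMultiGpu(model: MultiLayerNetwork, trainData: DataSetIterator): Unit = {
      val wrapper = new ParallelWrapper.Builder(model)
        .prefetchBuffer(24)              // async minibatch prefetch per device
        .workers(4)                      // roughly one worker per GPU in the box
        .averagingFrequency(3)           // average replica parameters every 3 minibatches
        .reportScoreAfterAveraging(true)
        .build()

      // Each worker holds a replica of the model on its own device; parameters
      // are averaged across the replicas as training proceeds.
      wrapper.fit(trainData)
    }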

We're almost done testing on DGX's now, like the one you have at OpenAI, and once we work a bit more with RDMA-enabled hardware, we'll go Sparkless for that.

That's how we're hoping to get academic HW acceleration into production environments.

In September, we plan to come out with a Scala API inspired by Keras/Torch, which will also share a neural net config file with Keras and allow model imports into DL4J.


Hi Trevor,

We're actually going to be doing some stuff with IBM/NVLink, as well as some other neat things I can't announce yet. We'll be able to benchmark on this front similarly to you guys, though :).

We'll be doing RDMA for this and plan on writing the code to match, using Spark for orchestration and data storage/loading.

Right now the main thing we do is data parallelism: training on Spark partitions of the data with intermittent parameter averaging.

Other than that, we have multi-GPU settings. We've made it pretty configurable: http://deeplearning4j.org/gpu

Admittedly, we'll continue to do more work in this area.

So far fp16 has been pretty nice though :).


I don't know, but I am curious how many people, in percentage terms, outside Google and Facebook and the like, need to scale their models to more than 8 GPUs.


If you're paying for CPU/GPU hours, the more you parallelize, the faster you get your result for the same money (at least until you hit the parallelization limit of your network, but NNs are very parallelizable in general). And the larger your training dataset, the better your results.
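
(To put numbers on it: at a fixed price per GPU-hour, 8 GPUs for 10 hours and 80 GPUs for 1 hour both cost 80 GPU-hours, so with near-linear scaling you pay the same and just wait a tenth as long for the result.)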


I am aware of that. The question was not whether more than 8 GPUs would be useful in ideal circumstances; it was how many people actually use that functionality with frameworks other than DL4J.


It'd be a nice statistic to know. Could be dangerous too, as in the infamous "640 kilobytes should be enough for everyone".


Many of the models people are building here, such as generative image models, take a few days to train (say, 100 hours) on our 4-GPU boxes. Research would be faster if we could train on 400 GPUs in one hour, but the communication bandwidth required makes it hard to scale.


What is being shared between GPUs?

Training data is easy to handle: just duplicate it and share nothing.

Large models with shared weights get tricky, but less frequent asynchronous updates with schemes like Hogwild seem to work with SGD. I believe TF has support for this too. It won't scale linearly, but it might be good enough.
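
As a toy illustration of the Hogwild idea (my own sketch in Scala, not any framework's code): several threads run SGD against a single shared weight array with no locking at all, and the occasional lost or stale update mostly comes out in the wash.

    import java.util.concurrent.Executors
    import scala.util.Random

    object HogwildToy {
      def main(args: Array[String]): Unit = {
        val dim   = 10
        val rng   = new Random(42)
        val trueW = Array.fill(dim)(rng.nextGaussian())

        // Synthetic linear-regression data: y = trueW . x + noise
        val data = Array.fill(20000) {
          val x = Array.fill(dim)(rng.nextGaussian())
          val y = x.zip(trueW).map { case (xi, wi) => xi * wi }.sum + 0.01 * rng.nextGaussian()
          (x, y)
        }

        // One shared, unsynchronized weight vector - this is the "Hogwild" part.
        val w  = Array.fill(dim)(0.0)
        val lr = 0.01

        val pool   = Executors.newFixedThreadPool(4)
        val shards = data.grouped(data.length / 4).toArray
        shards.foreach { shard =>
          pool.submit(new Runnable {
            def run(): Unit =
              shard.foreach { case (x, y) =>
                val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
                var i = 0
                while (i < dim) { w(i) -= lr * err * x(i); i += 1 }  // racy, lock-free update
              }
          })
        }
        pool.shutdown()
        while (!pool.isTerminated) Thread.sleep(10)

        println(s"learned: ${w.map(v => f"$v%.2f").mkString(", ")}")
        println(s"true:    ${trueW.map(v => f"$v%.2f").mkString(", ")}")
      }
    }

On sparse problems the threads rarely touch the same coordinates at the same time, which is why the lock-free races are tolerable.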

There's some excitement about synthetic gradients to allow less communication and further parallelism.

The HPC community has certainly leveraged 400-GPU clusters.

It seems like a fun problem, and if you want to focus the resources, there isn't anything insurmountable about utilizing 400 GPUs :)


The DeepMind paper on synthetic gradients is super interesting for training in clusters. I hope it gets built into TF soon.

Also really hope there will be more options for GPU hardware on public clouds.

Eventually the pieces will come together and it will be trivial to deploy 400 cloud GPUs for an hour to run the load your local 4-GPU workstation would spend 100 hours on, but we are definitely not there yet.

We are working on a packaging format called snappy (http://www.snapcraft.io) and starting to talk to the TF guys about packaging TF with it (Kubernetes is already packaged as a snap). Hopefully this will take some pain away once it's working.


> I am aware of that. The question was not whether more than 8 GPUs would be useful in ideal circumstances; it was how many people actually use that functionality with frameworks other than DL4J.

BTW I do not have any connection with DL4J, nor do I use it.



