> Probably because that's not what it was designed for.
TF was explicitly designed with distributed training in mind (their initial whitepaper and the DistBelief paper that came before it make this clear) -- I don't know how you came to this conclusion.
Usually when people say TF is slow, it turns out they've introduced a serious bottleneck somewhere.
Both points are true: TF is obviously designed for scalable, distributed training, but it is also heavily tied to Google's compute infrastructure (less so all the time, of course, though now it's also being closely tied to Google Cloud). So while I disagree with my colleague* that TF is "slow" or "not designed for distributed training," I support the slightly different (and implicit) argument that there are some settings (often in enterprise, I am learning) where it might not be as good a fit as other frameworks (e.g., DL4J, Caffe, whatever).
* Disclosure: I work with Skymind and contribute to DL4J, and I also use TensorFlow/Theano/keras heavily in my PhD research. I am an equal opportunity framework guy. ;)
The environment for your large clusters is almost certainly different from that of everyone using TF outside of Google. I'm speaking about the problems they'll run into.
Also no - we use data parallelism. I'd maybe check out our new Spark page we put up that explains how parameter averaging works:
http://deeplearning4j.org/spark
We train on raw Spark partitions. Because of the data parallelism, we can also use hardware acceleration for our Spark jobs, which means we'll train on GPUs when they're present.
We use the Spark master as the "parameter server" to handle communication of the parameter updates, with multiple ways of controlling the averaging threshold per iteration (among other things) to minimize communication.
This allows us to do neat things like using cuDNN on a Spark cluster. I'll assume you're not interested in too many details there, but happy to expand if needed.
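To make the parameter-averaging idea concrete, here's a minimal sketch in plain Spark/Scala -- not our actual implementation; the names (`ParamAveragingSketch`, `Example`, `trainLocally`) are made up for illustration:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object ParamAveragingSketch {
  // Toy example type -- a placeholder, not a DL4J class.
  case class Example(features: Array[Double], label: Double)

  // Plain SGD on a linear model, run independently inside one Spark partition.
  def trainLocally(init: Array[Double], data: Iterator[Example]): Array[Double] = {
    val w  = init.clone()
    val lr = 0.01
    data.foreach { ex =>
      val pred = w.zip(ex.features).map { case (wi, xi) => wi * xi }.sum
      val err  = pred - ex.label
      for (i <- w.indices) w(i) -= lr * err * ex.features(i)
    }
    w
  }

  // One round of parameter averaging: broadcast the current weights, let each
  // partition train its own copy, then average the per-partition results.
  def parameterAveragingEpoch(sc: SparkContext,
                              data: RDD[Example],
                              weights: Array[Double]): Array[Double] = {
    val bcast        = sc.broadcast(weights)
    val perPartition = data.mapPartitions(it => Iterator(trainLocally(bcast.value, it)))
    val (sum, count) = perPartition
      .map(w => (w, 1L))
      .reduce { case ((w1, n1), (w2, n2)) =>
        (w1.zip(w2).map { case (a, b) => a + b }, n1 + n2)
      }
    sum.map(_ / count)
  }
}
```

In the real thing the "model" is a full neural net and the averaging frequency/threshold is configurable, but the control flow is basically this loop repeated per iteration.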
Edit: When you "played" with us - I'm assuming you just cloned our examples and ran us on the desktop? Likely more than a year ago, before our C++ rewrite? I'd be curious to see if you ran us on Spark as well.
Running JVM-based stuff is completely different from running a Python job. I know JVM-based stuff is a bit harder to get up and running because of dependency clashes (with Spark being JVM-based, etc.).
If you're ever in Scala land, I encourage you to take a look at our Keras port to Scala when it's out. We'll also be running reinforcement learning workloads on Spark.
> Also no - we use data parallelism. I'd maybe check out our new Spark page we put up that explains how parameter averaging works: http://deeplearning4j.org/spark
To quote your page:
> Data parallelism shards large datasets and hands those pieces to separate neural networks, say, each on its own core. Deeplearning4j relies on Spark for this, training models in parallel
This isn't the same thing as the TensorFlow distributed training model at all.
No..? I'm not sure how training on several shards at once and averaging the results asynchronously is hyperparameter tuning o_0 We have a whole dedicated library for that called Arbiter.
You're thinking of grid search and the like. We implement grid search and Bayesian optimization on Spark (the latter being closed source).
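To be clear about the distinction, grid search distributed over Spark looks conceptually like this -- a rough sketch, not Arbiter's actual API; `HyperParams` and `trainAndScore` are made-up stand-ins:

```scala
import org.apache.spark.SparkContext

object GridSearchSketch {
  // Made-up hyperparameter record -- not Arbiter's API.
  case class HyperParams(learningRate: Double, l2: Double, hiddenUnits: Int)

  // Stand-in for "train a model with these settings and return a validation
  // score"; a dummy formula here so the sketch runs end to end.
  def trainAndScore(hp: HyperParams): Double =
    1.0 / (1.0 + hp.l2 + math.abs(hp.learningRate - 0.01)) + hp.hiddenUnits * 1e-4

  // Each candidate configuration becomes one Spark task; the driver collects
  // the scores and keeps the best one.
  def gridSearch(sc: SparkContext): (HyperParams, Double) = {
    val grid = for {
      lr     <- Seq(0.001, 0.01, 0.1)
      l2     <- Seq(0.0, 1e-4, 1e-3)
      hidden <- Seq(64, 128, 256)
    } yield HyperParams(lr, l2, hidden)

    sc.parallelize(grid)
      .map(hp => (hp, trainAndScore(hp)))
      .collect()
      .maxBy(_._2)
  }
}
```

Parameter averaging, by contrast, is about training one model on shards of the same dataset, not about exploring different configurations.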
I didn't say it was exactly the same as TF either. We have been doing this for close to 2 years now. Actually... I'm not sure why it has to be? It's closer to Hogwild and co.
You also never answered my question ;). Not sure what to assume here.
Sorry, I edited. I agree it's not hyperparameter optimisation, more some kind of regularisation thing.
But it isn't the same as what TensorFlow does, and I'd argue it is much closer to my initial characterisation ("can use Spark to coordinate training multiple models in parallel")
Not entirely sure about which question I missed, or what you want to assume. If it is this:
When you "played" with us - I'm assuming you just cloned our examples and ran us on the desktop? Likely more than a year ago before our c++ rewrite? I'd be curious to see if you ran us on spark as well.
Then no. It was around November, and I got the demos working, and attempted to build a custom network.
Edit: Was it the pyspark question? Then yes, but we also use Scala, Java and R (and SQL of course).
Are you comparing 2015 DL4J to 2016 TensorFlow? Your opinion of our framework seems outdated. You're welcome to try us now -- that would actually be a fair comparison.
Curious: When did you play with DL4J and what kind of cluster are you running TF on? You're right to say those are two separate things, so let's compare apples to apples.
For data parallelism, we have a simple wrapper that covers as many GPUs as you want in one box. For more than one box, we use Spark; the integration is explained on the Spark page linked above.
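Usage of the single-box wrapper looks roughly like this (a sketch in Scala, assuming the wrapper in question is DL4J's ParallelWrapper; the builder method names are from memory of the examples and may differ between versions):

```scala
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.deeplearning4j.parallelism.ParallelWrapper
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator

object MultiGpuSketch {
  // Wrap an already-configured network so it trains data-parallel across the
  // GPUs in one box; method names are from memory and may vary by version.
  def fitOnAllGpus(model: MultiLayerNetwork, data: DataSetIterator): Unit = {
    val wrapper = new ParallelWrapper.Builder(model)
      .workers(4)             // one worker per GPU on an assumed 4-GPU box
      .prefetchBuffer(24)     // async minibatch prefetch per worker
      .averagingFrequency(3)  // average parameters every 3 minibatches
      .build()
    wrapper.fit(data)
  }
}
```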
We're almost done testing on DGXs now, like the one you have at OpenAI, and once we work a bit more with RDMA-enabled hardware, we'll go Sparkless for that.
That's how we're hoping to get academic HW acceleration into production environments.
In September, we plan to come out with a Scala API inspired by Keras/Torch, which will also share a neural net config file with Keras and allow model imports into DL4J.
We're actually going to be doing some stuff with IBM/NVLink, as well as some other neat things I can't announce. We'll be able to benchmark on this front similarly to you guys, though :).
We'll be doing RDMA for this and plan on writing the code to match, using Spark for orchestration and data storage/loading.
Right now the main thing we do is data parallelism: data is trained on the various Spark partitions with intermittent parameter averaging.
I don't know, but I am curious how many people, in percentage terms, outside Google and Facebook and the like, need to scale their models to more than 8 GPUs?
If you're paying for CPU/GPU hours, the more you parallelize the faster you get your result for the same money. (Of course till you hit the parallelization limit of your network, but NNs are very parallelizable in general). And the larger your training dataset, the better your results.
I am aware of that. The question was not whether more than 8 GPUs would be useful in ideal circumstances; it was how many people actually use that functionality with frameworks other than DL4J.
Many of the models people are building here, such as generative image models, take a few days to train (say, 100 hours) on our 4-GPU boxes. Research would be faster if we could train on 400 GPUs in one hour, but the communication bandwidth required makes it hard to scale.
Training data is easy to duplicate and process share-nothing.
Large models with shared weights get tricky, but less frequent asynchronous updates with schemes like Hogwild seem to work with SGD. I believe TF has support for this too. It won't scale linearly, but it might be good enough.
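As a toy illustration of the Hogwild-style idea (synthetic data, plain threads, no framework API; `HogwildSketch`, `Sparse`, and `sgdStep` are invented names): several workers apply sparse SGD updates to shared weights with no locking, and the rare collisions don't hurt convergence much.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.util.Random

object HogwildSketch {
  val dim     = 1000
  val weights = new Array[Double](dim)   // shared, deliberately unsynchronized
  val lr      = 0.01

  // A sparse example: a few (index, value) features plus a target.
  case class Sparse(idx: Array[Int], value: Array[Double], label: Double)

  // One lock-free SGD step on a linear model; races are rare and "benign"
  // because each example touches only a handful of weight indices.
  def sgdStep(ex: Sparse): Unit = {
    var dot = 0.0
    for (k <- ex.idx.indices) dot += weights(ex.idx(k)) * ex.value(k)
    val err = dot - ex.label
    for (k <- ex.idx.indices) weights(ex.idx(k)) -= lr * err * ex.value(k)
  }

  def main(args: Array[String]): Unit = {
    val rnd  = new Random(0)
    val data = Array.fill(100000) {
      Sparse(Array.fill(5)(rnd.nextInt(dim)), Array.fill(5)(rnd.nextDouble()), rnd.nextDouble())
    }
    val pool = Executors.newFixedThreadPool(8)
    data.grouped(data.length / 8).foreach { shard =>
      pool.submit(new Runnable { def run(): Unit = shard.foreach(sgdStep) })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
  }
}
```

In a cluster setting the same idea shows up as stale, infrequent parameter-server updates instead of shared memory.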
There's some excitement about synthetic gradients to allow less communication and further parallelism.
The HPC community has certainly leveraged 400-GPU clusters.
It seems like a fun problem, and if you want to focus the resources, there isn't anything insurmountable about utilizing 400 GPUs :)
The DeepMind paper on synthetic gradients is super interesting for training in clusters. I hope it gets built into TF soon.
Also really hope there will be more options for GPU hardware on public clouds.
Eventually the pieces will come together and it will be trivial to deploy 400 cloud GPUs for an hour to run the load your local 4-GPU workstation would spend 100 hours on, but we are definitely not there yet.
We are working on a packaging format called snappy (http://www.snapcraft.io) and starting to talk to the TF guys about packaging TF with it (Kubernetes is already packaged as a snap). Hopefully this will take some pain away once it's working.
> I am aware of that. The question was not whether more than 8 GPUs would be useful in ideal circumstances; it was how many people actually use that functionality with frameworks other than DL4J.
BTW I do not have any connection with DL4J, nor do I use it.
For Java/Scala people, Deeplearning4j has a pretty sophisticated Spark + GPUs setup:
http://deeplearning4j.org/gpu
http://deeplearning4j.org/spark
http://deeplearning4j.org/spark-gpus
[Disclosure: I help create DL4J, and it's supported by my startup, Skymind.]