Also no - we use data parallelism. I'd check out our new Spark page we put up, which explains how parameter averaging works:
http://deeplearning4j.org/spark
Basically, we train on raw Spark partitions. Because of the data parallelism we can also use hardware acceleration for our Spark jobs, which means we'll train on GPUs when they're present.
We use the Spark master as the "parameter server" to handle communication of the parameter updates, with multiple ways of controlling things like the averaging threshold per iteration to minimize communication.
This lets us do neat things like using cuDNN on a Spark cluster. I'll assume you're not interested in too many details there, but I'm happy to expand if needed.
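To make the parameter averaging part concrete, here's a toy sketch of the idea on plain Spark - this is just the concept in PySpark with a throwaway linear model and a made-up local_sgd helper, not our actual JVM API:

    # Toy sketch of parameter averaging over Spark partitions (concept only,
    # not DL4J's API). Assumes a live SparkContext; the model is a bare-bones
    # linear regressor so the averaging step stays obvious.
    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="param-averaging-sketch")

    def local_sgd(params, partition, lr=0.01):
        # Each worker does one SGD pass over its shard, starting from the
        # broadcast global parameters.
        w = params.copy()
        for x, y in partition:
            w -= lr * (w.dot(x) - y) * x
        return [w]

    # Fake dataset, sharded into 4 raw Spark partitions.
    data = [(np.random.randn(10), np.random.randn()) for _ in range(1000)]
    rdd = sc.parallelize(data, numSlices=4).cache()

    w_global = np.zeros(10)
    for _ in range(20):  # averaging every iteration here; the frequency is tunable
        w_bc = sc.broadcast(w_global)
        # Every partition trains its own copy of the parameters...
        local_params = rdd.mapPartitions(lambda part: local_sgd(w_bc.value, part)).collect()
        # ...and the driver (playing the "parameter server") averages them.
        w_global = np.mean(local_params, axis=0)

The real thing obviously batches, prefetches and controls how often the averaging happens, but that's the shape of it.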
Edit: When you "played" with us - I'm assuming you just cloned our examples and ran us on the desktop? Likely more than a year ago, before our C++ rewrite? I'd be curious to see if you ran us on Spark as well.
Running JVM-based stuff is completely different from running a Python job. I know JVM-based stuff is a bit harder to get up and running because of dependency clashes (with Spark itself being JVM-based, etc.).
If you're ever in Scala land, I encourage you to take a look at our Keras port to Scala when it's out. We'll also be running reinforcement learning workloads on Spark.
Also no - we use data parallelism. I'd check out our new Spark page we put up, which explains how parameter averaging works: http://deeplearning4j.org/spark
To quote your page:
Data parallelism shards large datasets and hands those pieces to separate neural networks, say, each on its own core. Deeplearning4j relies on Spark for this, training models in parallel
This isn't the same thing as the TensorFlow distributed training model at all.
No..? I'm not sure how training on several shards at once and averaging the results asynchronously is hyperparameter tuning o_0 We have a whole dedicated library for that, called Arbiter.
You're thinking of grid search and the like. We implement grid search and Bayesian optimization on Spark (the latter being closed source).
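To be clear about the distinction: distributed grid search on Spark is basically "one candidate config per task", something like this sketch (the idea only, not Arbiter's actual API; train_and_score is a made-up placeholder):

    # Sketch of grid search fanned out over Spark (idea only, not Arbiter's API).
    from itertools import product
    from pyspark import SparkContext

    sc = SparkContext(appName="grid-search-sketch")

    # The hyperparameter grid: every combination becomes one Spark task.
    grid = [{"lr": lr, "layers": n, "dropout": d}
            for lr, n, d in product([0.1, 0.01, 0.001], [2, 3], [0.0, 0.5])]

    def train_and_score(config):
        # Placeholder: in reality this would build and fit a whole network with
        # `config` on the worker and return its validation score.
        score = -abs(config["lr"] - 0.01) - 0.1 * config["dropout"]
        return (score, config)

    best_score, best_config = (sc.parallelize(grid, numSlices=len(grid))
                                 .map(train_and_score)
                                 .max(key=lambda t: t[0]))

That's a search over independent models. Parameter averaging is many workers cooperating on one model, which is why I'd put it in a different bucket.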
I didn't say it was exactly the same as TF either. We've been doing this for close to 2 years now. Actually... I'm not sure why it has to be? It's closer to Hogwild! and co.
You also never answered my question ;). Not sure what to assume here.
Sorry, I edited. I agree it's not hyperparameter optimisation, more like some kind of regularisation thing.
But it isn't the same as what TensorFlow does, and I'd argue it is much closer to my initial characterisation ("can use Spark to coordinate training multiple models in parallel")
Not entirely sure about which question I missed, or what you want to assume. If it is this:
When you "played" with us - I'm assuming you just cloned our examples and ran us on the desktop? Likely more than a year ago, before our C++ rewrite? I'd be curious to see if you ran us on Spark as well.
Then no. It was around November; I got the demos working and attempted to build a custom network.
Edit: Was it the PySpark question? Then yes, but we also use Scala, Java and R (and SQL, of course).
Are you comparing 2015 DL4J to 2016 TensorFlow? Your opinion of our framework seems outdated. You're welcome to try us now -- that would actually be a fair comparison.
Curious: When did you play with DL4J and what kind of cluster are you running TF on? You're right to say those are two separate things, so let's compare apples to apples.
Training TF on a cluster is hard because you need to write your model to be cluster-aware. Apart from that it works pretty well.
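To make "cluster-aware" concrete, here's roughly what a distributed TF script of that era has to carry around - the hostnames are made up, and the API is the 0.x/1.x tf.train cluster setup:

    # Rough shape of a cluster-aware TF (~0.x/1.x) script: the model code itself
    # names the jobs/tasks and decides where variables and ops live.
    import tensorflow as tf

    cluster = tf.train.ClusterSpec({
        "ps":     ["ps0.example.com:2222"],                      # parameter server
        "worker": ["worker0.example.com:2222",
                   "worker1.example.com:2222"],
    })
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # Variables get pinned to the ps job, compute ops to this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:0", cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 784])
        y = tf.placeholder(tf.float32, [None, 10])
        w = tf.Variable(tf.zeros([784, 10]))
        b = tf.Variable(tf.zeros([10]))
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                labels=y, logits=tf.matmul(x, w) + b))
        train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Every worker runs this same script with its own job_name/task_index and
    # then drives training sessions against server.target.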
From my understanding, DL4J can use Spark to coordinate training multiple models in parallel. This is interesting, but not the same thing at all.
Also TF on a cluster is much simpler to get running than Spark+DL4J.