As someone who runs a production Spark cluster, has a multi-node TF setup and ha...

agibsonccc · on Aug 30, 2016

I'm assuming you're using pyspark?

Also no - we use data paralellism. I'd maybe check out our new spark page we put up that explains how parameter averaging works: http://deeplearning4j.org/spark

Basically we train on raw spark partitions. Basically because of the data parallelism we can also use hardware accel for our spark jobs, which means for training we'll train on gpus when present.

We use a spark master for the "parameter server" to handle communication of the parameter updates with multiple ways of controlling the averaging threshold per iteration among other things to minimize communication.

This allows us to do neat things like using cudnn on a spark cluster. I'll assume you're not interested in too many details there but happy to expand if needed.

Edit: When you "played" with us - I'm assuming you just cloned our examples and ran us on the desktop? Likely more than a year ago before our c++ rewrite? I'd be curious to see if you ran us on spark as well.

Running jvm based stuff is completely different than running a python job. I know jvm based stuff is a bit harder to get up and running because of the dependency clashes (with spark being jvm based etc)

If you're ever in scala land I encourage you to take a look at our keras port to scala when its out. We'll also be running reinforcement learning workloads on spark as well.

agibsonccc · on Aug 31, 2016

Not sure where the downvote came from - maybe I can explain a bit.

If it was from my last part - I'm actually asking for clarification because we've

often discovered this when people say they've tried us.

A lot of folks don't spend too much time on a framework and I don't blame them.

They are looking for a tool to get the job done and move on 99% of the time. That's more than fair.

The bulk of these folks also tend to be people who are mildly curious coming from python,

where most deep learning practitioners are.

That being said:

I was serious about each node being independent for spark.

Our linear algebra library runs as a spark job on each executor/slave node.

We've built in our own resource management via

javacpp that handles memory allocation for the cuda buffers

as well as cudnn.

The driver for spark then tracks the coefficients across spark rdd partitions which allows us to synchronize updates.

Look out for a blog post from us on parallelforall (nvidia's blog) here soon that explains how this works.

nl · on Aug 31, 2016

Also no - we use data paralellism. I'd maybe check out our new spark page we put up that explains how parameter averaging works: http://deeplearning4j.org/spark

To quote your page:

Data parallelism shards large datasets and hands those pieces to separate neural networks, say, each on its own core. Deeplearning4j relies on Spark for this, training models in parallel

This isn't the same thing as the TensorFlow distributed training model at all.

agibsonccc · on Aug 31, 2016

No..? I'm not sure how training on several shards at once and averaging the results asynchronously is hyper parameter tuning o_0 We have a whole dedicated library for that called arbiter.

You are thinking of grid search and the like. We implement grid search and bayesian on spark (the latter being closed source)

I didnt say it was the exactly the same as tf either. We have been doing this for close to 2 years now. Actually..Im not sure why it has to be? Its closer to hogwild and co.

You also never answered my question ;). Not sure what to assume here.

nl · on Aug 31, 2016

Sorry, I edited. I agree it's not hyperparamter optimisation, more some kind of regularisation thing.

But it isn't the same as what TensorFlow does, and I'd argue it is much closer to my initial characterisation ("can use Spark to coordinate training multiple models in parallel")

Not entirely sure about which question I missed, or what you want to assume. If it is this:

When you "played" with us - I'm assuming you just cloned our examples and ran us on the desktop? Likely more than a year ago before our c++ rewrite? I'd be curious to see if you ran us on spark as well.

Then no. It was around November, and I got the demos working, and attempted to build a custom network.

Edit: Was it the pyspark question? Then yes, but we also use Scala, Java and R (and SQL of course).

vonnik · on Aug 31, 2016

Are you comparing 2015 DL4J to 2016 TensorFlow? Your opinion of our framework seems outdated. You're welcome to try us now -- that would actually be a fair comparison.

vonnik · on Aug 30, 2016

Curious: When did you play with DL4J and what kind of cluster are you running TF on? You're right to say those are two separate things, so let's compare apples to apples.