> Probably because that's not what it was designed for.
TF was explicitly designed with distributed training in mind (their initial whitepaper and the DistBelief paper that came before it make this clear) -- I don't know how you came to this conclusion.
Usually when people say TF is slow, it turns out they've introduced a serious bottleneck somewhere.
Both points are true: TF is obviously designed for scalable, distributed training, but it also heavily tied to Google's compute infrastructure (less so all the time, of course, but now it's also being closely tied to Google Cloud). So while I disagree with my colleague* that TF is "slow" or "not designed for distributed training," I support the slightly different (and implicit) argument that there are some settings (often in enterprise, I am learning) where it might not be as good a fit as other frameworks (e.g., DL4J, Caffe, whatever).
* Disclosure: I work with Skymind and contribute to DL4J, and I also use TensorFlow/Theano/keras heavily in my PhD research. I am an equal opportunity framework guy. ;)
The environment for your large clusters is almost certainly different from everyone using TF outside of Google. I'm speaking about the problems they'll run into.
TF was explicitly designed with distributed training in mind (their initial whitepaper and the DistBelief paper that came before it make this clear) -- I don't know how you came to this conclusion.
Usually when people say TF is slow, it turns out they've introduced a serious bottleneck somewhere.