>>> Softmax can be implemented as a composition of primitive TensorFlow ops (exponentiation, reduction, elementwise division, etc.): softmax = exp(logits) / reduce_sum(exp(logits), dim)
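For concreteness, here is that composition transcribed into plain C (a hypothetical softmax_naive, just to make the quoted formula explicit; as pointed out further down in the thread, this naive form is numerically unstable):

#include <math.h>

/* Naive softmax: y[i] = expf(x[i]) / sum_j expf(x[j]).
 * Direct transcription of the formula above; expf overflows
 * to +inf for inputs above ~88.7, giving inf/inf = NaN. */
void softmax_naive(int n, const float *x, float *y)
{
    int i;
    float s = 0.0f;
    for (i = 0; i < n; ++i) s += (y[i] = expf(x[i]));
    for (i = 0; i < n; ++i) y[i] /= s;
}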
In the NNPACK implementation, the same exponential (i.e. expf) is computed twice for each element, which is a waste of time. A faster implementation would save each expf result to output[sample][channel] first, compute the sum, and then rescale output[sample][channel] by the sum.
If you have an efficient vectorized implementation of expf (NNPACK does), softmax is a memory/cache bandwidth-bound kernel, and storing data is less efficient than recomputing.
Where is this vectorized expf implemented? All I see is softmax calling the standard expf. Let's suppose expf from libm is vectorized. Is there any benchmark showing NNPACK's implementation is really faster? I doubt it, actually. Exp is quite expensive even when vectorized.
This is the reference implementation of softmax, i.e. the implementation used as a reference in unit tests. It is designed to be simple, readable, and correct, which is why I linked it here.
Thanks for the pointer. The full softmax implementation is here [1]. I have not read the code, but I can trust the developer to have a very fast implementation. Nonetheless, I don't think the reference implementation in your original link is optimal. Exp is expensive and should not be called twice (EDIT: unless you can show a benchmark to prove me wrong).
I later realized he is the developer, but that does not change the discussion. Here is a micro benchmark: computing softmax 1 million times over a random vector of size 1000. On an old Linux server, calling the libm expf once per element takes 11.76 CPU seconds; calling it twice takes 25.15s. The implementation that calls expf once:
#include <float.h>  /* FLT_MAX */
#include <math.h>   /* expf */

void softmax1(int n, const float *x, float *y)
{
    int i;
    float s, max = -FLT_MAX;
    /* subtract the max for numerical stability */
    for (i = 0; i < n; ++i) max = max > x[i]? max : x[i];
    /* one expf call per element; keep each result in y[] */
    for (i = 0, s = 0.0f; i < n; ++i) s += (y[i] = expf(x[i] - max));
    /* rescale by the reciprocal of the sum */
    for (i = 0, s = 1.0f / s; i < n; ++i) y[i] *= s;
}
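For completeness, a driver along these lines reproduces the setup (a minimal sketch; the seed, input distribution, and timing mechanism are assumptions on my part, only the 1M x 1000 shape comes from the description above):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void softmax1(int n, const float *x, float *y);  /* from the snippet above */

int main(void)
{
    enum { N = 1000, ITER = 1000000 };
    static float x[N], y[N];
    int i, j;
    clock_t t;

    srand(11);  /* arbitrary seed; the original post does not say */
    for (i = 0; i < N; ++i)
        x[i] = 2.0f * rand() / RAND_MAX - 1.0f;  /* random values in [-1, 1] */

    t = clock();
    for (j = 0; j < ITER; ++j)
        softmax1(N, x, y);
    fprintf(stderr, "%.2f CPU seconds\n", (double)(clock() - t) / CLOCKS_PER_SEC);
    return 0;
}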
This micro benchmark proves my point: using expf from libm, the reference implementation in NNPACK is suboptimal. It is possible that a vectorized expf would change the picture, but the developer needs to prove it with numbers.
Easier than that: I got 64-bit Python 3.6 and headed to Christoph Gohlke's amazing Python wheels repository, downloaded everything I needed (NumPy, SciPy, OpenCV 3.2, matplotlib, etc., plus TensorFlow), and then it was just a matter of:
pip3 install [filename]
...until everything was installed. I gotta buy that man more than a few beers for all the time he's saved me over the years. Bonus points: got the wheels with Intel MKL and OpenCV+contrib.
It would seem Torch/PyTorch are faster than TF. TF uses static optimizations on the computation graph, while Torch has a dynamic computation graph. Logically, static optimizations should be faster because they know the data sizes beforehand.
TensorFlow is getting dynamic JIT optimization too. I think part of the reason some dynamic optimizations can perform better is that the results can be cached and reused for most other batches, and they can specialize to exploit batch-, shape-, or input-specific properties.
I'm kind of getting the sense that TF is presently being optimized for its own massive development: that is, the engineers are yanking out chunks of code and replacing them, quickly, and at varying scales.
I've been looking around in a few places, but I can't find a way to use XLA to compile TensorFlow models for mobile devices. Is there a tutorial/blog post by Google (or anyone, for that matter) about it?
Thanks!
It's gpucc, so it builds from LLVM when you enable XLA in the TF configure step.
Trying to understand: do you not want to ship a compiler library on principle, or is it some kind of product requirement, or something else? There's lots of cool work to be done in the compiler space, so use cases help to prioritize. :-) Thanks!
No, it cannot be implemented this way: it is numerically unstable and will produce NaNs if any input is greater than ~88.7 (expf overflows to +inf above ln(FLT_MAX) ≈ 88.72, and the subsequent inf/inf division yields NaN). Luckily, that is also not how it's implemented in TensorFlow: https://github.com/tensorflow/tensorflow/blob/2c8d0dca978a24...
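To see where the ~88.7 cutoff comes from, here is a standalone C snippet (FLT_MAX ≈ 3.4e38, so the overflow point is ln(FLT_MAX) ≈ 88.72):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* expf overflows to +inf just above ln(FLT_MAX) ~= 88.72 */
    printf("expf(88.7f) = %g\n", expf(88.7f));                /* ~3.3e38, still finite */
    printf("expf(88.8f) = %g\n", expf(88.8f));                /* inf */
    printf("inf / inf   = %g\n", expf(88.8f) / expf(88.8f));  /* nan */
    return 0;
}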
For a clean (and more efficient) C version of this algorithm, take a look at the NNPACK reference implementation: https://github.com/Maratyszcza/NNPACK/blob/master/src/ref/so...