>>> Softmax can be implemented as a composition of primitive TensorFlow ops (exponent, reduction, elementwise division, etc.): softmax = exp(logits) / reduce_sum(exp(logits), dim)
In the NNPACK implementation, the same exponential (i.e. expf) is computed twice for each element, which wastes time. A faster implementation would store each expf result in output[sample][channel] first, compute the sum, and then rescale output[sample][channel] by the sum.
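For reference, the pattern under discussion looks roughly like this (a sketch with a hypothetical name, not the actual NNPACK source): expf is evaluated once to accumulate the sum and a second time to produce each output.

#include <float.h> /* FLT_MAX */
#include <math.h>  /* expf */

void softmax_twice(int n, const float *x, float *y)
{
    int i;
    float s, max = -FLT_MAX;
    for (i = 0; i < n; ++i) max = max > x[i]? max : x[i];
    for (i = 0, s = 0.0f; i < n; ++i) s += expf(x[i] - max); /* 1st expf call */
    for (i = 0; i < n; ++i) y[i] = expf(x[i] - max) / s;     /* 2nd expf call */
}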
If you have an efficient vectorized implementation of expf (NNPACK does), softmax is a memory/cache-bandwidth-bound kernel, and storing the intermediate results is less efficient than recomputing them.
Where is this vectorized expf implemented? I only see softmax calling the standard expf. Let's suppose expf from libm is vectorized. Is there any benchmark showing NNPACK's implementation is really faster? I doubt it, actually. Exp is quite expensive even when vectorized.
This is the reference implementation of softmax, i.e. the implementation used as a reference in unit tests. It is designed to be simple, readable, and correct, which is why I linked it here.
Thanks for the pointer. The full softmax implementation is here [1]. I have not read the code, but I can trust the developer to have a very fast implementation. Nonetheless, I don't think the reference implementation in your original link is optimal. Exp is expensive and should not be called twice (EDIT: unless you can show a benchmark to prove me wrong).
I later realized he is the developer, but that does not change the discussion. Here is a micro-benchmark: computing softmax 1 million times over a random vector of size 1000. On an old Linux server, calling the libm expf once takes 11.76 CPU seconds; calling it twice takes 25.15s. The implementation that calls expf once:
#include <float.h> /* FLT_MAX */
#include <math.h>  /* expf */

void softmax1(int n, const float *x, float *y)
{
    int i;
    float s, max = -FLT_MAX;
    /* pass 1: find the max for numerical stability */
    for (i = 0; i < n; ++i) max = max > x[i]? max : x[i];
    /* pass 2: call expf once per element, storing the result and accumulating the sum */
    for (i = 0, s = 0.0f; i < n; ++i) s += (y[i] = expf(x[i] - max));
    /* pass 3: rescale by the reciprocal of the sum */
    for (i = 0, s = 1.0f / s; i < n; ++i) y[i] *= s;
}
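A driver along these lines reproduces the setup (a minimal sketch; the original harness is not shown, and the input distribution is assumed uniform):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void softmax1(int n, const float *x, float *y); /* defined above */

int main(void)
{
    int i, n = 1000, rounds = 1000000;
    float *x = (float*)malloc(n * sizeof(float));
    float *y = (float*)malloc(n * sizeof(float));
    clock_t t0;
    for (i = 0; i < n; ++i) x[i] = (float)rand() / RAND_MAX; /* random vector of size 1000 */
    t0 = clock();
    for (i = 0; i < rounds; ++i) softmax1(n, x, y); /* 1 million softmax calls */
    fprintf(stderr, "%.2f CPU seconds\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
    free(x); free(y);
    return 0;
}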
This micro-benchmark proves my point: using expf from libm, the reference implementation in NNPACK is suboptimal. It is possible that a vectorized expf would change this conclusion, but the developer needs to prove it with numbers.
No, it cannot be implemented this way: it is numerically unstable and will produce NaNs if any input is greater than ~88.7, since expf overflows single precision for inputs above ln(FLT_MAX) ≈ 88.72, and inf/inf yields NaN. Luckily, it is also not how it's implemented in TensorFlow: https://github.com/tensorflow/tensorflow/blob/2c8d0dca978a24...
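A minimal illustration of the failure mode, assuming the naive formula above with no max subtraction:

#include <math.h>
#include <stdio.h>

int main(void)
{
    float x[2] = { 100.0f, 100.0f };        /* 100 > ln(FLT_MAX) ~= 88.72 */
    float s = expf(x[0]) + expf(x[1]);      /* inf + inf = inf */
    printf("%f\n", expf(x[0]) / s);         /* inf / inf = nan */
    return 0;
}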
For a clean (and more efficient) C version of this algorithm, take a look at the NNPACK reference implementation: https://github.com/Maratyszcza/NNPACK/blob/master/src/ref/so...