Thanks for the pointer. The full softmax implementation is here [1]. I have not read the code, but I trust the developer to have a very fast implementation. Nonetheless, I don't think the reference implementation in your original link is optimal: expf is expensive and should not be called twice per element (EDIT: unless you can show a benchmark that proves me wrong).
I later realized he is the developer, but that does not change the discussion. Here is a micro benchmark: computing softmax 1 million times over a random vector of size 1000. On an old Linux server, calling libm's expf once per element takes 11.76 CPU seconds; calling it twice takes 25.15. The implementation that calls expf once:
void softmax1(int n, const float *x, float *y)
{
    int i;
    float s, max = -FLT_MAX;
    for (i = 0; i < n; ++i) max = max > x[i]? max : x[i];             /* max for numerical stability */
    for (i = 0, s = 0.0f; i < n; ++i) s += (y[i] = expf(x[i] - max)); /* one expf call; result kept in y */
    for (i = 0, s = 1.0f / s; i < n; ++i) y[i] *= s;                  /* normalize by multiplying once */
}
This micro benchmark proves my point: with expf from libm, the reference implementation in NNPACK is suboptimal. A vectorized expf might change the picture, but the developer needs to show that with numbers.
[1]: https://github.com/Maratyszcza/NNPACK/blob/master/src/x86_64...