>>> Softmax can be implemented as a composition of primitive TensorFlow ops (exponent, reduction, elementwise division, etc.): softmax = exp(logits) / reduce_sum(exp(logits), dim)
In the NNPACK implementation, the same exponential (i.e. expf) is computed twice for each element, which wastes time. A faster implementation would store each expf result in output[sample][channel] first, compute the sum, and then rescale output[sample][channel] by the sum.
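For reference, the pattern under discussion looks roughly like this (a sketch with a hypothetical name, not the actual NNPACK source): expf is evaluated once to accumulate the sum and a second time to produce each output.

#include <float.h> /* FLT_MAX */
#include <math.h>  /* expf */

void softmax_twice(int n, const float *x, float *y)
{
    int i;
    float s, max = -FLT_MAX;
    for (i = 0; i < n; ++i) max = max > x[i]? max : x[i];
    for (i = 0, s = 0.0f; i < n; ++i) s += expf(x[i] - max); /* 1st expf call */
    for (i = 0; i < n; ++i) y[i] = expf(x[i] - max) / s;     /* 2nd expf call */
}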
If you have an efficient vectorized implementation of expf (NNPACK does), softmax is a memory/cache-bandwidth-bound kernel, and storing the intermediate results is less efficient than recomputing them.
Where is this vectorized expf implemented? I only see softmax calling the standard expf. Let's suppose expf from libm is vectorized. Is there any benchmark showing NNPACK's implementation is really faster? I doubt it, actually. Exp is quite expensive even when vectorized.
This is the reference implementation of softmax, i.e. the implementation used as a reference in unit tests. It is designed to be simple, readable, and correct, which is why I linked it here.
Thanks for the pointer. The full softmax implementation is here [1]. I have not read the code, but I can trust the developer to have a very fast implementation. Nonetheless, I don't think the reference implementation in your original link is optimal. Exp is expensive and should not be called twice (EDIT: unless you can show a benchmark to prove me wrong).
I later realized he is the developer, but that does not change the discussion. Here is a micro-benchmark: computing softmax 1 million times over a random vector of size 1000. On an old Linux server, calling the libm expf once takes 11.76 CPU seconds; calling it twice takes 25.15s. The implementation that calls expf once:
#include <float.h> /* FLT_MAX */
#include <math.h>  /* expf */

void softmax1(int n, const float *x, float *y)
{
    int i;
    float s, max = -FLT_MAX;
    /* pass 1: find the max for numerical stability */
    for (i = 0; i < n; ++i) max = max > x[i]? max : x[i];
    /* pass 2: call expf once per element, storing the result and accumulating the sum */
    for (i = 0, s = 0.0f; i < n; ++i) s += (y[i] = expf(x[i] - max));
    /* pass 3: rescale by the reciprocal of the sum */
    for (i = 0, s = 1.0f / s; i < n; ++i) y[i] *= s;
}
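A driver along these lines reproduces the setup (a minimal sketch; the original harness is not shown, and the input distribution is assumed uniform):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void softmax1(int n, const float *x, float *y); /* defined above */

int main(void)
{
    int i, n = 1000, rounds = 1000000;
    float *x = (float*)malloc(n * sizeof(float));
    float *y = (float*)malloc(n * sizeof(float));
    clock_t t0;
    for (i = 0; i < n; ++i) x[i] = (float)rand() / RAND_MAX; /* random vector of size 1000 */
    t0 = clock();
    for (i = 0; i < rounds; ++i) softmax1(n, x, y); /* 1 million softmax calls */
    fprintf(stderr, "%.2f CPU seconds\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
    free(x); free(y);
    return 0;
}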
This micro-benchmark proves my point: using expf from libm, the reference implementation in NNPACK is suboptimal. It is possible that a vectorized expf would change this conclusion, but the developer needs to prove it with numbers.
No, it cannot be implemented this way: it is numerically unstable and will produce NaNs if any input is greater than ~88.7, since expf overflows single precision for inputs above ln(FLT_MAX) ≈ 88.72, and inf/inf yields NaN. Luckily, it is also not how it's implemented in TensorFlow: https://github.com/tensorflow/tensorflow/blob/2c8d0dca978a24...
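A minimal illustration of the failure mode, assuming the naive formula above with no max subtraction:

#include <math.h>
#include <stdio.h>

int main(void)
{
    float x[2] = { 100.0f, 100.0f };        /* 100 > ln(FLT_MAX) ~= 88.72 */
    float s = expf(x[0]) + expf(x[1]);      /* inf + inf = inf */
    printf("%f\n", expf(x[0]) / s);         /* inf / inf = nan */
    return 0;
}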
For a clean (and more efficient) C version of this algorithm, take a look at the NNPACK reference implementation: https://github.com/Maratyszcza/NNPACK/blob/master/src/ref/so...