>>> Softmax can be implemented as a composition of primitive TensorFlow ops (exponent, reduction, elementwise division, etc.): softmax = exp(logits) / reduce_sum(exp(logits), dim)

No, it cannot be implemented this way: it is numerically unstable and will produce NaNs if any input is greater than ~88.7. Luckily, that is also not how it's implemented in TensorFlow: https://github.com/tensorflow/tensorflow/blob/2c8d0dca978a24...
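
To make the failure mode concrete, here is a tiny sketch (mine, not TensorFlow's code) of the naive formula blowing up in float32: expf overflows past roughly 88.72, and the resulting inf/inf division yields NaN.

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      float logits[2] = { 89.0f, 0.0f };
      float e0 = expf(logits[0]);      /* inf: expf overflows float32 above ~88.72 */
      float e1 = expf(logits[1]);      /* 1.0f */
      printf("%f\n", e0 / (e0 + e1));  /* inf / inf -> nan */
      return 0;
  }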

For a clean (and more efficient) C version of this algorithm, take a look at the NNPACK reference implementation: https://github.com/Maratyszcza/NNPACK/blob/master/src/ref/so...




The XLA and vanilla TF variants appear to be the same here:

https://github.com/tensorflow/tensorflow/blob/2c8d0dca978a24...

https://github.com/tensorflow/tensorflow/blob/2c8d0dca978a24...

Chalk it up to poetic license? ;-)


In the NNPACK implementation, the same exponential (i.e. expf) is computed twice for each element, which is a waste of time. A faster implementation should save each expf result to output[sample][channel] first, compute the sum, and then rescale output[sample][channel] by the sum.


If you have an efficient vectorized implementation of expf (NNPACK does), softmax is a memory/cache bandwidth-bound kernel, and storing data is less efficient than recomputing.
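
For concreteness, the recompute-instead-of-store shape looks roughly like this (my paraphrase of the approach, not the actual NNPACK code): the input is read in every pass, but the output is written exactly once, so the variant trades extra expf calls for less memory traffic.

  #include <float.h>
  #include <math.h>

  void softmax_recompute(int n, const float *x, float *y)
  {
      int i;
      float max = -FLT_MAX, sum = 0.0f;
      for (i = 0; i < n; ++i) max = max > x[i] ? max : x[i];
      /* first expf pass: accumulate the sum without storing anything */
      for (i = 0; i < n; ++i) sum += expf(x[i] - max);
      /* second expf pass: recompute and write each output exactly once */
      const float scale = 1.0f / sum;
      for (i = 0; i < n; ++i) y[i] = scale * expf(x[i] - max);
  }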


Where is this vectorized expf implemented? I only see softmax calling the standard expf. Let's suppose expf from libm is vectorized. Is there any benchmark showing NNPACK's implementation is really faster? I doubt it, actually. Exp is quite expensive even when vectorized.


This is the reference implementation of softmax, i.e. the implementation used as a reference in unit tests. It is designed to be simple, readable, and correct, which is why I linked it here.

The optimized implementation is in assembly (PeachPy). See https://github.com/Maratyszcza/NNPACK/blob/master/src/x86_64... for the vectorized expf.


Oops, here is the actually used vectorized expf (similar to the one I linked, but with optional unrolling): https://github.com/Maratyszcza/NNPACK/blob/master/src/x86_64...


Thanks for the pointer. The full softmax implementation is here [1]. I have not read the code, but I can trust the developer to have a very fast implementation. Nonetheless, I don't think the reference implementation in your original link is optimal. Exp is expensive and should not be called twice (EDIT: unless you can show a benchmark to prove me wrong).

[1]: https://github.com/Maratyszcza/NNPACK/blob/master/src/x86_64...


FWIW, you're replying to the developer of the file you linked to.


I later realized he is the developer, but that does not change the discussion. Here is a micro-benchmark: computing softmax 1 million times over a random vector of size 1000. On an old Linux server, the variant that calls the libm expf once per element takes 11.76 CPU seconds; the variant that calls it twice takes 25.15 s. The implementation that calls expf once:

  #include <float.h>   /* FLT_MAX */
  #include <math.h>    /* expf */

  void softmax1(int n, const float *x, float *y)
  {
      int i;
      float s, max = -FLT_MAX;
      /* subtract the max so expf never overflows */
      for (i = 0; i < n; ++i) max = max > x[i]? max : x[i];
      /* one expf per element, stored in y while accumulating the sum */
      for (i = 0, s = 0.0f; i < n; ++i) s += (y[i] = expf(x[i] - max));
      for (i = 0, s = 1.0f / s; i < n; ++i) y[i] *= s;
  }
This micro-benchmark proves my point: with expf from libm, the reference implementation in NNPACK is suboptimal. It is possible that a vectorized expf changes the picture, but the developer needs to prove it with numbers.
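
For anyone who wants to reproduce this, the driver looks roughly like the following (my sketch: softmax1 is the function above, and softmax2 would be a two-expf variant shaped like the NNPACK reference, pasted in alongside it; the seed and value range are arbitrary).

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  /* softmax1() as defined above; softmax2() is the two-expf variant
     (the sum pass and the output pass each call expf), pasted in separately. */
  void softmax1(int n, const float *x, float *y);
  void softmax2(int n, const float *x, float *y);

  int main(void)
  {
      enum { N = 1000, ITER = 1000000 };
      static float x[N], y[N];
      int i, k;
      srand(11);                                            /* arbitrary seed */
      for (i = 0; i < N; ++i)
          x[i] = 20.0f * rand() / (float)RAND_MAX - 10.0f;  /* logits in [-10, 10] */

      clock_t t = clock();
      for (k = 0; k < ITER; ++k) softmax1(N, x, y);
      printf("one expf per element: %.2f CPU s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

      t = clock();
      for (k = 0; k < ITER; ++k) softmax2(N, x, y);
      printf("two expf per element: %.2f CPU s\n", (double)(clock() - t) / CLOCKS_PER_SEC);
      return 0;
  }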



