
If you have an efficient vectorized implementation of expf (NNPACK does), softmax is a memory/cache bandwidth-bound kernel, and storing data is less efficient than recomputing.
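To make the trade-off concrete, here is a minimal sketch in plain C of the "recompute" strategy (the function name softmax_recompute is made up, and scalar libm expf stands in for a vectorized kernel): the exponentials are recomputed in the final pass instead of being written out and re-read, so the output array is touched exactly once.

  #include <float.h>  /* FLT_MAX */
  #include <math.h>   /* expf */

  /* Sketch only: expf is evaluated twice per element, but x is merely
     read and y is written once, so with a cheap vectorized expf the
     loops are limited by memory bandwidth rather than by the exp itself. */
  void softmax_recompute(int n, const float *x, float *y)
  {
      int i;
      float s = 0.0f, max = -FLT_MAX;
      for (i = 0; i < n; ++i) max = max > x[i] ? max : x[i];
      for (i = 0; i < n; ++i) s += expf(x[i] - max);
      for (i = 0, s = 1.0f / s; i < n; ++i) y[i] = expf(x[i] - max) * s;
  }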



Where is this vectorized expf implemented? I only see softmax calling the standard expf. Let's suppose expf from libm is vectorized. Is there any benchmark showing NNPACK's implementation is really faster? I doubt it, actually. Exp is quite expensive even when vectorized.


This is the reference implementation of softmax, i.e. the implementation used as a reference in unit tests. It is designed to be simple, readable, and correct, which is why I linked it here.

The optimized implementation is in assembly (PeachPy). See https://github.com/Maratyszcza/NNPACK/blob/master/src/x86_64... for the vectorized expf.


Oops, here is the actually used vectorized expf (similar to the one I linked, but with optional unrolling): https://github.com/Maratyszcza/NNPACK/blob/master/src/x86_64...


Thanks for the pointer. The full softmax implementation is here [1]. I have not read the code, but I can trust the developer to have a very fast implementation. Nonetheless, I don't think the reference implementation in your original link is optimal. Exp is expensive and should not be called twice (EDIT: unless you can show a benchmark to prove me wrong).

[1]: https://github.com/Maratyszcza/NNPACK/blob/master/src/x86_64...


FWIW, you're replying to the developer of the file you linked to.


I later realized he is the developer, but this does not change the discussion. Here is a micro benchmark: computing softmax 1 million times over a random vector of size 1000. On an old Linux server, the version that calls the libm expf once takes 11.76 CPU seconds; the version that calls it twice takes 25.15s. The implementation that calls expf once:

  #include <float.h>  /* FLT_MAX */
  #include <math.h>   /* expf */

  void softmax1(int n, const float *x, float *y)
  {
      int i;
      float s, max = -FLT_MAX;
      /* pass 1: find the maximum for numerical stability */
      for (i = 0; i < n; ++i) max = max > x[i]? max : x[i];
      /* pass 2: exponentiate once, store in y, accumulate the sum */
      for (i = 0, s = 0.0f; i < n; ++i) s += (y[i] = expf(x[i] - max));
      /* pass 3: normalize in place */
      for (i = 0, s = 1.0f / s; i < n; ++i) y[i] *= s;
  }
This micro benchmark proves my point: with expf from libm, the reference implementation in NNPACK is suboptimal. It is possible that a vectorized expf would change that conclusion, but the developer needs to prove it with numbers.
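A harness along these lines shows how the measurement can be set up (the array size, iteration count, clock()-based timing, and random inputs are illustrative assumptions, not the exact code I ran):

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  void softmax1(int n, const float *x, float *y);  /* as defined above */

  int main(void)
  {
      enum { N = 1000, ITER = 1000000 };
      static float x[N], y[N];
      int i, j;
      for (i = 0; i < N; ++i) x[i] = (float)rand() / RAND_MAX;
      clock_t t = clock();
      for (j = 0; j < ITER; ++j) softmax1(N, x, y);
      /* print y[0] so the calls cannot be optimized away */
      printf("%g  %.2f CPU seconds\n", y[0],
             (double)(clock() - t) / CLOCKS_PER_SEC);
      return 0;
  }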



