XLA: linear algebra library for TensorFlow (googleblog.com)
176 points by mud_dauber on March 8, 2017 | 28 comments



>>> Softmax can be implemented as a composition of primitive TensorFlow ops (exponent, reduction, elementwise division, etc.): softmax = exp(logits) / reduce_sum(exp(logits), dim)

No, it cannot be implemented this way: the naive formula is numerically unstable and will produce NaNs if any input is greater than ~88.7 (the point where expf overflows to infinity). Luckily, that is also not how it's implemented in TensorFlow: https://github.com/tensorflow/tensorflow/blob/2c8d0dca978a24...
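
To make the failure mode concrete, here is a small standalone C sketch (my own illustration, not code from TensorFlow or NNPACK): with the naive formula the sum of exponentials overflows to infinity and the division yields NaN, while subtracting the row max first keeps every exponent at or below zero and the result finite.

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      /* logits chosen so that expf overflows: expf(x) == +inf for x > ~88.7 */
      float logits[3] = {10.0f, 100.0f, 1000.0f};
      float naive_sum = 0.0f, stable_sum = 0.0f, max = logits[0];
      int i;

      for (i = 0; i < 3; ++i) naive_sum += expf(logits[i]);        /* becomes +inf */
      for (i = 1; i < 3; ++i) if (logits[i] > max) max = logits[i];
      for (i = 0; i < 3; ++i) stable_sum += expf(logits[i] - max); /* stays finite */

      printf("naive:  %f\n", expf(logits[2]) / naive_sum);         /* inf/inf = nan */
      printf("stable: %f\n", expf(logits[2] - max) / stable_sum);  /* ~1.0 */
      return 0;
  }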

For a clean (and more efficient) C version of this algorithm, take a look at the NNPACK reference implementation: https://github.com/Maratyszcza/NNPACK/blob/master/src/ref/so...


The XLA and vanilla TF variants appear to be the same here:

https://github.com/tensorflow/tensorflow/blob/2c8d0dca978a24...

https://github.com/tensorflow/tensorflow/blob/2c8d0dca978a24...

Chalk it up to poetic license? ;-)


In the NNPACK implementation, the same exponential (i.e. expf) is computed twice for each element, which is a waste of time. A faster implementation should save each expf result to output[sample][channel] first, compute the sum, and then rescale output[sample][channel] by the sum.


If you have an efficient vectorized implementation of expf (NNPACK does), softmax is a memory/cache-bandwidth-bound kernel, and storing the intermediate results is less efficient than recomputing them.
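
For concreteness, here is a rough C sketch of that recompute strategy (my own paraphrase of the idea, not the actual NNPACK reference code): expf is evaluated twice per element, but no intermediate buffer is written; the only stores are the final outputs.

  /* needs <float.h> for FLT_MAX and <math.h> for expf */
  void softmax_recompute(int n, const float *x, float *y)
  {
      int i;
      float s = 0.0f, max = -FLT_MAX;
      for (i = 0; i < n; ++i) max = max > x[i] ? max : x[i];
      /* first expf pass: only accumulate the sum, store nothing */
      for (i = 0; i < n; ++i) s += expf(x[i] - max);
      /* second expf pass: recompute each exponential and write the output once */
      for (i = 0, s = 1.0f / s; i < n; ++i) y[i] = expf(x[i] - max) * s;
  }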


Where is this vectorized expf implemented? I only see softmax calling the standard expf. Even supposing expf from libm is vectorized, is there any benchmark showing NNPACK's implementation is really faster? I doubt it, actually. Exp is quite expensive even when vectorized.


This is the reference implementation of softmax, i.e. the implementation used as a reference in unit tests. It is designed to be simple, readable, and correct, which is why I linked it here.

The optimized implementation is in assembly (PeachPy). See https://github.com/Maratyszcza/NNPACK/blob/master/src/x86_64... for the vectorized expf.


Oops, here is the vectorized expf that is actually used (similar to the one I linked, but with optional unrolling): https://github.com/Maratyszcza/NNPACK/blob/master/src/x86_64...


Thanks for the pointer. The full softmax implementation is here [1]. I have not read the code, but I can trust the developer to have a very fast implementation. Nonetheless, I don't think the reference implementation in your original link is optimal. Exp is expensive and should not be called twice (EDIT: unless you can show a benchmark to prove me wrong).

[1]: https://github.com/Maratyszcza/NNPACK/blob/master/src/x86_64...


FWIW, you're replying to the developer of the file you linked to.


I later realized he is the developer, but this does not change the discussion. Here is a microbenchmark: computing softmax 1 million times over a random vector of size 1000. On an old Linux server, the variant that calls libm's expf once per element takes 11.76 CPU seconds; the variant that calls it twice takes 25.15s. The implementation that calls expf once:

  /* needs <float.h> for FLT_MAX and <math.h> for expf */
  void softmax1(int n, const float *x, float *y)
  {
      int i;
      float s, max = -FLT_MAX;
      /* pass 1: find the max for numerical stability */
      for (i = 0; i < n; ++i) max = max > x[i]? max : x[i];
      /* pass 2: store each exponential and accumulate the sum */
      for (i = 0, s = 0.0f; i < n; ++i) s += (y[i] = expf(x[i] - max));
      /* pass 3: rescale the stored exponentials by 1/sum */
      for (i = 0, s = 1.0f / s; i < n; ++i) y[i] *= s;
  }
This microbenchmark proves my point: with expf from libm, the reference implementation in NNPACK is suboptimal. It is possible that a vectorized expf changes the picture, but the developer needs to prove it with numbers.
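
For anyone who wants to reproduce numbers along these lines, a rough timing harness matching the setup described above (1 million softmax calls over a random 1000-element vector; the seed, value range, and exact figures are arbitrary and machine-dependent) could look like this, with the two-expf variant swapped in for the comparison run:

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  void softmax1(int n, const float *x, float *y);  /* the function above */

  int main(void)
  {
      enum { N = 1000, ITER = 1000000 };
      static float x[N], y[N];
      int i;
      clock_t t0;

      srand(11);
      for (i = 0; i < N; ++i) x[i] = 2.0f * (float)rand() / RAND_MAX - 1.0f;

      t0 = clock();
      for (i = 0; i < ITER; ++i) softmax1(N, x, y);
      /* print a result so the compiler cannot discard the loop */
      printf("y[0] = %g\n", y[0]);
      printf("softmax1: %.2f CPU seconds\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
      return 0;
  }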


Chris Leary, a compiler engineer at Google, gave a talk about XLA at the recent TensorFlow Dev Summit:

https://www.youtube.com/watch?v=kAOanJczHA0


Thank you OP, this is really helpful. :)

If you need to install TensorFlow on Windows 10, you can follow this:

http://saintlad.com/install-tensorflow-on-windows/


Easier than that: I got Python 3.6 64-bit and headed to Christoph Gohlke's amazing Python wheels repository [0], downloaded everything I needed (NumPy, SciPy, OpenCV 3.2, matplotlib, etc., plus TensorFlow), and then it was just a matter of:

    pip3 install [filename]
...until everything was installed. I gotta buy that man more than a few beers for all the time he's saved me over the years. Bonus points: I got the wheels with Intel MKL and OpenCV+contrib.

[0] http://www.lfd.uci.edu/~gohlke/pythonlibs/


I would like to see the guide for the GPU version. Is it coming anytime soon? :-) Thanks.


Anaconda has GPU and non-GPU packages, if that helps.


It would seem Torch/PyTorch are faster than TF. TF uses static optimizations on the computation graph while Torch has a dynamic computation graph. Logically, static optimizations should be faster because they know the data size beforehand.

So, why is TF slower?


TensorFlow is getting dynamic JIT optimization too. I think part of the reason some dynamic optimizations might perform better is that the results of the optimization can be cached for most other batches, and they can specialize to exploit batch-, shape-, or input-specific properties.


I'm kind of getting the sense that TF is presently being optimized for its own massive development: that is, the engineers are yanking out chunks of code and replacing them, quickly, and at varying scales.


A nitpick, but an important one: only PyTorch uses a dynamic computation graph. Torch doesn't have any concept of a computation graph.


TF has runtime cost-based optimization, which happens inside Session.run().


I've been looking around in a few places, but I can't find a way to use XLA to compile TensorFlow models for mobile devices. Is there a tutorial/blog post by Google (or anyone, for that matter) talking about it? Thanks!


Did you see the "using tfcompile" section of the docs? https://www.tensorflow.org/versions/master/experimental/xla/...

If you're looking for more detailed information that's missing from the docs, please do file a GitHub issue about it. Thanks!


Even if you're not interested in machine learning or AI, XLA, and particularly its Python bindings, is a great and easy way to do GPGPU programming.


Why does this support JIT but not AOT for NVIDIA GPUs?


AOT for GPUs is doable. Do you have a killer use case?

For CPU, mobile code footprint reduction was the driving force.


Not having to ship a toolchain (nvcc, gpucc, or whatever equivalent is linked as a library)?


It's gpucc, so it builds from LLVM when you enable XLA in the TF configure step.

Trying to understand: do you not want to ship a compiler library on principle, or is it some kind of product requirement, or something else? There's lots of cool work to be done in the compiler space, so use cases help to prioritize. :-) Thanks!


Which, from a dev POV, means having to ship a tool(chain) (for the AoT stuff)!



