>>> Softmax can be implemented as a composition of primitive TensorFlow ops (exponentiation, reduction, elementwise division, etc.): softmax = exp(logits) / reduce_sum(exp(logits), dim)
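For concreteness, here is that composition transcribed into plain C (a hypothetical softmax_naive, just to make the quoted formula explicit; as pointed out further down in the thread, this naive form is numerically unstable):

#include <math.h>

/* Naive softmax: y[i] = expf(x[i]) / sum_j expf(x[j]).
 * Direct transcription of the formula above; expf overflows
 * to +inf for inputs above ~88.7, giving inf/inf = NaN. */
void softmax_naive(int n, const float *x, float *y)
{
    int i;
    float s = 0.0f;
    for (i = 0; i < n; ++i) s += (y[i] = expf(x[i]));
    for (i = 0; i < n; ++i) y[i] /= s;
}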
In the NNPACK implementation, the same exponential (i.e. expf) is computed twice for each element, which is a waste of time. A faster implementation would save each expf result to output[sample][channel] first, compute the sum, and then rescale output[sample][channel] by the sum.
If you have an efficient vectorized implementation of expf (NNPACK does), softmax is a memory/cache bandwidth-bound kernel, and storing data is less efficient than recomputing.
Where is this vectorized expf implemented? All I see is softmax calling the standard expf. Let's suppose expf from libm is vectorized. Is there any benchmark showing NNPACK's implementation is really faster? I doubt it, actually. Exp is quite expensive even when vectorized.
This is the reference implementation of softmax, i.e. the implementation used as a reference in unit tests. It is designed to be simple, readable, and correct, which is why I linked it here.
Thanks for the pointer. The full softmax implementation is here [1]. I have not read the code, but I can trust the developer to have a very fast implementation. Nonetheless, I don't think the reference implementation in your original link is optimal. Exp is expensive and should not be called twice (EDIT: unless you can show a benchmark to prove me wrong).
I later realized he is the developer, but that does not change the discussion. Here is a micro benchmark: computing softmax 1 million times over a random vector of size 1000. On an old Linux server, calling the libm expf once per element takes 11.76 CPU seconds; calling it twice takes 25.15s. The implementation that calls expf once:
#include <float.h>  /* FLT_MAX */
#include <math.h>   /* expf */

void softmax1(int n, const float *x, float *y)
{
    int i;
    float s, max = -FLT_MAX;
    /* subtract the max for numerical stability */
    for (i = 0; i < n; ++i) max = max > x[i]? max : x[i];
    /* one expf call per element; keep each result in y[] */
    for (i = 0, s = 0.0f; i < n; ++i) s += (y[i] = expf(x[i] - max));
    /* rescale by the reciprocal of the sum */
    for (i = 0, s = 1.0f / s; i < n; ++i) y[i] *= s;
}
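For completeness, a driver along these lines reproduces the setup (a minimal sketch; the seed, input distribution, and timing mechanism are assumptions on my part, only the 1M x 1000 shape comes from the description above):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void softmax1(int n, const float *x, float *y);  /* from the snippet above */

int main(void)
{
    enum { N = 1000, ITER = 1000000 };
    static float x[N], y[N];
    int i, j;
    clock_t t;

    srand(11);  /* arbitrary seed; the original post does not say */
    for (i = 0; i < N; ++i)
        x[i] = 2.0f * rand() / RAND_MAX - 1.0f;  /* random values in [-1, 1] */

    t = clock();
    for (j = 0; j < ITER; ++j)
        softmax1(N, x, y);
    fprintf(stderr, "%.2f CPU seconds\n", (double)(clock() - t) / CLOCKS_PER_SEC);
    return 0;
}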
This micro benchmark proves my point: using expf from libm, the reference implementation in NNPACK is suboptimal. It is possible that a vectorized expf would change the picture, but the developer needs to prove it with numbers.
Easier than that: I got 64-bit Python 3.6 and headed to Christoph Gohlke's amazing Python wheels repository, downloaded everything I needed (NumPy, SciPy, OpenCV 3.2, matplotlib, etc., plus TensorFlow), and then it was just a matter of:
pip3 install [filename]
...until everything was installed. I gotta buy that man more than a few beers for all the time he's saved me over the years. Bonus points: got the wheels with Intel MKL and OpenCV+contrib.
It would seem Torch/PyTorch are faster than TF. TF uses static optimizations on the computation graph, while Torch has a dynamic computation graph. Logically, static optimizations should be faster because they know the data sizes beforehand.
TensorFlow is getting dynamic JIT optimization too. I think part of the reason some dynamic optimizations can perform better is that the results can be cached and reused for most other batches, and they can specialize to exploit batch-, shape-, or input-specific properties.
I'm kind of getting the sense that TF is presently being optimized for its own massive development: that is, the engineers are yanking out chunks of code and replacing them, quickly, and at varying scales.
I've been looking around in a few places, but I can't find a way to use XLA to compile TensorFlow models for mobile devices. Is there a tutorial/blog post by Google (or anyone, for that matter) about it?
Thanks!
It's gpucc, so it builds from LLVM when you enable XLA in the TF configure step.
Trying to understand: do you not want to ship a compiler library on principle, or is it some kind of product requirement, or something else? There's lots of cool work to be done in the compiler space, so use cases help to prioritize. :-) Thanks!
No, it cannot be implemented this way: it is numerically unstable and will produce NaNs if any input is greater than ~88.7 (expf overflows to +inf above ln(FLT_MAX) ≈ 88.72, and the subsequent inf/inf division yields NaN). Luckily, that is also not how it's implemented in TensorFlow: https://github.com/tensorflow/tensorflow/blob/2c8d0dca978a24...
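To see where the ~88.7 cutoff comes from, here is a standalone C snippet (FLT_MAX ≈ 3.4e38, so the overflow point is ln(FLT_MAX) ≈ 88.72):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* expf overflows to +inf just above ln(FLT_MAX) ~= 88.72 */
    printf("expf(88.7f) = %g\n", expf(88.7f));                /* ~3.3e38, still finite */
    printf("expf(88.8f) = %g\n", expf(88.8f));                /* inf */
    printf("inf / inf   = %g\n", expf(88.8f) / expf(88.8f));  /* nan */
    return 0;
}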
For a clean (and more efficient) C version of this algorithm, take a look at the NNPACK reference implementation: https://github.com/Maratyszcza/NNPACK/blob/master/src/ref/so...