Automatic Algorithms Optimization via Fast Matrix Exponentiation (2015) (kukuruku.co)
104 points by zawerf on July 23, 2018 | 18 comments



So, compilers can do this, and do do this, depending on the language and its guarantees.

GCC and LLVM use a chains-of-recurrences algebra and form for symbolic analysis that could easily represent this (or be extended to). But they probably won't bother any time soon.

To understand why requires a little more detail first:

The statement "However, compilers do not replace the computational algorithm, written by a human, by a more asymptotically efficient one."

is mostly wrong.

They generally do where they are allowed. They just aren't often allowed, or it's not worth it.

There are languages that are playing with this even more heavily.

But whether it's a good idea, eh.

In numerics in particular, people often pick their algorithms very carefully for stability/etc reasons.

For more generic things like sorts, a lot of this gets hidden in standard libraries anyway.

For things nobody could really complain about, the compilers do idiom recognition of things like memcpys, string compares, etc and replace them.

In any case, the problem is not recognizing and replacing the algorithm or idiom. The problem is deciding when doing so actually serves your users well.

Going back up, when we first implemented scalar evolutions in GCC, we supported everything you can think of - closed form polynomial functions, matrices, mixers.

In the end though, we were spending a lot of time handling cases that were very rare, mostly optimal as written anyway (so not useful for the optimizer), but still required exponential time solving/etc to do anything with.

So a lot of that code was removed.

Could you handle matrix exponentiation because it's fast? Sure.

But it's unlikely to be worth it.
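
To make that concrete, here is a toy sketch in Python (not compiler IR, and not GCC's actual code; the function names are mine) of the kind of closed-form replacement that scalar evolution analysis enables:

  def sum_of_squares_loop(n):
      # As written: O(n) iterations; i and total are simple recurrences
      # that a chains-of-recurrences analysis can describe symbolically.
      total = 0
      for i in range(n):
          total += i * i
      return total

  def sum_of_squares_closed_form(n):
      # The closed-form polynomial a compiler could substitute: O(1).
      # sum_{i=0}^{n-1} i^2 = (n-1) * n * (2n-1) / 6
      return (n - 1) * n * (2 * n - 1) // 6

  assert all(sum_of_squares_loop(n) == sum_of_squares_closed_form(n)
             for n in range(200))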


> compilers ... replace the computational algorithm, written by a human, by a more asymptotically efficient one.

Andrew Appel in Compiling with Continuations gave an example of this. I forget exactly how it went, but it was something like this:

  (define (range a b)
    (if (> a b)
        '()
        (cons a (range (+ a 1) b))))
  
  (define (sum xs)
    (if (null? xs)
        0
        (+ (car xs) (sum (cdr xs)))))
  
  (define (triangle-sum n)
    (if (= n 0)
        0
        (let* ((xs (range 1 n))
               (triangle (sum xs)))
          (+ triangle (triangle-sum (- n 1))))))
The triangle-sum program, which adds the triangular numbers from 1 to n, will, as written, consume O(n^2) space, because it creates the list "xs" of length n before recursing down to n-1, and the list will not be freed until the entire computation terminates. An optimizing compiler might discover that, once it's computed "triangle", it can discard the list "xs"; this would reduce the required space to O(n).
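
Here is a rough Python analog (mine, not Appel's transformation); the explicit del plays the role of the compiler discovering that the list is dead after the sum:

  def triangle_sum(n):
      if n == 0:
          return 0
      xs = list(range(1, n + 1))   # O(n) list built in this frame
      triangle = sum(xs)
      del xs   # without this, each of the n live frames keeps its own list,
               # so peak space is O(n^2); with it, only one list is live: O(n)
      return triangle + triangle_sum(n - 1)

  print(triangle_sum(100))   # 171700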

Hoisting a computation like (length <list-of-length-n>) out of a loop can similarly reduce the time taken from O(n^2) to O(n).
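
For example (a made-up Python sketch; the O(n) length helper stands in for Lisp's list length, since Python's built-in len() is O(1)):

  def length(xs):
      # Walks the whole list, like (length ...) on a cons list: O(n).
      n = 0
      for _ in xs:
          n += 1
      return n

  def count_naive(xs):
      total = 0
      for _ in xs:
          total += length(xs)   # loop-invariant call recomputed every time: O(n^2)
      return total

  def count_hoisted(xs):
      n = length(xs)            # hoisted out of the loop: O(n) overall
      total = 0
      for _ in xs:
          total += n
      return total

  assert count_naive(list("abcde")) == count_hoisted(list("abcde")) == 25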


I think optimizing time complexity is a bad idea in C++. Let's say I write an O(n^2) sort because I know it will be faster for the small n's I have; then I don't want some meddling compiler to "optimize" it for me.

Another reason: people will come to depend on it, and then seemingly innocent changes will mysteriously degrade performance. That's btw why I'm excited about guaranteed copy elision in C++17 - I feel uncomfortable relying on the compiler probably not copying my massive objects.

Overall I could see the trade-off being worthwhile in some languages, but not C++.


It is not about being allowed at all. Almost always, the compiler does not understand the required high-level result.

There are some attempts to patch typical general algorithms (e.g. tree searches and implementations) with faster equivalents. They typically fail as devising such structure safely from low level code is a really hard problem.


"The compiler does not understand the required high level result almost always."

Depends heavily on the language and code we are talking about. For C++, there are plenty of compilers that can replace C++ standard library calls and usage with faster, more optimal, C++ library calls and usage. That's because it is possible to understand the guarantees needed.

That's about as far as you can get in a language like C++. You can do complex idiom recognition (i.e. recognize handwritten sorts, whatever) but it's usually not worth it.

Now obviously, there are languages that express invariants/contracts/etc that make this type of optimization much easier to do.

"They typically fail as devising such structure safely from low level code is a really hard problem."

Actually, the safety part is easier than the allowed part if you are trying to split them.

But usually they are the same thing - the language does not allow it, therefore it is unsafe.

It is generally easy and well understood how to analyze and know what set of preconditions make any particular transform safe, and even if you can't prove it statically, you can insert runtime checks to make sure the preconditions hold. This generalizes very well. It's just not worth it most of the time.

This is why we focus on subsets like loops, and things like polyhedral loop opts, etc, have complex cost models that try to take into account the cost of dynamic checks, etc.

(I spent years funding and talking to researchers about this)
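
A minimal sketch of the runtime-check idea in Python (the function names and the bytes precondition are invented for illustration; real compilers do this versioning on loops and pointer checks, not whole functions):

  def sum_values(buf):
      # The original, fully general code.
      total = 0
      for x in buf:
          total += x
      return total

  def sum_values_versioned(buf):
      # "Versioning": a cheap dynamic check guards the specialized fast path;
      # if the precondition doesn't hold, fall back to the untouched original.
      if isinstance(buf, (bytes, bytearray)):   # precondition checked at runtime
          return sum(buf)                       # specialized path
      return sum_values(buf)                    # fallback

  print(sum_values_versioned(b"\x01\x02\x03"))  # 6
  print(sum_values_versioned([1.5, 2.5]))       # 4.0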


As a follow up, there are ways to detect high level patterns (such as matmul). Polly, LLVM's polyhedral loop optimiser, does this for matmul (https://polly.llvm.org/docs/Performance.html). We plan on generalising this to support more operations.


There's always been a sort of pie-in-the-sky goal of "let people just say what the algorithm does, and let the compiler figure out the best way to implement that algorithm" with the idea being that the programmer doesn't have to concern himself/herself with details about the hardware that bear on the implementation.

The main problem with that idea is that you have to find some language to describe the design, which often ends up being an implementation, and when you look at the invariants that must be maintained for correctness, it's difficult to deviate from that implementation substantially.

Floating point arithmetic, which is the core of most numerical algorithms, is not associative (it is commutative). If someone asks you to sum up a list going from left to right, it is not legal to transform that into a sum that goes from right to left. And unfortunately, in a mathematical sense, associativity is the property that's often the most useful.
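
A quick demonstration with Python floats (IEEE doubles):

  a, b, c = 1e16, -1e16, 1.0

  print((a + b) + c)   # 1.0
  print(a + (b + c))   # 0.0 -- the 1.0 is absorbed when added to -1e16 first

  # Summation order changes the result, so a left-to-right sum may not be
  # rewritten as a right-to-left or tree-shaped sum.
  print(sum([1e16, 1.0, -1e16]), sum([1e16, -1e16, 1.0]))   # 0.0 1.0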

So, in practice, the most useful cases to do this kind of heavy-mathematics optimization on are the ones where it's not legal in the first place. Even beyond that, many of these kinds of algorithms are already implemented in a core library that's been around for decades and has been highly tuned for any architecture you want to run on, so the actual prospect of seeing real gains with a compiler is rather small.


> The main problem with that idea is that you have to find some language to describe the design, which often ends up being an implementation, and when you look at the invariants that must be maintained for correctness, it's difficult to deviate from that implementation substantially.

Can we please stop repeating this line? It was false 50-60 years ago when garbage collection and register allocation were invented. It was false when yacc was built. And it's false today.

The gap in detail between specification and implementation is huge. If you can change a function without changing its callers, then your program has some slack between the assumptions placed on that function and its implementation. From low-level performance things like changes to improve cache locality, to high-level ones like scheduling of data pipelines, there's a lot of space for synthesizers and search-based software improvement tools to work with.

So, are you curious about program synthesis for floating-point accuracy? Have a look at Herbie.


> Can we please stop repeating this line? It was false 50-60 years ago when garbage collection and register allocation were invented. It was false when yacc was built. And it's false today.

No, it's still very true today. We have pushed the bounds of what implementation details we are free to ignore, but those bounds are still very much on the order of "you have to program the algorithm I implemented" as opposed to "you have the freedom to choose any algorithm so long as the output obeys these mathematical constraints." Program synthesis techniques have improved and are improving, but they're still only really applicable to kernel-style transformations.


What's a kernel-style transformation? That term is unfamiliar to me. Are you referring to, e.g., optimizing stencil kernels?

You're correct that classical semantics-preserving compiler optimizations are very restricted, when not user guided as in the OP's tool. They do not have any form of specification weaker than the input code. This is why most synthesis tools do take something higher level than code.

Then again, some really do have the workflow "figure out existing code's algorithm, and pick a better one." The first thing in this space that comes to mind is verified lifting, being pursued by my collaborators at UW.


> It is not about being allowed at all.

> They typically fail as devising such structure safely from low level code is a really hard problem.

I took the phrase "being allowed to" as meaning "it is known to be safe to", in which case you seem to be agreeing with the parent (i.e. it's usually not allowed AKA it's usually not known to be safe).


This was a hot topic 10-15 years ago. Then, over a few years, we went from "there has been little work applying computer algebra methods to program analysis and optimization" to "everything under the sun has already been done."

It's been about 3 years since I looked at this stuff, and I'm having trouble remembering the names of the people in this area (lots of Europeans). Most of the action is in finding ever-larger fragments of programming languages they can solve. The one in this article handles what are called "arithmetic programs;" a bunch of people have worked on handling restricted shades of recursive programs.


Unfortunately it only works for toy examples.

> cpmoptimize.recompiler.RecompilationError: Can't optimize loop: Unsupported instruction (LOAD_ATTR, 'b') at line 12 in fib, file "test.py"


It is funny to see the repeated-squares algorithm every time it is rediscovered. Knuth, unsurprisingly, covered it in one of his earlier papers. He presented it as related to addition chains and showed that plain repeated squaring is not always superior.
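
The usual concrete example is x^15: the binary repeated-squaring method takes 6 multiplications, while the addition chain 1, 2, 3, 6, 12, 15 needs only 5. A quick sketch:

  def pow_binary(x, n):
      # Left-to-right square-and-multiply, counting multiplications (n >= 1).
      bits = bin(n)[3:]          # bits after the leading 1
      result, muls = x, 0
      for bit in bits:
          result *= result; muls += 1   # square for every remaining bit
          if bit == '1':
              result *= x; muls += 1    # extra multiply for a 1 bit
      return result, muls

  def pow_chain_15(x):
      # Addition chain 1, 2, 3, 6, 12, 15: five multiplications.
      x2 = x * x
      x3 = x2 * x
      x6 = x3 * x3
      x12 = x6 * x6
      return x12 * x3, 5

  print(pow_binary(3, 15))   # (14348907, 6)
  print(pow_chain_15(3))     # (14348907, 5)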

I'm curious what all came of this effort, seeing it is marked 2015. It is also fun to see benchmarks of this method used in executing fib.


It's not just the repeated squares algorithm though. The novel part about this post is the application of it.

For most languages, if you write a for loop, the compiler is not smart enough to generate anything that runs in sublinear time. It might do some loop unrolling, reordering, or just skip running the whole loop altogether if it's dead code.

But this is the first time I have seen something that can rewrite a loop to run in a different runtime complexity (specifically logarithmic time).

He relies on rewriting Python bytecode, but there's no reason why, say, a C compiler couldn't automatically identify a loop that can be rewritten as a linear recurrence relation and thus has a matrix exponentiation form.
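
Roughly, the transformation looks like this when sketched by hand in Python (the underlying idea only, not the article's actual implementation):

  def fib_loop(n):
      # As written: n iterations, each a linear update of the state (a, b).
      a, b = 0, 1
      for _ in range(n):
          a, b = b, a + b
      return a

  def mat_mul(A, B):
      # 2x2 matrix product.
      return [[A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]],
              [A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]]]

  def mat_pow(M, n):
      # Fast exponentiation by repeated squaring: O(log n) 2x2 products.
      R = [[1, 0], [0, 1]]
      while n:
          if n & 1:
              R = mat_mul(R, M)
          M = mat_mul(M, M)
          n >>= 1
      return R

  def fib_matrix(n):
      # The loop body is a linear recurrence, so the whole loop collapses to
      # one matrix power: [[1,1],[1,0]]^n holds F(n) off the diagonal.
      return mat_pow([[1, 1], [1, 0]], n)[0][1]

  assert all(fib_loop(n) == fib_matrix(n) for n in range(50))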


Clang will optimize basic loops away:

https://godbolt.org/g/GMjKZm


> ... this is the first time I have seen something that can rewrite a loop to run in a different runtime complexity ...

Same here, but I really want to know how often this helps in real programs.


This isn't new, though. Just doesn't come up as often as folks would like. Consider benchmarks that get reduced to constant. Happens literally all the time.

This particular change is neat, and is ultimately keyed on seeing optimizations from really old math. Which is the only thing I was wanting to add.



