Although this post is interesting, the best way to optimize trig functions is generally via rational approximations[1] if you can live with the error (and the clamping). I remember finding a rational function to approximate a tanh-like soft clipper via a Padé approximator on (-1, 1) a few years ago[2]. See an equivalent rational approximation for atan2 here[3].
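For flavor, here's a minimal Go sketch of that kind of rational soft clipper. It uses the standard [3/2] Padé approximant of tanh about 0 (not the exact function from [2]), with the output clamped to [-1, 1] — the clamping mentioned above:

    // Go sketch; assumes import "math".
    // softClip uses the [3/2] Padé approximant of tanh,
    // x(15 + x^2)/(15 + 6x^2), which matches tanh through the x^5 term.
    // The approximant is strictly increasing, so clamping the output
    // is enough to keep it in [-1, 1].
    func softClip(x float64) float64 {
        x2 := x * x
        t := x * (15 + x2) / (15 + 6*x2)
        return math.Max(-1, math.Min(1, t))
    }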
A bit OT but my favorite way to approximate tanh is to do g(f(x)) where f(x) is an odd polynomial and g(x) is 1/sqrt(1 + x^2). The recip sqrt can be computed very quickly using a Newton iteration, and many SIMD instruction sets have an instruction for it. Highly recommended for audio applications.
Something is definitely off about your recollection here. g(x) = 1/√(1+x²) is always positive, whereas tanh(x) is negative for negative values of x. So regardless of your f, g(f(x)) can’t ever be a correct approximation for negative x.
Oops, yes, you're correct, I was typing from memory and didn't double-check.
And yes, the Taylor series is pretty good, but you can usually squeeze a bit more accuracy-over-range by, e.g., using an optimizer for the coefficients. Also, note that in your notebook you have x^3/5, but x^3/6 is probably what you wanted.
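For the record, a minimal sketch of the corrected recipe: since tanh(x) = sinh(x)/√(1 + sinh(x)²), the odd polynomial f should approximate sinh — here just its truncated Taylor series x + x³/6 as mentioned above. A tuned implementation would optimize the coefficients and swap math.Sqrt for a hardware rsqrt estimate plus a Newton step:

    // Go sketch; assumes import "math".
    // fastTanh approximates tanh(x) as f(x)/sqrt(1 + f(x)^2), where f is
    // an odd polynomial standing in for sinh.
    func fastTanh(x float64) float64 {
        f := x + x*x*x/6.0 // truncated Taylor series of sinh(x)
        return f / math.Sqrt(1+f*f)
    }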
Here's what atan looks like in openlibm: https://github.com/JuliaLang/openlibm/blob/1581174c85f7b645b... It's a degree-4 polynomial, with one division in the atan2->atan reduction, and at most one division in the argument reduction step in atan, giving a pretty accurate answer. By comparison, [3] uses a degree-3 polynomial and doesn't work outside [-1,1]. There is some kind of tradeoff between accuracy and performance, but most of the time it's not worth it. Maybe if division is ludicrously expensive or something unusual like that. I suspect most of the performance gains people are looking for come from omitting proper handling of special cases.
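To illustrate the structure (these are not openlibm's coefficients — the core is the well-known quadratic fit x·(π/4 + 0.273·(1 − |x|)), accurate to roughly 0.005 rad): a polynomial on [-1, 1] plus the identity atan(x) + atan(1/x) = ±π/2 for the reduction, costing at most the one division mentioned above.

    // Go sketch; assumes import "math".
    // approxAtan is only valid on [-1, 1]; illustrative coefficients.
    func approxAtan(z float64) float64 {
        return z * (math.Pi/4 + 0.273*(1-math.Abs(z)))
    }

    // fastAtan extends it to all finite inputs with one division.
    func fastAtan(x float64) float64 {
        if math.Abs(x) > 1 {
            return math.Copysign(math.Pi/2, x) - approxAtan(1/x)
        }
        return approxAtan(x)
    }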
Hmmm... I'm a bit foggy on the specific quote, but in the Graphics Programming Black Book, specifically the part where Abrash covers writing fast code, he says something along the lines of:
> Avoid special cases, because a special case means slow code
So yeah, handling of special cases will bite you on the performance side. Sometimes it's inevitable, but if you have a lot of control over (or are very specific about) what values get passed as parameters, then you can get away with not handling special cases (or just handling the ones you care about).
Yeah, but the people talking about finding their own interpolating polynomials and Padé approximations are looking for those approximations in the regular common cases, and omitting the special cases happens almost by accident. You could just take the same approximation used by openlibm or whatever, and it's likely better than anything you can come up with yourself.
I actually wanted a fast atan2 about 3 years ago in Java and held a small competition to see what people could come up with [1]. In plain old Java, it looks like it performed very well; I would expect it outperformed what they did in the post using LUTs. My methodology for micro benchmarks was pretty awful back then, so there's room for error. The fastest entry was just 3 branches and a memory lookup.
Additional magic is involved -- Math.atan2 is an intrinsic method in HotSpot, which will use a hand-coded implementation instead of the Java definition when possible: https://gist.github.com/apangin/7a9b7062a4bd0cd41fcc
Math.Atan2 was well outperformed by the custom implementations in the post I linked. The native version isn't well optimized. Apache's Atan2 is just as accurate and about twice as fast for example.
It should be noted that if you are computing sin/cos after atan2, then it's likely that what you are doing could be done with some regular vector manipulation (normalisation, possibly dot product, cross product).
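Concretely, something like this (a hedged sketch — whether it applies depends on what the angle is actually used for):

    // Go sketch; assumes import "math".
    // Instead of a := math.Atan2(y, x) followed by math.Sin(a) and
    // math.Cos(a), normalize the vector directly: no trig calls at all.
    func sinCos(y, x float64) (s, c float64) {
        r := math.Hypot(x, y) // length of (x, y); note (0, 0) yields NaNs
        return y / r, x / r
    }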
It's a flight-tracking algorithm which calculates the velocity of the aircraft from the messages it broadcasts.
EDIT: > then it's likely that what you are doing could be done with some regular vector manipulation (normalisation, possibly dot product, cross product).
Yes, totally. Like I said, I have zero knowledge of assembly. And the "vector manipulation" you're talking about is totally beyond me ;) Just doing this part took a lot out of me. :D
Maybe someone knowledgeable enough will be kind enough to write an improved implementation.
Sorry, I might have misunderstood something. But OP said - " if you are computing sin/cos after atan2,", and I don't need any sin/cos computation after atan2. Just the atan2.
The assembly version is using the packed version of the FMA instruction (that's what the "P" in the mnemonic stands for), but as far as I can tell, it's only using one of the packed values, whereas the instruction can calculate two (AVX) or four (AVX2) FMA operations at once. It might be possible to get some speedup by rearranging the calculation so it can use the full width of the vector registers - at first glance, at least the two sides of the division should be possible to calculate in parallel with half as many FMA instructions.
If anyone was curious why the instruction address didn't match up, it's because he read the instruction identifier wrong. It's also totally counterintuitive and you have to essentially read the full documentation to get to any real information about it. Anyway here we go. VEX is a prefix notation with specialised encodings:
C4 - For a three-byte instruction (which vfmadd is) this is C4; for a two-byte instruction it would be C5.
E2 - This one is more involved, but basically for this instruction the first three bits are 1, setting some modes (R, X, and B). Then, because this is a 0F38H instruction, the next 5 bits are 00010. Altogether you get 11100010 (E2).
E9 - This is also involved, but the first bit is the W mode. It's 1 for the instruction he used, but it's pretty much ignorable. Bits 2-5 encode the first source register the instruction uses; this is 1101, the inverted register number for XMM2/YMM2. Bit 6 is set to 0 because it's a 128-bit vector (it is). Bits 7-8 are set to 01 based on a lookup table for the "pp" value, which is 66 in the instruction identifier. Altogether that's 11101001 (E9).
A8 - this one is easy, it's the opcode.
C3 - This is the ModR/M byte, which in this case reports the register/memory operands. The first 2 bits are 11 to indicate the operand is the register itself (not displaced or truncated). The next 6 bits are the register codes: 000 encodes XMM0 (the destination), and 011 encodes XMM3 (the remaining source).
Altogether it works out to be C4E2E9A8C3. Not intuitive at all really.
Edit: Please someone correct me if I missed something. I hate this kind of stuff and I'm sure I made a mistake.
> "The reason for optimizing the math.Atan2 function is because my current work involves performing some math calculations. And the math.Atan2 call was in the hot path. "
He should take a look at:
a) the range of x which will be used for atan2(x)
b) the level of precision he really needs
There is a chance he could get away with precomputing an atan2(x) table in memory and then just reading the values off the table.
Quick as lightning, and an old trick, probably invented in the 1950s!!
EDIT: I see I am potentially wrong, due to slower DRAM access compared to some built-in FP instructions...
> There is a chance he could get away with precomputing an atan2(x) table in memory and then just reading the values off the table.
Quick as lightning, and an old trick, probably invented in the 1950s!!
I actually tried this for cos and sin lookups, and lerping between datapoints.
I recently needed an atan2(x, y) function to decode a quadrature sine signal from a position sensor, in C on a microcontroller. I started with a single divide to get x/y or y/x (some branching to decide which) followed by a table lookup. I only needed about 9 bits of precision. It was a strictly amateur hack job, but surprisingly fast.
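A hedged Go sketch of that divide-then-lookup shape (the 9-bit table and the octant fixup are my reconstruction, not the original C, and the y == x == 0 case is ignored in the spirit of the thread):

    // Go sketch; assumes import "math".
    const tableBits = 9 // ~9 bits of precision, as above
    const tableSize = 1 << tableBits

    var atanTable [tableSize + 1]float64

    func init() {
        for i := range atanTable {
            atanTable[i] = math.Atan(float64(i) / tableSize) // ratios in [0, 1]
        }
    }

    func lutAtan2(y, x float64) float64 {
        ax, ay := math.Abs(x), math.Abs(y)
        // one division, with branching so the ratio lands in [0, 1]
        var r float64
        if ay < ax {
            r = ay / ax
        } else {
            r = ax / ay
        }
        a := atanTable[int(r*tableSize+0.5)] // nearest entry, no lerp
        if ay >= ax {
            a = math.Pi/2 - a // the lookup gave the complementary angle
        }
        if x < 0 {
            a = math.Pi - a
        }
        if y < 0 {
            a = -a
        }
        return a
    }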
These old tricks are still pretty effective, especially if the table is small enough to be cached on the CPU... then there are even more tricks to keep the table small by chunking it and lerping between entries. Ah, memories!
Yes, linear interpolation is a possibility. Another could be to use a quadratic (or nth-order) function f that approximates atan2(x) when x is within the range you need, so that you get a really good approximation with a quick and dirty computation.
In particular, in integer or fixed-point math, any multiplications by constant powers of 2 in your equation can be done with bit shifts, which are blazingly fast.
I had to google it to see what "lerping" was supposed to mean, and the whole experience was too disappointing. The lerping and tweening slang appears to be a reinvention of basic concepts (parametrization) derived from ignorance and poor math literacy. It appears that some people failed to pay attention to intro to calculus courses and decided to just come up with new names for tired old concepts.
For myself, I learned these techniques and the corresponding slang as a kid following gamedev/demoscene tutorials and books like Graphics Gems, years before my mathematics education caught up. Calculus was 3rd year of high school (age 17ish); I had been familiar with the concepts from programming for 3 years already. It's a personal style thing, but the slang stuck with me and I still use it in adulthood. I'm somewhat pleased that this would cause somebody like you to think somebody like me has poor math literacy ;)
Not to be mean to the author, but this is an example of why it's generally useless to try to resort to assembly to beat a compiler unless you're an expert at assembly. The sort of case that's being optimized here--a straight-line piece of assembly--is the case where compilers are going to do better than hand-coded except in specific cases.
The first critical failure is using the source code to guide assembly optimization, rather than the original generated assembly. It means that you're going to miss what optimizations the compiler is doing to improve this, and missing those optimizations is going to turn out worse than the original.
There are two main optimizations that come to mind. The first is instruction scheduling. On newer CPUs (Haswell or later), you have two separate instruction ports for vector ALU instructions, so you can execute two instructions in parallel if there are no conflicts. If you interleave the instructions for computing the denominator and the numerator, you can take advantage of the twin execution ports and execute two instructions at once instead of one (you use more registers, though). This is the main optimization that I'm sure the Go library is getting.
The other possible optimization is SLP vectorization: the numerator and denominator are both polynomial evaluations, so they're doing the same operations on their data. This means you can stick both of them in one vector and use a single instruction to compute each step for numerator and denominator simultaneously. On the other hand, it's not immediately clear if the extra cost of vector formation is going to be worth the win in speed, particularly given that you can already schedule the two instruction streams in parallel via the first optimization. I don't know if Go is applying this optimization.
So if you're wondering why it's slower, it's almost certainly that the hand-written assembly is actually poorly scheduled.
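As a scalar illustration of the scheduling point (hypothetical coefficients; Go's math.FMA, available since Go 1.14, stands in for the FMA instruction): evaluating the numerator and denominator as two independent Horner chains gives the out-of-order core two dependency chains to overlap, instead of one long serial chain.

    // Go sketch; assumes import "math" and len(num) == len(den),
    // with coefficients ordered highest-degree first.
    func ratEval(x float64, num, den []float64) float64 {
        n, d := num[0], den[0]
        for i := 1; i < len(num); i++ {
            // the two FMAs share no data, so they can issue on
            // separate execution ports in the same cycle
            n = math.FMA(n, x, num[i])
            d = math.FMA(d, x, den[i])
        }
        return n / d
    }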
> So if you're wondering why it's slower, it's almost certainly that the hand-written assembly is actually poorly scheduled.
The assembly is slower because Go doesn't inline it. The overhead is obvious when calling one-op assembly blocks[1]. The proposal to inline assembly was rejected last year[2] by Russ Cox.
> but this is an example of why it's generally useless to try to resort to assembly to beat a compiler unless you're an expert at assembly.
Yes, I realized! Expert? Not even close. I barely know a thing or two in assembly. The fact that I even managed to make it run amazes me.
This was my first time venturing into the assembly land. It was just something which I wanted to give a shot and hopefully learn something at the end of the day, which I did.
The 2 optimizations that you are talking about - I am totally clueless about them :) It's totally possible that if whatever you are saying is done, we may see better gains. Hopefully, some good samaritan who is better at this than me will be willing to give this a shot and see where it gets us.
I dabbled a little in gaming with Unity. I discovered that atan2 is called a LOT. At least... I used it a lot.
Shouldn't something like this be implemented as a hardware level operation? Or is it? Or why can't it be? Or maybe it's not used nearly as much as I think?
AMD was far behind Intel (factor of 2-3x slower) in x87 implementation performance for a long time, and continues to be; no doubt that was one of the factors leading them to try to deprecate it when they designed AMD64.
Are you seriously trying to claim that AMD felt compelled to invent a whole new CPU architecture just because you believe it couldn't implement FPATAN right?
I suspect a polynomial approximation takes less time than either of these, since that's what most C standard library implementations use on the PC (I just checked Visual Studio 2017).
It isn't just "the first time" that is slow for the LUT solution.
Each time the table access hits a new cache line, there will be a stall. Say your LUT has 2048 64-bit entries, that's 256 cache lines. So the total stall time waiting for cache lines to be filled could be as bad as 210 cycles * 256 = 53760 cycles. In that amount of time, the std library function could have returned about 1000 results (assuming the std library takes 50 cycles).
And, as well as waiting for the table reads, the LUT based function has other work to do. If we are comparing performance to the std library function, it handles input from the entire range of IEEE 64-bit floats. It isn't practical to do that directly with a lookup table. You would need to clamp the range down to something manageable. Once that's done, if you want anything like decent accuracy you need to read two values from the table and do a linear interpolation between them. Or you could have a HUGE table instead of the lerp but it would have to be far too big to fit in L1 cache, and thus also quite slow.
My guess is a clamp and lerp'd lookup table will take ~10 cycles once the table is entirely in cache. So, I guess I'm claiming that the LUT would then be 5x faster than the std library.
I don't have a lot of confidence in that prediction though. There are lots of other things to worry about if we were to do this analysis properly, including: a) how the input pattern alters the effectiveness of the branch predictor, b) how exactly you are finding some input data to operate on without polluting the cache, c) whether we are measuring latency or throughput.
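For concreteness, here's roughly what the clamp-and-lerp lookup being costed above might look like (a sketch; table, lo, and hi are hypothetical, with table assumed to hold samples of the function at evenly spaced points across [lo, hi]):

    // Go sketch: clamp x into [lo, hi], then linearly interpolate
    // between the two neighboring table entries.
    func lutLerp(table []float64, lo, hi, x float64) float64 {
        if x <= lo {
            return table[0]
        }
        if x >= hi {
            return table[len(table)-1]
        }
        t := (x - lo) / (hi - lo) * float64(len(table)-1)
        i := int(t)
        return table[i] + (t-float64(i))*(table[i+1]-table[i])
    }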
Wow, this brings back vivid memories of late 80's, when I was assigned to write atan2 by hand on Multiflow's 28/14/7-wide VLIW machines.
(Everyone in the company who could hand-code (even us OS guys and gals ;-) was assigned a math routine--it was an all-hands-on-deck kind of performance situation in our competition with Alliant and Convex.)
Imagine trying to hand-schedule up to 28 operations at once, including all memory loads with 12 or 14 (forget now) clocks to memory, all bus contention handled by hand, etc, etc. What fun! That's why VLIW compilers were so challenging to write: all resources software-managed.
But, alas, in the end the mini-super market died. We all lost to Sun eventually, as everyone wanted his own machine, even if much less powerful--never bet against having the computer power closer to the user (starting with mainframes): minicomputer / workstation / PC / mobile / etc, right down the line to today.
(What's going to displace mobile, IoT? Doesn't seem likely. Perhaps when you reach the limit of what you can hold in your hand, that's the last form factor? I don't believe wearable computing will ever really catch on, but I'm old fustian.)
> I don't believe wearable computing will ever really catch on, but I'm old fustian.)
I used to think the same, but it seems to me that phones are the first step. A huge number of people have a mini computer on their persons all day. It's only a matter of time before it becomes even more integrated.
Will be interesting to see, but I doubt the current form factors will win. Smart watches seem so... not it.
And also, with wearables it's not so much bringing a device closer as building a personal network of devices. The phone may still have its place as the largest computing device in the personal network, but peripherals with their own small compute power and additional I/O get added to the network: smart watches (biometric inputs, small always-ready screen output), glasses (AR overlay output, line-of-sight and image recognition input), etc.
Dang. Props to the blogger for posting even though success wasn’t had. Hours on end doing assembly is a feat unto itself. I did not see the twist at the end, I was thinking for sure there’d be a huge speed up. But knowledge is knowledge.
I would love to see more posts like this. Seeing your thought process when approaching a problem, even if you don't end up solving the problem, or don't do it well, is valuable insight into how other people work, and how I can possibly integrate some of your methods to my own workflow.
I'm surprised it doesn't eliminate the BenchmarkAtan2 entirely:
* math.Atan2 is in the standard library, so it should know it has no side effects and can be replaced with an empty statement. Unless Go has a way to mark a function as not having side effects, no user-defined function would have that property.
* The only observable impact of the loop if math.Atan2 is removed is that b.N could be a larger size type than n, so it could loop forever depending on the values. If they're known to be the same type, it could eliminate the loop entirely.
Are you sure you're compiling with optimizations enabled?
I also wonder if that DIVSD instruction in yours might cause a floating point exception, which would be a different behavior compared to atan2.
I don't think that is correct. Do you have a link to more information about it?
When writing benchmarks in Go you always need to take steps to circumvent the optimizer detecting functions with no side effects. Assigning the result of a benchmark to a public package level variable is usually enough.
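Something like this (the operand values are made up; what matters is the package-level sink):

    // Go sketch; assumes import ("math"; "testing").
    var sink float64 // package-level, so the compiler can't prove the result is dead

    func BenchmarkAtan2(b *testing.B) {
        var r float64
        for n := 0; n < b.N; n++ {
            r = math.Atan2(-479.0, 123.0) // hypothetical inputs
        }
        sink = r // store once so the loop body can't be removed
    }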
> I'm sorry, could you clarify what you mean by "eliminate" here?
Most compilers will eliminate code if they can tell the results of a calculation are never used. Benchmarks typically call some function and discard the result. If you aren't careful, the compiler can often optimize away your entire benchmark down to nothing.
This is why you often see benchmarks that look like:
    #include <stdio.h>

    int average(int a, int b) {
        return (a + b) / 2;
    }

    int main(void) {
        int sum = 0;
        for (int i = 0; i < 100000; i++) {
            for (int j = 0; j < 100000; j++) {
                sum += average(i, j);
            }
        }
        if (sum != 123) printf("!"); /* result is used, so the loops can't be eliminated */
        return 0;
    }
Instead of:
    int average(int a, int b) {
        return (a + b) / 2;
    }

    int main(void) {
        for (int i = 0; i < 100000; i++) {
            for (int j = 0; j < 100000; j++) {
                average(i, j); /* result discarded: the whole loop is dead code */
            }
        }
        return 0;
    }
I don't know about go specifically, but in general if a compiler sees that the result of an expression is thrown away (like here: you don't use it anywhere), and there are no side effects to the expression, then it's okay to replace it with a noop. Usually it's better to write benchmarks in a way that the result of the mathematical expressions is used, like if you kept a running sum of `atan2(rand(), rand())` and then printed it.
Some compilers are smart enough to not run code that doesn't have any side effects. (Running a function a million times in a row but never doing anything with the results)
Hopefully golang bench is smart enough to not be that smart.
I'm no optimisation expert, but I'm wondering if the FMAs are slow because the result of each one is dependent on the previous one? The dependency on the result may mean that the processor can't pipeline the operations. Could it be faster if the two chains of FMAs on either side of the division are interleaved and use different registers?
It will definitely be faster on the newest Intel processors, which have dedicated FMA units.
In fact, to max out floating point on those chips, you'd likely have to intermix FMAs into the normal FP stream (i.e., FMA by 1, or FMA of something plus 0).
In C/C++ it would be much easier to call the intrinsic functions which represent the fma instructions. The compiler would then be able to inline and optimize the whole.
In C or C++ you should be able to use the stdlib `fma( )` function and have the compiler lower the call to the instruction, if it isn't completely out to lunch. No intrinsics should be necessary.
Alternatively, set `#pragma STDC FP_CONTRACT ON` and just write `x*y + z`.
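And in the Go of the original post, the closest equivalent (assuming a toolchain from Go 1.14 onward) is math.FMA, which the compiler lowers to the hardware instruction where one exists:

    // Go sketch; assumes import "math" and Go >= 1.14.
    func fusedMulAdd(x, y, z float64) float64 {
        return math.FMA(x, y, z) // x*y + z with a single rounding
    }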
I think the primary thing you are missing to get the speed improvement is actually doing the packed computations (e.g. computing two values at once; e.g. the numerator and denominator).
[1] http://www.embedded.com/design/other/4216719/Performing-effi...
[2] https://stackoverflow.com/questions/6118028/fast-hyperbolic-...
[3] https://www.dsprelated.com/showarticle/1052.php