Although this post is interesting, the best way to optimize trig functions is generally via rational approximations[1] if you can live with the error (and the clamping). I remember finding a rational function to approximate a tanh-like soft clipper via a Padé approximator on (-1, 1) a few years ago[2]. See an equivalent rational approximation for atan2 here[3].
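For flavor, here's a minimal Go sketch of that kind of rational soft clipper. It uses the standard [3/2] Padé approximant of tanh about 0 (not the exact function from [2]), with the output clamped to [-1, 1] — the clamping mentioned above:

    // Go sketch; assumes import "math".
    // softClip uses the [3/2] Padé approximant of tanh,
    // x(15 + x^2)/(15 + 6x^2), which matches tanh through the x^5 term.
    // The approximant is strictly increasing, so clamping the output
    // is enough to keep it in [-1, 1].
    func softClip(x float64) float64 {
        x2 := x * x
        t := x * (15 + x2) / (15 + 6*x2)
        return math.Max(-1, math.Min(1, t))
    }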
A bit OT but my favorite way to approximate tanh is to do g(f(x)) where f(x) is an odd polynomial and g(x) is 1/sqrt(1 + x^2). The recip sqrt can be computed very quickly using a Newton iteration, and many SIMD instruction sets have an instruction for it. Highly recommended for audio applications.
Something is definitely off about your recollection here. g(x) = 1/√(1+x²) is always positive, whereas tanh(x) is negative for negative values of x. So regardless of your f, g(f(x)) can’t ever be a correct approximation for negative x.
Oops, yes, you're correct, I was typing from memory and didn't double-check.
And yes, the Taylor series is pretty good, but you can usually squeeze a bit more accuracy-over-range by, e.g., using an optimizer for the coefficients. Also, note that in your notebook you have x^3/5, but x^3/6 is probably what you wanted.
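For the record, a minimal sketch of the corrected recipe: since tanh(x) = sinh(x)/√(1 + sinh(x)²), the odd polynomial f should approximate sinh — here just its truncated Taylor series x + x³/6 as mentioned above. A tuned implementation would optimize the coefficients and swap math.Sqrt for a hardware rsqrt estimate plus a Newton step:

    // Go sketch; assumes import "math".
    // fastTanh approximates tanh(x) as f(x)/sqrt(1 + f(x)^2), where f is
    // an odd polynomial standing in for sinh.
    func fastTanh(x float64) float64 {
        f := x + x*x*x/6.0 // truncated Taylor series of sinh(x)
        return f / math.Sqrt(1+f*f)
    }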
Here's what atan looks like in openlibm: https://github.com/JuliaLang/openlibm/blob/1581174c85f7b645b... It's a degree-4 polynomial, with one division in the atan2->atan reduction, and at most one division in the argument reduction step in atan, giving a pretty accurate answer. By comparison, [3] uses a degree-3 polynomial and doesn't work outside [-1,1]. There is some kind of tradeoff between accuracy and performance, but most of the time it's not worth it. Maybe if division is ludicrously expensive or something unusual like that. I suspect most of the performance gains people are looking for come from omitting proper handling of special cases.
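To illustrate the structure (these are not openlibm's coefficients — the core is the well-known quadratic fit x·(π/4 + 0.273·(1 − |x|)), accurate to roughly 0.005 rad): a polynomial on [-1, 1] plus the identity atan(x) + atan(1/x) = ±π/2 for the reduction, costing at most the one division mentioned above.

    // Go sketch; assumes import "math".
    // approxAtan is only valid on [-1, 1]; illustrative coefficients.
    func approxAtan(z float64) float64 {
        return z * (math.Pi/4 + 0.273*(1-math.Abs(z)))
    }

    // fastAtan extends it to all finite inputs with one division.
    func fastAtan(x float64) float64 {
        if math.Abs(x) > 1 {
            return math.Copysign(math.Pi/2, x) - approxAtan(1/x)
        }
        return approxAtan(x)
    }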
Hmmm... I'm a bit foggy on the specific quote, but in the Graphics Programming Black Book, specifically the part where Abrash covers writing fast code, he says something along the lines of:
> Avoid special cases, because a special case means slow code
So yeah, handling of special cases will bite you on the performance side. Sometimes it's inevitable, but if you have a lot of control over (or are very specific about) what values get passed as parameters, then you can get away with not handling special cases (or just handling the ones you care about).
Yeah, but the people talking about finding their own interpolating polynomials and Padé approximations are looking for those approximations in the regular common cases, and omitting the special cases happens almost by accident. You could just take the same approximation used by openlibm or whatever, and it's likely better than anything you can come up with yourself.
I actually wanted a fast atan2 about 3 years ago in Java and held a small competition to see what people could come up with [1]. In plain old Java, it looks like it performed very well; I would expect it outperformed what they did in the post using LUTs. My methodology for micro benchmarks was pretty awful back then, so there's room for error. The fastest entry was just 3 branches and a memory lookup.
Additional magic is involved -- Math.atan2 is an intrinsic method in HotSpot, which will use a hand-coded implementation instead of the Java definition when possible: https://gist.github.com/apangin/7a9b7062a4bd0cd41fcc
Math.Atan2 was well outperformed by the custom implementations in the post I linked. The native version isn't well optimized. Apache's Atan2 is just as accurate and about twice as fast for example.
It should be noted that if you are computing sin/cos after atan2, then it's likely that what you are doing could be done with some regular vector manipulation (normalisation, possibly dot product, cross product).
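Concretely, something like this (a hedged sketch — whether it applies depends on what the angle is actually used for):

    // Go sketch; assumes import "math".
    // Instead of a := math.Atan2(y, x) followed by math.Sin(a) and
    // math.Cos(a), normalize the vector directly: no trig calls at all.
    func sinCos(y, x float64) (s, c float64) {
        r := math.Hypot(x, y) // length of (x, y); note (0, 0) yields NaNs
        return y / r, x / r
    }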
It's a flight-tracking algorithm which calculates the velocity of the aircraft from the messages it broadcasts.
EDIT: > then it's likely that what you are doing could be done with some regular vector manipulation (normalisation, possibly dot product, cross product).
Yes, totally. Like I said, I have zero knowledge of assembly. And the "vector manipulation" you're talking about is totally beyond me ;) Just doing this part took a lot out of me. :D
Maybe someone knowledgeable enough will be kind enough to write an improved implementation.
Sorry, I might have misunderstood something. But OP said - " if you are computing sin/cos after atan2,", and I don't need any sin/cos computation after atan2. Just the atan2.
The assembly version is using the packed version of the FMA instruction (that's what the "P" in the mnemonic stands for), but as far as I can tell, it's only using one of the packed values, whereas the instruction can calculate two (AVX) or four (AVX2) FMA operations at once. It might be possible to get some speedup by rearranging the calculation so it can use the full width of the vector registers - at first glance, at least the two sides of the division should be possible to calculate in parallel with half as many FMA instructions.
If anyone was curious why the instruction address didn't match up, it's because he read the instruction identifier wrong. It's also totally counterintuitive and you have to essentially read the full documentation to get to any real information about it. Anyway here we go. VEX is a prefix notation with specialised encodings:
C4 - For a three-byte instruction (which vfmadd is) this is C4; for a two-byte instruction it would be C5.
E2 - This one is more involved, but basically for this instruction the first three bits are 1, setting some modes (R, X, and B). Then, because this is a 0F38H instruction, the next 5 bits are 00010. Altogether you get 11100010 (E2).
E9 - This is also involved, but the first bit is the W mode. It's 1 for the instruction he used, but it's pretty much ignorable. Bits 2-5 encode the first source register the instruction uses; this is 1101, the inverted register number for XMM2/YMM2. Bit 6 is set to 0 because it's a 128-bit vector (it is). Bits 7-8 are set to 01 based on a lookup table for the "pp" value, which is 66 in the instruction identifier. Altogether that's 11101001 (E9).
A8 - this one is easy, it's the opcode.
C3 - This is the ModR/M byte, which in this case reports the register/memory operands. The first 2 bits are 11 to indicate the operand is the register itself (not displaced or truncated). The next 6 bits are the register codes: 000 encodes XMM0 (the destination), and 011 encodes XMM3 (the remaining source).
Altogether it works out to be C4E2E9A8C3. Not intuitive at all really.
Edit: Please someone correct me if I missed something. I hate this kind of stuff and I'm sure I made a mistake.
> "The reason for optimizing the math.Atan2 function is because my current work involves performing some math calculations. And the math.Atan2 call was in the hot path. "
He should take a look at:
a) the range of x which will be used for atan2(x)
b) the level of precision he really needs
There is a chance he could get away with precomputing an atan2(x) table in memory and then just reading the values off the table.
Quick as lightning, and an old trick, probably invented in the 1950s!!
EDIT: I see I am potentially wrong, due to slower DRAM access compared to some built-in FP instructions...
> There is a chance he could get away with precomputing an atan2(x) table in memory and then just reading the values off the table.
Quick as lightning, and an old trick, probably invented in the 1950s!!
I actually tried this for cos and sin lookups, and lerping between datapoints.
I recently needed an atan2(x, y) function to decode a quadrature sine signal from a position sensor, in C on a microcontroller. I started with a single divide to get x/y or y/x (some branching to decide which) followed by a table lookup. I only needed about 9 bits of precision. It was a strictly amateur hack job, but surprisingly fast.
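A hedged Go sketch of that divide-then-lookup shape (the 9-bit table and the octant fixup are my reconstruction, not the original C, and the y == x == 0 case is ignored in the spirit of the thread):

    // Go sketch; assumes import "math".
    const tableBits = 9 // ~9 bits of precision, as above
    const tableSize = 1 << tableBits

    var atanTable [tableSize + 1]float64

    func init() {
        for i := range atanTable {
            atanTable[i] = math.Atan(float64(i) / tableSize) // ratios in [0, 1]
        }
    }

    func lutAtan2(y, x float64) float64 {
        ax, ay := math.Abs(x), math.Abs(y)
        // one division, with branching so the ratio lands in [0, 1]
        var r float64
        if ay < ax {
            r = ay / ax
        } else {
            r = ax / ay
        }
        a := atanTable[int(r*tableSize+0.5)] // nearest entry, no lerp
        if ay >= ax {
            a = math.Pi/2 - a // the lookup gave the complementary angle
        }
        if x < 0 {
            a = math.Pi - a
        }
        if y < 0 {
            a = -a
        }
        return a
    }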
These old tricks are still pretty effective, especially if the table is small enough to be cached on the CPU... then there are even more tricks to keep the table small by chunking it and lerping between entries. Ah, memories!
Yes, linear interpolation is a possibility. Another could be to use a quadratic (or nth-order) function f that approximates atan2(x) when x is within the range you need, so that you get a really good approximation with a quick and dirty computation.
In particular, in integer or fixed-point math, any multiplications by constant powers of 2 in your equation can be done with bit shifts, which are blazingly fast.
I had to google it to see what "lerping" was supposed to mean, and the whole experience was too disappointing. The lerping and tweening slang appears to be a reinvention of basic concepts (parametrization) derived from ignorance and poor math literacy. It appears that some people failed to pay attention to intro to calculus courses and decided to just come up with new names for tired old concepts.
For myself, I learned these techniques and the corresponding slang as a kid following gamedev/demoscene tutorials and books like Graphics Gems, years before my mathematics education caught up. Calculus was 3rd year of high school (age 17ish); I had been familiar with the concepts from programming for 3 years already. It's a personal style thing, but the slang stuck with me and I still use it in adulthood. I'm somewhat pleased that this would cause somebody like you to think somebody like me has poor math literacy ;)
Not to be mean to the author, but this is an example of why it's generally useless to try to resort to assembly to beat a compiler unless you're an expert at assembly. The sort of case that's being optimized here--a straight-line piece of assembly--is the case where compilers are going to do better than hand-coded except in specific cases.
The first critical failure is using the source code to guide assembly optimization, rather than the original generated assembly. It means that you're going to miss what optimizations the compiler is doing to improve this, and missing those optimizations is going to turn out worse than the original.
There are two main optimizations that come to mind. The first is instruction scheduling. On newer CPUs (Haswell or later), you have two separate instruction ports for vector ALU instructions, so you can execute two instructions in parallel if there are no conflicts. If you interleave the instructions for computing the denominator and the numerator, you can take advantage of the twin execution ports and execute two instructions at once instead of one (you use more registers, though). This is the main optimization that I'm sure the Go library is getting.
The other possible optimization is SLP vectorization: the numerator and denominator are both polynomial evaluations, so they're doing the same operations on their data. This means you can stick both of them in one vector and use a single instruction to compute each step for numerator and denominator simultaneously. On the other hand, it's not immediately clear if the extra cost of vector formation is going to be worth the win in speed, particularly given that you can already schedule the two instruction streams in parallel via the first optimization. I don't know if Go is applying this optimization.
So if you're wondering why it's slower, it's almost certainly that the hand-written assembly is actually poorly scheduled.
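As a scalar illustration of the scheduling point (hypothetical coefficients; Go's math.FMA, available since Go 1.14, stands in for the FMA instruction): evaluating the numerator and denominator as two independent Horner chains gives the out-of-order core two dependency chains to overlap, instead of one long serial chain.

    // Go sketch; assumes import "math" and len(num) == len(den),
    // with coefficients ordered highest-degree first.
    func ratEval(x float64, num, den []float64) float64 {
        n, d := num[0], den[0]
        for i := 1; i < len(num); i++ {
            // the two FMAs share no data, so they can issue on
            // separate execution ports in the same cycle
            n = math.FMA(n, x, num[i])
            d = math.FMA(d, x, den[i])
        }
        return n / d
    }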
> So if you're wondering why it's slower, it's almost certainly that the hand-written assembly is actually poorly scheduled.
The assembly is slower because Go doesn't inline it. The overhead is obvious when calling one-op assembly blocks[1]. The proposal to inline assembly was rejected last year[2] by Russ Cox.
> but this is an example of why it's generally useless to try to resort to assembly to beat a compiler unless you're an expert at assembly.
Yes, I realized! Expert? Not even close. I barely know a thing or two in assembly. The fact that I even managed to make it run amazes me.
This was my first time venturing into the assembly land. It was just something which I wanted to give a shot and hopefully learn something at the end of the day, which I did.
The 2 optimizations that you are talking about - I am totally clueless about them :) It's totally possible that if whatever you are saying is done, we may see better gains. Hopefully, some good samaritan who is better at this than me will be willing to give this a shot and see where it gets us.
I dabbled a little in gaming with Unity. I discovered that atan2 is called a LOT. At least... I used it a lot.
Shouldn't something like this be implemented as a hardware level operation? Or is it? Or why can't it be? Or maybe it's not used nearly as much as I think?
AMD was far behind Intel (factor of 2-3x slower) in x87 implementation performance for a long time, and continues to be; no doubt that was one of the factors leading them to try to deprecate it when they designed AMD64.
Are you seriously trying to claim that AMD felt compelled to invent a whole new CPU architecture just because you believe it couldn't implement FPATAN right?
I suspect a polynomial approximation takes less time than either of these, since that's what most C standard library implementations use on the PC (I just checked Visual Studio 2017).
It isn't just "the first time" that is slow for the LUT solution.
Each time the table access hits a new cache line, there will be a stall. Say your LUT has 2048 64-bit entries, that's 256 cache lines. So the total stall time waiting for cache lines to be filled could be as bad as 210 cycles * 256 = 53760 cycles. In that amount of time, the std library function could have returned about 1000 results (assuming the std library takes 50 cycles).
And, as well as waiting for the table reads, the LUT based function has other work to do. If we are comparing performance to the std library function, it handles input from the entire range of IEEE 64-bit floats. It isn't practical to do that directly with a lookup table. You would need to clamp the range down to something manageable. Once that's done, if you want anything like decent accuracy you need to read two values from the table and do a linear interpolation between them. Or you could have a HUGE table instead of the lerp but it would have to be far too big to fit in L1 cache, and thus also quite slow.
My guess is a clamp and lerp'd lookup table will take ~10 cycles once the table is entirely in cache. So, I guess I'm claiming that the LUT would then be 5x faster than the std library.
I don't have a lot of confidence in that prediction though. There are lots of other things to worry about if we were to do this analysis properly, including: a) how the input pattern alters the effectiveness of the branch predictor, b) how exactly you are finding some input data to operate on without polluting the cache, c) whether we are measuring latency or throughput.
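For concreteness, here's roughly what the clamp-and-lerp lookup being costed above might look like (a sketch; table, lo, and hi are hypothetical, with table assumed to hold samples of the function at evenly spaced points across [lo, hi]):

    // Go sketch: clamp x into [lo, hi], then linearly interpolate
    // between the two neighboring table entries.
    func lutLerp(table []float64, lo, hi, x float64) float64 {
        if x <= lo {
            return table[0]
        }
        if x >= hi {
            return table[len(table)-1]
        }
        t := (x - lo) / (hi - lo) * float64(len(table)-1)
        i := int(t)
        return table[i] + (t-float64(i))*(table[i+1]-table[i])
    }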
Wow, this brings back vivid memories of late 80's, when I was assigned to write atan2 by hand on Multiflow's 28/14/7-wide VLIW machines.
(Everyone in the company who could hand-code (even us OS guys and gals ;-) was assigned a math routine--it was an all-hands-on-deck kind of performance situation in our competition with Alliant and Convex.)
Imagine trying to hand-schedule up to 28 operations at once, including all memory loads with 12 or 14 (forget now) clocks to memory, all bus contention handled by hand, etc, etc. What fun! That's why VLIW compilers were so challenging to write: all resources software-managed.
But, alas, in the end the mini-super market died. We all lost to Sun eventually, as everyone wanted his own machine, even if much less powerful--never bet against having the computer power closer to the user (starting with mainframes): minicomputer / workstation / PC / mobile / etc, right down the line to today.
(What's going to displace mobile, IoT? Doesn't seem likely. Perhaps when you reach the limit of what you can hold in your hand, that's the last form factor? I don't believe wearable computing will ever really catch on, but I'm old fustian.)
> I don't believe wearable computing will ever really catch on, but I'm old fustian.)
I used to think the same, but it seems to me that phones are the first step. A huge number of people have a mini computer on their persons all day. It's only a matter of time before it becomes even more integrated.
Will be interesting to see, but I doubt the current form factors will win. Smart watches seem so... not it.
And also, with wearables it's not so much bringing a device closer as building a personal network of devices. The phone may still have its place as the largest computing device in the personal network, but peripherals with their own small compute power and additional I/O get added to the network: smart watches (biometric inputs, small always-ready screen output), glasses (AR overlay output, line-of-sight and image recognition input), etc.
Dang. Props to the blogger for posting even though success wasn’t had. Hours on end doing assembly is a feat unto itself. I did not see the twist at the end, I was thinking for sure there’d be a huge speed up. But knowledge is knowledge.
I would love to see more posts like this. Seeing your thought process when approaching a problem, even if you don't end up solving the problem, or don't do it well, is valuable insight into how other people work, and how I can possibly integrate some of your methods to my own workflow.
I'm surprised it doesn't eliminate the BenchmarkAtan2 entirely:
* math.Atan2 is in the standard library, so it should know it has no side effects and can be replaced with an empty statement. Unless Go has a way to mark a function as not having side effects, no user-defined function would have that property.
* The only observable impact of the loop if math.Atan2 is removed is that b.N could be a larger size type than n, so it could loop forever depending on the values. If they're known to be the same type, it could eliminate the loop entirely.
Are you sure you're compiling with optimizations enabled?
I also wonder if that DIVSD instruction in yours might cause a floating point exception, which would be a different behavior compared to atan2.
I don't think that is correct. Do you have a link to more information about it?
When writing benchmarks in Go you always need to take steps to circumvent the optimizer detecting functions with no side effects. Assigning the result of a benchmark to a public package level variable is usually enough.
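Something like this (the operand values are made up; what matters is the package-level sink):

    // Go sketch; assumes import ("math"; "testing").
    var sink float64 // package-level, so the compiler can't prove the result is dead

    func BenchmarkAtan2(b *testing.B) {
        var r float64
        for n := 0; n < b.N; n++ {
            r = math.Atan2(-479.0, 123.0) // hypothetical inputs
        }
        sink = r // store once so the loop body can't be removed
    }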
> I'm sorry, could you clarify what you mean by "eliminate" here?
Most compilers will eliminate code if they can tell the results of a calculation are never used. Benchmarks typically call some function and discard the result. If you aren't careful, the compiler can often optimize away your entire benchmark down to nothing.
This is why you often see benchmarks that look like:
    #include <stdio.h>

    int average(int a, int b) {
        return (a + b) / 2;
    }

    int main(void) {
        int sum = 0;
        for (int i = 0; i < 100000; i++) {
            for (int j = 0; j < 100000; j++) {
                sum += average(i, j);
            }
        }
        if (sum != 123) printf("!"); /* result is used, so the loops can't be eliminated */
        return 0;
    }
Instead of:
    int average(int a, int b) {
        return (a + b) / 2;
    }

    int main(void) {
        for (int i = 0; i < 100000; i++) {
            for (int j = 0; j < 100000; j++) {
                average(i, j); /* result discarded: the whole loop is dead code */
            }
        }
        return 0;
    }
I don't know about go specifically, but in general if a compiler sees that the result of an expression is thrown away (like here: you don't use it anywhere), and there are no side effects to the expression, then it's okay to replace it with a noop. Usually it's better to write benchmarks in a way that the result of the mathematical expressions is used, like if you kept a running sum of `atan2(rand(), rand())` and then printed it.
Some compilers are smart enough to not run code that doesn't have any side effects. (Running a function a million times in a row but never doing anything with the results)
Hopefully golang bench is smart enough to not be that smart.
I'm no optimisation expert, but I'm wondering if the FMAs are slow because the result of each one is dependent on the previous one? The dependency on the result may mean that the processor can't pipeline the operations. Could it be faster if the two chains of FMAs on either side of the division are interleaved and use different registers?
It will definitely be faster on the newest Intel processors, which have dedicated FMA units.
In fact, to max out floating point on those chips, you'd likely have to intermix FMAs into the normal FP stream (i.e., FMA by 1, or FMA of something plus 0).
In C/C++ it would be much easier to call the intrinsic functions which represent the fma instructions. The compiler would then be able to inline and optimize the whole.
In C or C++ you should be able to use the stdlib `fma( )` function and have the compiler lower the call to the instruction, if it isn't completely out to lunch. No intrinsics should be necessary.
Alternatively, set `#pragma STDC FP_CONTRACT ON` and just write `x*y + z`.
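And in the Go of the original post, the closest equivalent (assuming a toolchain from Go 1.14 onward) is math.FMA, which the compiler lowers to the hardware instruction where one exists:

    // Go sketch; assumes import "math" and Go >= 1.14.
    func fusedMulAdd(x, y, z float64) float64 {
        return math.FMA(x, y, z) // x*y + z with a single rounding
    }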
I think the primary thing you are missing to get the speed improvement is actually doing the packed computations (e.g. computing two values at once; e.g. the numerator and denominator).
[1] http://www.embedded.com/design/other/4216719/Performing-effi...
[2] https://stackoverflow.com/questions/6118028/fast-hyperbolic-...
[3] https://www.dsprelated.com/showarticle/1052.php