
I think in the case that you're willing to trade some accuracy for speed, you'd just use a table lookup instead.



Not necessarily. That's OK if you can guarantee that the table is in cache. But if it isn't, that's probably the slowest way.

A DRAM read takes ~210 cycles (assuming a 60 ns access time and a 3.5 GHz CPU clock: 60 ns × 3.5 cycles/ns = 210 cycles).

FPATAN takes ~50 to 200 cycles, depending on the operand value and the CPU model (see http://www.agner.org/optimize/instruction_tables.pdf).

I suspect a polynomial approximation takes less time than either of these, since that's what most C standard library implementations use on PC (I just checked Visual Studio 2017).
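For illustration, here's a minimal sketch of the polynomial approach: an odd polynomial on [-1, 1], plus the identity atan(x) = sign(x)·pi/2 − atan(1/x) for |x| > 1. The coefficients below are plain Taylor terms kept short for readability; a real libm uses minimax coefficients (and more of them) for far better worst-case error, especially near |x| = 1.

    #include <math.h>

    #ifndef M_PI_2
    #define M_PI_2 1.57079632679489661923
    #endif

    static double atan_poly(double x)
    {
        int invert = fabs(x) > 1.0;
        if (invert)
            x = 1.0 / x;  /* reduce to [-1, 1] */

        /* x - x^3/3 + x^5/5 - x^7/7, evaluated via Horner's rule. */
        double x2 = x * x;
        double r = x * (1.0 + x2 * (-1.0 / 3 + x2 * (1.0 / 5 + x2 * (-1.0 / 7))));

        if (invert)
            r = copysign(M_PI_2, x) - r;  /* x = 1/original keeps the sign */
        return r;
    }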


> But if it isn't, that's probably the slowest way.

It's probably the slowest way the first time, but if you call atan2 in a loop, won't it be much faster?


It isn't just "the first time" that is slow for the LUT solution.

Each time the table access hits a new cache line, there will be a stall. Say your LUT has 2048 64-bit entries; with 64-byte cache lines, that's 256 cache lines. So the total stall time waiting for cache line fills could be as bad as 210 cycles * 256 = 53760 cycles. In that amount of time, the std library function could have returned about 1000 results (assuming the std library takes 50 cycles per call).

And, as well as waiting for the table reads, the LUT-based function has other work to do. If we are comparing against the std library function, remember that it handles inputs from the entire range of IEEE 64-bit floats. It isn't practical to cover that range directly with a lookup table, so you would need to clamp the input down to something manageable. Once that's done, if you want anything like decent accuracy you need to read two values from the table and do a linear interpolation between them (see the sketch below). Or you could have a HUGE table instead of the lerp, but it would have to be far too big to fit in L1 cache, and thus also quite slow.
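To make that concrete, here's a minimal sketch of the clamp-and-lerp table. The [-8, 8] clamp range and the 2048-entry size are arbitrary illustration values, not tuned for any particular accuracy target; a real implementation would pick both to meet its error budget.

    #include <math.h>

    #define TABLE_SIZE 2048
    #define CLAMP_MIN  -8.0
    #define CLAMP_MAX   8.0

    static double lut[TABLE_SIZE + 1];  /* +1 so lut[i + 1] stays in bounds */

    static void lut_init(void)
    {
        for (int i = 0; i <= TABLE_SIZE; i++) {
            double x = CLAMP_MIN + (CLAMP_MAX - CLAMP_MIN) * i / TABLE_SIZE;
            lut[i] = atan(x);
        }
    }

    static double atan_lut(double x)
    {
        /* Clamp the full double range down to the tabulated interval. */
        if (x < CLAMP_MIN) x = CLAMP_MIN;
        if (x > CLAMP_MAX) x = CLAMP_MAX;

        /* Map x to a fractional table index, then lerp between neighbours. */
        double pos = (x - CLAMP_MIN) * (TABLE_SIZE / (CLAMP_MAX - CLAMP_MIN));
        int i = (int)pos;
        if (i > TABLE_SIZE - 1)
            i = TABLE_SIZE - 1;
        double frac = pos - i;
        return lut[i] + frac * (lut[i + 1] - lut[i]);
    }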

My guess is a clamp and lerp'd lookup table will take ~10 cycles once the table is entirely in cache. So, I guess I'm claiming that the LUT would then be 5x faster than the std library.

I don't have a lot of confidence in that prediction though. There are lots of other things to worry about if we were to do this analysis properly, including: a) how the input pattern alters the effectiveness of the branch predictor, b) how exactly you are finding some input data to operate on without polluting the cache, c) whether we are measuring latency or throughput.
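On point (c), a minimal sketch of how one might separate the two in a micro-benchmark: independent calls measure throughput (the CPU can overlap them), while a dependency chain measures latency. This deliberately ignores points (a) and (b), and the input data and timer choice are simplifications.

    #include <stdio.h>
    #include <time.h>
    #include <math.h>

    #define N 1000000

    static double data[N];

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            data[i] = (i - N / 2) * 1e-3;

        /* Throughput: independent calls, free to overlap. */
        double t0 = now_sec(), sum = 0;
        for (int i = 0; i < N; i++)
            sum += atan(data[i]);
        double t1 = now_sec();

        /* Latency: each call depends on the previous result,
         * so the calls serialize into a dependency chain. */
        double x = 1.0;
        for (int i = 0; i < N; i++)
            x = atan(x + data[i]);
        double t2 = now_sec();

        printf("throughput: %.1f ns/call (sum=%g)\n", (t1 - t0) / N * 1e9, sum);
        printf("latency:    %.1f ns/call (x=%g)\n",   (t2 - t1) / N * 1e9, x);
        return 0;
    }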


Lookup tables are often slower on modern hardware than just recalculating the value (if there's a cache miss).



