
I'm no optimisation expert, but I'm wondering if the FMAs are slow because each one depends on the result of the previous one? That dependency may mean the processor can't pipeline the operations. Could it be faster if the two chains of FMAs on either side of the division were interleaved and used different registers?

z := x * x

z = z * fma(fma(fma(fma(P0, z, P1), z, P2), z, P3), z, P4) / fma(fma(fma(fma((z+Q0), z, Q1), z, Q2), z, Q3), z, Q4)

z = fma(x,z,x)
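To make the interleaving idea concrete, here is a rough sketch in Go using math.FMA. The coefficients are made-up placeholders (not the ones from the article), and the function names chained and interleaved are just for illustration. Both versions do the same operations in the same order within each chain, so they should give identical results. Whether the second one is actually faster is an open question: the compiler and an out-of-order CPU may already overlap the two independent chains, so it would need benchmarking.

```go
package main

import (
	"fmt"
	"math"
)

// Placeholder coefficients, assumed for illustration only.
// They are NOT the real polynomial coefficients from the article.
const (
	P0, P1, P2, P3, P4 = 0.5, 0.25, 0.125, 0.0625, 0.03125
	Q0, Q1, Q2, Q3, Q4 = 1.5, 1.25, 1.125, 1.0625, 1.03125
)

// chained mirrors the expression above: each polynomial is one
// nested chain in which every FMA waits on the previous one.
func chained(x float64) float64 {
	z := x * x
	z = z * math.FMA(math.FMA(math.FMA(math.FMA(P0, z, P1), z, P2), z, P3), z, P4) /
		math.FMA(math.FMA(math.FMA(math.FMA(z+Q0, z, Q1), z, Q2), z, Q3), z, Q4)
	return math.FMA(x, z, x)
}

// interleaved keeps the numerator (p) and denominator (q) in
// separate variables and alternates between them, so the two
// dependency chains are visibly independent of each other.
func interleaved(x float64) float64 {
	z := x * x
	p := math.FMA(P0, z, P1)
	q := math.FMA(z+Q0, z, Q1)
	p = math.FMA(p, z, P2)
	q = math.FMA(q, z, Q2)
	p = math.FMA(p, z, P3)
	q = math.FMA(q, z, Q3)
	p = math.FMA(p, z, P4)
	q = math.FMA(q, z, Q4)
	z = z * p / q
	return math.FMA(x, z, x)
}

func main() {
	// Compare the two versions on a few sample inputs.
	for _, x := range []float64{0.1, 0.3, 0.5} {
		fmt.Println(x, chained(x) == interleaved(x))
	}
}
```

Each chain is still four dependent FMAs long, so the best case is that the two chains run side by side rather than one after the other.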

This article is quite the nerd-snipe.




It will definitely be faster on the newest Intel processors, which have dedicated FMA units. In fact, to max out floating-point throughput on them, you'd likely have to mix FMAs into the normal FP stream (i.e. an FMA that multiplies by 1, or an FMA that adds 0).



