> In software, you'd normally iterate Newton's Method
Software normally computes trigonometric functions (and other complicated ones like exponents and std::erf) with a high-degree polynomial approximation.
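For a concrete feel of what that looks like: a degree-7 odd polynomial already gets sin() to within ~1e-4 on [-pi/2, pi/2]. This sketch uses plain Taylor coefficients for readability; real libm implementations use minimax-fitted coefficients plus careful range reduction, so treat it as an illustration only:

```c
#include <math.h>

/* Illustrative sketch: degree-7 odd polynomial for sin(x) on [-pi/2, pi/2],
   evaluated in Horner form. Coefficients here are the plain Taylor terms
   x - x^3/3! + x^5/5! - x^7/7!; production code would use minimax ones. */
static double sin_poly(double x) {
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0
              + x2 * (1.0 / 120.0
              + x2 * (-1.0 / 5040.0))));
}
```

Inputs outside [-pi/2, pi/2] would first be folded into that interval (range reduction), which is where much of the real implementation effort goes.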
> how does that hardware implementation work internally?
I don’t know, but based on the performance difference between the FP32 and FP64 square-root instructions, the implementation probably produces 4-5 mantissa bits per cycle.
Interesting that your linked algorithm manages to avoid costly divisions, but it uses an even longer loop than Newton's Method -- one iteration for every 2 bits. NM converges quadratically, doubling the number of correct bits each time, so a 32-bit number won't ever need more than 5 iterations, assuming >= 1 bit of accuracy in the initial estimate, which is easy to obtain just by shifting right by half the position of the highest set bit.
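A minimal sketch of that scheme in C, assuming IEEE-754 doubles: the initial guess comes from halving the exponent via a bit trick (the "shift right by half the highest-bit offset" idea), then a handful of Newton steps converge to full precision. Illustrative only, not production code:

```c
#include <stdint.h>
#include <string.h>

/* Newton's Method for sqrt(a), a > 0 and normalized.
   Initial guess: halve the biased exponent by shifting the raw bits right
   one place and re-adding half the exponent bias (1023 << 51). That lands
   within a factor of ~2 of the true root, i.e. >= 1 correct bit. */
static double sqrt_newton(double a) {
    uint64_t bits;
    memcpy(&bits, &a, sizeof bits);
    bits = (bits >> 1) + (1023ULL << 51);
    double x;
    memcpy(&x, &bits, sizeof x);
    /* Each step x' = (x + a/x)/2 roughly doubles the correct bits, so a
       few iterations reach full double precision from the rough guess. */
    for (int i = 0; i < 6; i++)
        x = 0.5 * (x + a / x);
    return x;
}
```

With >= 1 correct bit to start, the correct-bit count goes 1, 2, 4, 8, 16, 32, ... , so six iterations are more than enough for a double.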
There are trade-offs (constant time, perhaps?) and many differing applications ...
For example: Pipelined RISC DSP chips have fat (many parallel streams) "one FFT result per clock cycle" pipelines that are rock solid (no cache misses or jitter).
The setup takes a few cycles but once primed it's
acquired data -> ( pipeline ) -> processed data
every clock cycle (with a pipeline delay, of course).
In that domain hardware implementations are chosen to work well with vector calculations and with consistent capped timings.
( To be clear, I haven't looked into that specific linked algo, I'm just pointing out it's not an N.R.-only world )
sqrt is the one exception to this: the Newton iteration converges really well there, the polynomials aren't great, and the Newton-based approach also saves you from having to do range reduction.
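There's also a well-known division-free variant: iterate Newton on 1/sqrt(a) instead, where the update x' = x * (1.5 - 0.5 * a * x * x) needs only multiplies, then recover sqrt(a) = a * (1/sqrt(a)). The sketch below uses the famous 0x5f3759df initial-guess constant for floats; it's an illustration of the technique, not a claim about what any particular libm or chip does:

```c
#include <stdint.h>
#include <string.h>

/* Newton's Method on f(x) = 1/x^2 - a, whose root is 1/sqrt(a).
   The update uses only multiplication, so no divide is ever issued. */
static float rsqrt_newton(float a) {
    uint32_t bits;
    memcpy(&bits, &a, sizeof bits);
    bits = 0x5f3759df - (bits >> 1);      /* classic magic initial guess */
    float x;
    memcpy(&x, &bits, sizeof x);
    for (int i = 0; i < 3; i++)
        x = x * (1.5f - 0.5f * a * x * x);
    return x;                             /* sqrt(a) is then a * x */
}
```

The magic-constant guess is already good to within ~0.2%, so three multiply-only iterations reach roughly float precision.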
Square roots are implemented in hardware: https://www.felixcloutier.com/x86/sqrtsd