There are trade-offs (constant time, perhaps?) and many differing applications ...
For example: pipelined RISC DSP chips have fat (many parallel streams) "one FFT result per clock cycle" pipelines that are rock solid (no cache misses or jitter).
The setup takes a few cycles, but once primed it's
acquired data -> ( pipeline ) -> processed data
every clock cycle (with a pipeline delay, of course).
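
A toy sketch of the latency vs. throughput point, in plain Python rather than anything DSP-specific. The names `run_pipeline`, `depth` and `stage_fn` are made up for illustration, and the "pipeline" is just a fixed-length shift register: nothing comes out for the first `depth` cycles, then one result falls out on every cycle after that.

```python
from collections import deque

_EMPTY = object()  # sentinel marking a pipeline slot with no data in it


def run_pipeline(samples, depth, stage_fn=lambda x: x):
    """Simulate a fixed-depth pipeline that accepts one input per cycle.

    Returns a list of (cycle, result) pairs. The first result appears at
    cycle == depth (the pipeline delay); after that, one result per cycle.
    """
    # A full deque with maxlen drops its rightmost item on appendleft,
    # which models data shifting one stage forward each clock cycle.
    stages = deque([_EMPTY] * depth, maxlen=depth)
    outputs = []
    # Feed the real samples, then `depth` empty slots to flush the pipeline.
    feed = list(samples) + [_EMPTY] * depth
    for cycle, item in enumerate(feed):
        leaving = stages[-1]        # whatever sits in the last stage exits now
        stages.appendleft(item)     # new input enters, everything shifts along
        if leaving is not _EMPTY:
            outputs.append((cycle, stage_fn(leaving)))
    return outputs


if __name__ == "__main__":
    for cycle, result in run_pipeline([10, 20, 30, 40], depth=3):
        print(cycle, result)
```

With four samples and `depth=3` this prints results at cycles 3, 4, 5 and 6: a three-cycle delay up front, then a fixed one-result-per-cycle cadence with no variation, which is the "consistent capped timings" property.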
In that domain, hardware implementations are chosen to work well with vector calculations and to give consistent, capped timings.
( To be clear, I haven't looked into that specific linked algo, I'm just pointing out it's not an N.R.-only world )