The general rule for power consumption on CPUs is to do your work quickly and then get to sleep. Clock propagation is going to eat the bulk of your power. The mild difference between multiply and add in actual usage is down in the noise (orders of magnitude smaller). The bigger penalty in this case is the inter-iteration dependency, which, vectorized or not, runs the risk of holding up the whole show by stalling the pipeline.
As a performance rule on modern processors: put off using the result of a calculation for as long as you reasonably can (within a tight loop, that is; you don't want to spill out of cache).
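
To make the dependency point concrete, here is a minimal sketch in C. The function names `sum_serial` and `sum_split` are made up for illustration, and a plain sum stands in for whatever your loop actually computes. The first version feeds every result straight into the next iteration; the second keeps four independent running totals so no add has to wait on the one just before it.

```c
#include <stddef.h>

/* One accumulator: each add needs the result of the previous iteration,
   so the adds form a single serial dependency chain. */
double sum_serial(const double *x, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Four accumulators (an arbitrary choice for this sketch): the four chains
   do not depend on each other, so their adds can overlap in the pipeline. */
double sum_split(const double *x, size_t n) {
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        a0 += x[i];
    /* The partial results are only combined once, at the very end. */
    return (a0 + a1) + (a2 + a3);
}
```

Whether the split version is actually faster depends on your compiler, flags, and hardware, so measure it. Also note that it adds the floating-point values in a different order, so the rounding can differ slightly from the serial version.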
Have fun threading the needle!