Good analysis. I would amend: templates usually improve code efficiency, because the compiler can see through abstractions and generate (larger but) much faster code.
I think that's often true. One common example is the STL's std::sort versus the C standard library's qsort(): std::sort is often a big win because the datatype-specific comparison gets inlined. But there are quite a few cases where the object-code bloat from multiplying the code by the number of types it's instantiated for (versus using a single polymorphic/generic function) hurts your instruction cache more than enough to cancel out any optimization win.
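To make the trade-off concrete, here's a rough sketch of the two styles side by side. The function names (compare_ints, sort_with_qsort, sort_with_std_sort) are made up for illustration. Whether the comparison actually gets inlined, and how much code size grows, depends on your compiler and flags, so treat this as a picture of the mechanism rather than a benchmark.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort-style comparator (hypothetical name): qsort reaches it through a
// function pointer and passes the elements as untyped void pointers.
static int compare_ints(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);  // avoids the overflow risk of returning x - y
}

// Generic-function style: a single compiled copy of qsort serves every
// element type, at the cost of an indirect call per comparison.
std::vector<int> sort_with_qsort(std::vector<int> v) {
    std::qsort(v.data(), v.size(), sizeof(int), compare_ints);
    return v;
}

// Template style: std::sort is instantiated separately for each
// iterator/comparator combination, so the compiler can see the comparison
// and may inline it, but every instantiation adds its own object code.
std::vector<int> sort_with_std_sort(std::vector<int> v) {
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });
    return v;
}
```

Both functions give the same sorted result. The difference is in what gets compiled: sort a dozen different element types and the qsort version still has one sort routine plus a dozen small comparators, while the std::sort version can end up with a dozen full copies of the sorting code.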