If they are not on the critical path, it doesn't matter. There are no instruction-cache issues because the loop is tiny, and since it is tiny it will also fit in the µop cache (or even in the loop buffer), so decoding is not an issue either. The only real problem is a potential lack of vectorization, but a good vector ISA can in principle handle the bounds checking with masked loads and stores (the check is then no longer a predictable branch and might end up on the critical path, although that is not necessarily a big cost, or even a measurable one).
Forget about the second-order effects. The reason the extra instructions, to a first approximation, do not matter is that loops are typically limited by loop-carried dependencies.
Think about this: a machine with infinite execution units and memory bandwidth could in principle execute all iterations of a loop at the same time, in parallel.
Unless each loop iteration somehow depends on the result of the previous iteration. Then only the independent instructions within an iteration can execute in parallel, and the loop is bound by its latency chain (especially when that chain involves memory accesses). This is often the case. And because branch prediction breaks dependencies, a bounds check is typically not part of any dependency chain, so it is often free or nearly so. For highly optimized code the assumption of infinite resources is of course not warranted, and execution bandwidth, and possibly even memory bandwidth, need to be taken into consideration.