Which, with store forwarding, can be shockingly cheap. You may not actually be hitting L1, and if you are, you're probably not hitting it synchronously.
Sure, and so is calling a function every handful of cycles. That's a big part of why compilers inline.
Either you're context switching often enough that store forwarding helps, or you're not spending a lot of time context switching. Either way, I would expect that you aren't waiting on L1: you put the write into a queue and move on.
https://easyperf.net/blog/2018/03/09/Store-forwarding
and section 15.10 of https://www.agner.org/optimize/microarchitecture.pdf
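
If you want to poke at this yourself, here's a rough sketch of the kind of loop I mean: spill a value to a stack slot and reload it right away, the way a register save/restore pair would. The names (`spill_reload_loop`, `ns_per_iter`) are made up for illustration, and it assumes a POSIX system for `clock_gettime`. It's not a rigorous benchmark, and the iterations are independent of each other, so it says more about throughput than about forwarding latency.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: store to a slot, then immediately load it back.
   volatile forces the compiler to emit a real store and a real load
   on every iteration instead of keeping the value in a register. */
uint64_t spill_reload_loop(uint64_t iters) {
    volatile uint64_t slot = 0;
    uint64_t acc = 0;
    for (uint64_t i = 0; i < iters; i++) {
        slot = i;     /* store: queued in the store buffer */
        acc += slot;  /* load of the same address: a candidate for store forwarding */
    }
    return acc;       /* 0 + 1 + ... + (iters - 1) */
}

/* Hypothetical helper: wall-clock nanoseconds per iteration of the loop above.
   The sum is written to *sum_out so the work can't be optimized away. */
double ns_per_iter(uint64_t iters, uint64_t *sum_out) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *sum_out = spill_reload_loop(iters);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

Call `ns_per_iter` from a `main` with a large iteration count and print the result. If the per-iteration time comes out well under what an L1 round trip for every store and load would cost, that's consistent with the load being served from the store buffer, though the exact number will depend on your CPU and compiler flags.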