I suspect you'll find it's sorta unnecessary. Most modern machines have stream detection built into the caches as well, and a simple memcpy and the like won't actually roll the entire cache. They might roll a set, or simply stop filling after some number of cache lines. This is also why prefetch hints don't tend to help streaming operations.
That said, most modern machines also have stream-oriented instructions (SSE nontemporal stores, for example) that don't allocate on write. You have to be careful with them, though, because in workload after workload they have been shown to be slower. There are a couple of reasons, but the most important thing to remember is that microbenchmarks that copy the same piece of data over and over aren't really reflective of most workloads.
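For reference, here is roughly what a copy using SSE nontemporal stores looks like. Treat it as a minimal sketch rather than a tuned routine: copy_nontemporal is a made-up helper name, the 16-byte alignment check on the destination comes from the requirements of _mm_stream_si128, and the plain memcpy fallback for unaligned destinations or non-SSE2 builds is my own assumption about what a sensible default would be.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_loadu_si128, _mm_stream_si128 */
#endif

/* Hypothetical helper (not from any library): copy n bytes from src to dst,
 * using nontemporal stores when the destination is 16-byte aligned.
 * Otherwise, or when SSE2 is unavailable, it just calls memcpy. */
static void copy_nontemporal(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);

#if defined(__SSE2__)
    if (((uintptr_t)dst & 15u) == 0) {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t blocks = n / 16;

        for (size_t i = 0; i < blocks; i++) {
            /* Unaligned load is fine; only the streaming store needs alignment. */
            __m128i v = _mm_loadu_si128(s + i);
            /* Nontemporal store: hints the CPU not to allocate a cache line. */
            _mm_stream_si128(d + i, v);
        }

        /* Streaming stores are weakly ordered; fence before others read dst. */
        _mm_sfence();

        /* Copy the remaining tail (fewer than 16 bytes) the ordinary way. */
        size_t done = blocks * 16;
        memcpy((char *)dst + done, (const char *)src + done, n - done);
        return;
    }
#endif

    memcpy(dst, src, n);
}
```

If you try something like this, measure it against plain memcpy inside your real workload, not in a loop that copies the same buffer over and over, for the reasons above.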