Even though it's one of my favourites, modern high-performance hardware with branch prediction, out-of-order execution, and multiple issue generally makes this trick rather unnecessary. Combine that with the cleverer compilers we have these days and you get pretty much the same performance out of the simple copy.
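To make the comparison concrete, here is a minimal sketch that assumes the trick under discussion is a hand-unrolled byte copy in the style of Duff's device (the comment itself doesn't name it). The function names `copy_simple` and `copy_unrolled` are made up for illustration.

```c
#include <stddef.h>

/* The plain loop. A modern optimizing compiler is free to unroll or
   vectorize this on its own. */
void copy_simple(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Hand-unrolled 8x. The switch jumps into the middle of the loop body
   to handle the n % 8 leftover bytes on the first pass; every case
   deliberately falls through to the next. */
void copy_unrolled(char *dst, const char *src, size_t n)
{
    if (n == 0)
        return;

    size_t passes = (n + 7) / 8;
    switch (n % 8) {
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

Both functions produce the same bytes; whether the unrolled one is any faster depends on the compiler and CPU, so it is worth measuring rather than assuming.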
Even worse is that modern x86 (and probably other) CPUs also have instructions for 16-byte vector registers that can be used to copy or compare data much faster than 1 byte per cycle. Recent versions of glibc use some linker magic to pick the optimal code for strcmp, memcpy and friends based on the instruction set available on the CPU at runtime. Of course, the gcc and libstdc++ developers must not trust glibc, and will sometimes replace calls to these functions with their "optimized" builtin versions that use the lowest-common-denominator ISA. An easy way we got a 5% boost in throughput in MongoDB was to force them to call into glibc. https://github.com/mongodb/mongo/blob/master/pch.h#L47-56
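One way to force a real library call, sketched here as an illustration and not necessarily what the linked MongoDB header does: route the call through a volatile function pointer, so the compiler cannot substitute its inline builtin. The names `libc_strcmp` and `compare_via_libc` are hypothetical.

```c
#include <string.h>

/* Hypothetical helper: because the pointer is volatile, the compiler
   must load it and make an indirect call at every use, so it cannot
   expand strcmp inline. The call lands in whichever implementation
   the C library selected for this CPU. */
static int (*volatile libc_strcmp)(const char *, const char *) = strcmp;

int compare_via_libc(const char *a, const char *b)
{
    return libc_strcmp(a, b);
}
```

With GCC, compiling with `-fno-builtin-strcmp` is another way to get a similar effect without touching the source.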
Had a need for it recently, and found out it was not available in C#.