I heard someone use those instructions once as examples of something compilers could do better than humans writing assembly -- Apple's MPW C compilers for PowerPC were capable of peephole optimizations that would produce them where a human might not think of them. (At least, that was the argument.)
That depends on whether you mean a human who knows the instructions exist, or one who hasn't yet worked out how to use shifts to do integer mul/div by 2.
The proper argument was always that optimizing compilers generate better assembly than 90% of the people using them could generate, and in a fraction of the time.
However these things often get turned into stronger (or different) arguments as they pass from mouth to ear repeatedly.
Sometimes they change completely, as in "the plural of anecdote is data".
I wanted to write a memcpy() routine for a microcontroller. I wrote a naive version that copied from src to dst one byte at a time. You can find more efficient algorithms, which typically copy 32-bit words at a time.
The interesting thing is, when I turned on compiler optimisations and examined the assembly output (even though my knowledge of assembly is poor), I found it had made the same optimisations you would find in a more complex C implementation. The compiler evidently thought "I see what you're doing here" and substituted a better version.
So the moral of the story is: your compiler is likely to be able to figure out a lot.
Even ignoring the usual optimizations like SIMD and loop unrolling to find parallelism in a memcpy, compilers also have loop-idiom recognition: they can spot certain loop patterns and replace the whole loop with a call to the library memcpy if they deem it profitable (e.g. if you tell it N is likely to be large, it'll go for the library call).
There are other optimizations like this too: call C's printf without any extra arguments and the compiler will replace it with a call to puts, which skips the formatting code. You can see this in Compiler Explorer.
Quite often that doesn't end up very efficient, because without "restrict" the result has to be identical to a byte-by-byte copy for all possible overlaps of the two inputs.
Lots of memcpy() implementations are still more efficient than a dumb byte-by-byte copy. They'll copy the (unaligned) head and the tail in bytes, but the bulk of the data using whatever data type and method is fastest.